Q1. Construct a phylogenetic tree by hand from the data given below. 

The following distances are found among four sequences M, N, O, and P.




A. Draw all the possible rooted and unrooted trees for the four sequences M, N , O, and P. ANSWER: There are only three possible unrooted trees for four taxa. Note that branches can be rotated about nodes without affecting the hierarchy of the tree. The three trees are:
There are 15 possible rooted trees. The trees can be obtained from the three unrooted trees above by adding a root at any of the 5 edges in the unrooted tree, thus giving 15 trees.
B. Calculate the tree and branch lengths using the UPGMA method. 
ANSWER: The unweighted pair group method with arithmatic mean (UPGMA) method assumes a constant rate of evolution, which simplifies the branch length calculations. For example, suppose two taxa A and B are separated by 12 changes (a + b = 12), then by the UPGMA a = b = 12/2 = 6. The equivalent value indicates that the two taxa have undergone a constant rate of evolution since their divergence, such that at any given time, the two sequences should have the same number of changes separating them from the common ancestor.

For this example, we know that there are 12 changes between O and P, so o + p = 12 and o = p = 12/2 = 6.

Similarly, m + n = 15, n = m = 15/2 = 7.5

The only value remaining is q, which must be derived by first determining the average distance separating any taxon on one side of the root from any taxon on the other side. That is, by calculating the average distance for MO, MP, NO, and NP.

(MO + MP + NO + NP)/4 = (26 + 28 + 29 + 31)/4 = 28.5

Because a constant rate of evolution is assumed, a root can be placed directly at the central location of the joining branch, which will distribute half of the above average leaf-root-leaf distance of 28.5 (14.25 from the root to any leaf) equally between the two sides:

M           N      O          P      /\
    \         /            \        /        |
 7.5 \    / 7.5       6 \    / 6       |
         \/                    \/            |
        MN                OP          |   14.25
                \            /                |
           6.75 \      / 8.25           |
                      \/                      |
                    MNOP              \/


C. Calculate the tree and branch lengths using the Neighbor-joining method.
ANSWER:

            The following Dij are found among four sequences M, N, O, and P in first step.
                Dij = d_ij - (ri+rj), where ri = 1/(n-2) Sum_m (dim)

r_M
69/2
r_N
75/2
r_O
67/2
r_P
71/2
           

N
O
P
M
-57
-42
-42
N

-42
-42
O


-57
 
        Calculate edge lengths within new cluster k as follows: dik = 1/2(dij + ri - rj), where i and j are the
        clusters to be merged.
        Pick MN to merge, and let new node be k. Distance of N to k = 1/2(15 + (75 - 69)/2) = 9
                 then distance of M to k is 15-9 = 6
        Finally calculate distance of MN to O and P via formula: dkm = 1/2(dim+djm-dij)

        Now calculate new matrix:

O
P
MN
20
22
O

12

     New r values are:
r_MN
42
r_O
32
r_P
34

    New Dij values are:

O
P
MN

-16
O
-17
-21

Therefore we merge OP next, to get k. Calculating edge distances: dOk = 1/2(12+32-34) = 5 and
    dPk = 12-5 = 7
Finally distance from OP to MN is = 1/2(20+22-12) = 15

Thus the final tree is as follows:

 M           N      O          P
    \         /            \        /
    6 \    / 9          5 \    / 7
         \/                    \/
        MN                OP
                \            /
             7.5 \      / 7.5
                      \/
                    MNOP
Since distance between MN and OP is 15 we divide the length into equal parts to get 7.5 on each.
D. What assumptions are made by the above methods regarding mutation rates in the different branches of the evolutionary tree?
ANSWER: UPGMA assumes a constant mutation rate for all branches (or molecular clock), whereas the Neighbor-joining method allows for a variable rate of mutation from the ancestral state.
Q2. Draw the 3 alternative unrooted trees for the data whose nucleotides at a postion under consideration are T,T,C, and C. Label each of the internal nodes with the most likely label based on weighted (match=0, mismatch=1) and unweighted parsimony.
ANSWER:
While there are three possible unrooted trees, in this case two of them are the same. We get


 UNWEIGHTED                          WEIGHTED: (x,y) denotes cost of C and T
C     C     T      T                         C     C     T      T
|        |       |        |                          |        |       |        |
--------      --------                          --------      --------
   C              T                              (0,2) C         (2,0) T
     |               |                                   |               |
     -------------                                    -------------
         (CT)                                            (1,1) i.e.,C or T
1 mutation

C     T     C      T                        C     T     C      T
|        |       |        |                          |        |       |        |
--------      --------                          --------      --------
(CT)          (CT)                            (1,1) C      (1,1) C (since root is C)
     |               |                                   |               |
     -------------                                    -------------
           CT                                             (2,2) C or T (say C)
2 mutations

Q3. Given the following sequences: TCAA, TTGA, TTAG, CTAA, CCGT. Construct the most parsimonious tree (unweighted).
ANSWER:
                                            TTGA    TTAG    TCAA    CTAA    CCGT
                                                |                |                |            |                |
                                                |                |                |            |                |
                                                --------------                |            --------------
                                              TT(AG)(AG)              |        C(CT)(AG)(AT)
                                                         |                        |                    |
                                                         |                        |                    |
                                                         ---------------------                   |
                                                            T(CT)AA                          |
                                                                        |                              |
                                                                        |                              |
                                                                        --------------------------
                                                                               (CT),CT,AA

There are a total of 7 mutations


Q4. Compute the Likelihood of the following tree:
                                    r
                          t=1 /      \    t=2
                              /           \
                              i           CG                 
                           /    \    
            t=2      /          \  t=1
                    CT        AT

Use the maximum likelihood approach. You may assume that any one of the 4 bases (A,C,G,T) is equally likely to appear at any position and you may assume that any base is equally likely to be replaced by any other (including itself). For handling time, make the assumption that if P(x,y) is the probability of replacing base x with base y, i.e., then the probability of substitution over time t is given as P(x,y)t (i.e., P raised to power t).

ANSWER: P(x) = 1/4 and P(x,y) = 1/4 (since a given base can change into 4 other bases). Thus P(x,y)2 = 1/16.
L(T) = L(pos1) x L (pos2)
L (pos1)  = sum_r sum_i P(r) x P(r,i) x P(r,C)2 x P(i,C)2 x P(i,A)
Since all bases have same P(x,y) values, we get
L (pos1) = 16 (1/4 x 1/4 x 1/16 x 1/16 x 1/4)  = 1/1024
Likewise L (pos2) = 1/1024
Thus L(T) = 1/1024 x 1/1024 = 9.54 x 10-7

Q5. FASTA uses a lookup table as a rapid way to find common letters and words in the same order and of approximately the same separation in two sequences. Produce a lookup table for single amino acids (i.e., k-word size, k=1) in the following two protein sequences, and then explain how this information will be used to determine what the alignment should be.
query: ACNGTSCHQE
sequence: GCHCLSAGQD
ANSWER:

The following table gives the common offsets for each shared amino acid in sequence 1 and sequence 2. Note that when an amino acid appears multiple times, all combinations must be considered.

The lookup table indicates three common offsets with values 0, 3, and 5. Each common offset indicates a potential alignment, shown below:

    sequence 1: ACNGTSCHQE
C S Q << offset = 0
sequence 2: GCHCLSAGQD

sequence 1: ACNGTSCHQE---
G C << offset = -3
sequence 2: ---GCHCLSAGQD

sequence 1: ACNGTSCHQE-----
CH << offset = -5
sequence 2: -----GCHCLSAGQD

The next step in the FASTA algorithm is to locate the offsets that contain sequence positions that are separated by a maximum distance (32 for 1-mers, 16 for 2-mers). Therefore, all three possible offsets in this example would be included for further analysis, since each sequence identity is separated by less than 32 characters. The regions that have the highest density of identities are then identified (offset = 0 has the highest with three identities) and assigned a score (called INIT1) using a scoring matrix chosen by the user (e.g., BLOSUM62). Our example ends here; however, further analysis is done on longer sequences that may have multiple common offsets separated by gaps. Multiple INIT1 regions in the same sequence pair are summed and penalized for gaps to produce a new score (called INITN). Those sequences with the highest INITN scores are assigned "optimized" scores by performing a local Smith-Waterman local alignment in the INITN region. The OPT and INITN scores are used to rank the matches between the query sequence and the database sequence. Finally, as part of the output process, a full Smith-Waterman local alignment is performed on the query and database sequence and the score is reported.


Q6. For protein database searches, the BLASTP algorithm first makes a list of three-letter words in the query sequence and then scores these words for matches with themselves and with all other possible words using the BLOSUM62 scoring matrix. The 50 highest scoring matches are kept. Database sequences are then scanned for matches to these high-scoring words, and if such are found, then a local alignment is made with the query sequence by dynamic programming.
  1. Suppose that the three-letter word HFA is in the query sequence, what is the log odds score of a match of HFA with itself?
  2. Scan through the table and find the highest scoring match with H (say amino acid X, where X is not equal to H). What would be the score for HFA in our query sequence matching XFA in the database sequence?
  3. Scan again and find the worst match(es) with H (say amino acid Y). What is the score for a match of HFA with YFA?
  4. Repeat the last two questions for the second and third letters in HFA.
  5. How many possible matches are there with HFA? (BLASTP uses approximately the best 50.)
  6. How many words will be used in a search starting with a query sequence that is 300 amino acids long?
Answer:
  1. Matching HFA in a query sequence to itself will yield the highest possible score, 18 half-bits.
  2. The highest scoring match with H will always be H; the next highest scoring match is Y with a score of 2. If the remaining 2 positions are conserved, the three-letter word YFA will have a BL62 score of 12 half-bits.
  3. The lowest scoring match with H is -3, obtained from the 4 amino acids C, I, L, and V. The three-letter word (CILV)FA will have a BLOSUM62 score of 7 half-bits
  4. In Position 2, the highest-scoring match is Y with a score of 3, where HYA is 15 half-bits.

    In Position 2, the lowest-scoring match is P with a score of -4, where HPA is 8 half-bits.
    In Position 3, the highest-scoring match is S with a score of 1, where HFS is 15 half-bits.
    In Position 3, the lowest-scoring match is W with a score of -3, where YFW is 11 half-bits.

  5. Since there are 20 possible characters that can appear opposite each position in the three-letter word, there are 20 x 20 x 20 = 203 = 8000 possible matches to HFA. BLASTP used the top 50 scoring words.
  6. A sequence of 300 aa contains 298 consecutive three-letter words. Assuming BLASTP selects the top 50 matches for each position, there will 298 x 50 = 14,900 words to search for in each database sequence.


Q7. Perform the following exercise with PSI-BLAST. Note that NCBI keeps upating the BLAST server Web pages, so that the currently available server may not match the examples given here and in the text. An introduction to PSI-BLAST is available here.

PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a type of scoring matrix that can then be used for additional rounds of searches, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids in the sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence. (Caution: The iterations can lead to more sequences being added that do not share a region in common with the original query sequence, but share a totally different region in some of the added sequences; e.g., these new sequences are not true family members but foreigners.) The process will stop when no more sequences are found. The user can control the number of sequences to be included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid and preliminary search for members of a sequence family. The found sequences can then be multiply aligned by other better-defined methods.

We provide a protein sequence of a recently found DNA polymerase called iota that replicates past sites of DNA damage and makes mutations.

>gi|5739300|gb|AAD50424.1|AF151691_1 DNA polymerase iota [Mus musculus]
MEPSHARAAGSSRAVCSQGPPTQISSSRVIVHVDLDCFYAQVEMISNPELKDRPLGVQQKYLVVTCNYEA
RKLGVRKLMNVRDAKEKCPQLVLVNGEDLSRYREMSYKVTELLEEFSPAVERLGFDENFVDLTEMVEKRL
QQLPSEEVPSVTVFGHVYNNQSVNLHNIMHRRLVVGSQIAAEMREAMYNQLGLTGCAGVAPNKLLAKLVS
GVFKPNQQTVLLPESCQHLIHSLNHIKEIPGIGYKTAKRLEVLGINSVHDLQTFPIKTLEKELGIAIAQR
IQQLSFGEDKSPVTPSGPPQSFSEEDTFKKCSSEVEAKAKIEELLSSLLTRVCQDGRKPHTVRLVIRRYS
DKHCNRESRQCPIPSHVIQKLGTGNHDSMPPLIDILMKLFRNMVNVKMPFHLTLMSVCFCNLKALSSAKK
GPMDCYLTSLSTPAYTDKRAFKVKDTHTEDSHKEKEANWDCLPSRRIESTGTGESPLDATCFPKEKDTSD
LPLQALPEGVDQEVFKQLPADIQEEILSGKSRENLKGKGSLSCPLHASRGVLSFFSTKQMQASRLSPRDT
ALPSKRVSAASPCEPGTSGLSPRSTSHPSCGKDCSYYIDSQLKDEQTSQGPTESQGCQFSSTNPAVSGFH
SFPNLQTEQLFSTHRTVDSHKQTATASHQGLESHQGLESRELDSAEEKLPFPPDIDPQVFYELPEEVQKE
LMAEWERAGAARPSAHR

This is a mouse homolog of a yeast gene called RAD30. Submit the sequence to PSI-BLAST searching the nr (non redundant) Genpro database. Use the given (default) options of the program. Repeat the search for an additional iteration using the cutoff scores recommended by the program.

  1. How many matches were found above the cutoff score after the initial search?
  2. Using the Web links provided, identify some of the highest scoring sequences. What classes of organisms do the matched genes originate from? Is this sequence representative of a protein family found in just a few or many organisms?
  3. How many additional matches were found after the first iteration, and do most appear to be the same type of function; e.g., DNA repair or replication?
ANSWERS:
  1. The initial BLAST search located 103 matches above the cutoff score.
  2. Many classes of organisms are represented by the initial BLAST query, including mammals (humans, mice), invertebrates (Drosophila, C. elegans), plants (Arabidopsis), and several bacteria (E. coli, M. tuberculosis, N. meningitidis). The broad range of species with which this sequence shares similarity suggests that the function provided by this family of proteins is essential for life.
  3. Approximately 259 total matches above the threshold were located after the first iteration of PSI-BLAST. Most of the results appear to have functions related to DNA repair or replication.

Q8. Assuming energy of -1 for an HH pair (non-neighbor, within distance of 1):
                            P-----H-----H       H-----P
                                               |        |          |
                            H      H-----P       P        P
                             |        |                 |          |
                             P      P-----H-----P        P
                             |                                    |
                             P-----H-----H-----P-----P
    A. What is the energy of the above conformation?
    B. What is the conformation (in terms of UDLR) for this peptide?
    C. Can you give a different conformation for the same peptide that has higher energy?

ANSWER:
    A. Energy = -4
    B. Assuming H is the starting point: DDRRRRUUULDDLLURULL
    C. Actually I meant to say better energy, which would mean a lower value. Here is one possible
    conformation with lower value:
    P-----H
    |
    P-----H-----H-----P-----P-----P
                                                |
    P-----H-----H      H-----P-----P
                       |        |
             H-----P       P
              |                 |
             P-----H-----P
   Energy is -5

Q9. Write a perl script/C++ program to  parse PDB file 2igd, and from the 3D atomic coordinates of the alpha-carbon atoms, construct a distance matrix. Using a threshod of 7, construct a contact map and show it.
Label by hand the main secondary structural elements (show the alpha, beta and loop regions), along with the sequence positions where they begin and end.

contact map

Q10 (Bonus). Prove that if a tree is ultrametric, that is, it is additive and all the distances from the root to any leaf are equal, then the 3-point condition holds for any three leaves A,B, and C. The three point condition states that two of the three possible pairwise distances are equal and not less then the third. That is, one of the following must be true:
1) d(A,B) = d(A,C) and d(A,B) >= d(B,C), or
2) d(A,B) = d(B,C) and d(A,B) >= d(A,C), or
3) d(A,C) = d(B,C) and d(A,C) >= d(A,B)

ANSWER: Without loss of generality let's consider the following scenario among the three leaves:
                                    R
                                   /  \
                                 /       \
                               I            \
                             /  \              \
                          /        \              \
                         A        B            C

Case I: IR = 0 (or I=R), then it is clear that AB=AC=BC
Case II: IR > 0:
        by the ultrametric property it is true that AC = BC (since distance from root to any leaf is equal)
        so we only need to show that AB < AC
        by ultrametric property BI+IR = RC, and since IR > 0, we have BI < RC
        it follows that BI < RC + IR
        it follows that BI+AI < (RC+IR) + AI (adding AI to both sides)
        this implies that AB < AI+IR+RC = AC. qed.