Q1. Construct a phylogenetic tree by hand from the data given below.
The following distances are found among four sequences M, N, O, and P.
- Draw all the possible rooted and unrooted trees for the four sequences
M, N, O, and P.
- Calculate the tree and branch lengths using the UPGMA method.
- Calculate the tree and branch lengths using the Neighbor-joining method.
- What assumptions are made by the above methods regarding mutation
rates in the different branches of the evolutionary tree?
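As a sanity check for the first item, it helps to know how many trees to expect before drawing them. The sketch below uses the standard counting formulas for strictly bifurcating trees on n labeled leaves, (2n-5)!! unrooted and (2n-3)!! rooted topologies. The function names are illustrative only and are not part of the assignment.

```python
def double_factorial(k: int) -> int:
    """Return k * (k-2) * (k-4) * ... down to 1 (or 2); 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_leaves: int) -> int:
    """Unrooted bifurcating topologies on n labeled leaves (n >= 3)."""
    return double_factorial(2 * n_leaves - 5)


def num_rooted_trees(n_leaves: int) -> int:
    """Rooted bifurcating topologies on n labeled leaves (n >= 2)."""
    return double_factorial(2 * n_leaves - 3)


# Four sequences: M, N, O, P
print(num_unrooted_trees(4), num_rooted_trees(4))
```

Each rooted tree arises from placing a root on one of the branches of an unrooted tree, which is why the rooted count for four leaves is the unrooted count times the five branches of a quartet tree.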
Q2. Draw the 3 alternative unrooted trees for data whose nucleotides
at the position under consideration are T, T, C, and C. Label each of the
internal nodes with the most likely label based on weighted (match=0,
mismatch=1) and unweighted parsimony.
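One way to check hand-labeled trees for this question is a bottom-up Fitch pass with unit mismatch cost. The sketch below is a minimal version under assumptions: the four taxa are numbered 1 to 4 (carrying T, T, C, C in that order), and each unrooted quartet tree is written as two pairs of leaves joined by the single internal branch. The sets it returns are candidate labels for the two internal nodes, not a final assignment.

```python
def fitch_join(x: set, y: set):
    """Fitch rule: intersect if possible (cost 0), otherwise take the union (cost 1)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def score_quartet(states, pair1, pair2):
    """Candidate label sets for the two internal nodes and the total number of changes."""
    left, c1 = fitch_join({states[pair1[0]]}, {states[pair1[1]]})
    right, c2 = fitch_join({states[pair2[0]]}, {states[pair2[1]]})
    _, c3 = fitch_join(left, right)  # cost on the internal branch
    return left, right, c1 + c2 + c3


# Assumed numbering of the taxa; the question only gives the nucleotides.
states = {1: "T", 2: "T", 3: "C", 4: "C"}
topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

for pair1, pair2 in topologies:
    left, right, cost = score_quartet(states, pair1, pair2)
    print(pair1, pair2, sorted(left), sorted(right), cost)
```

When a candidate set contains two bases, either one gives the same total cost as long as both internal nodes are resolved consistently.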
Q3. Given the following sequences: TCAA, TTGA, TTAG, CTAA, CCGT. Construct
the most parsimonious tree (unweighted).
Q4. Compute the Likelihood of the following tree:
            r
     t=1  /   \  t=2
         /     \
        i       CG
 t=2  /   \  t=1
     /     \
   CT       AT
Use the maximum likelihood approach. You may assume that any one of the
4 bases (A,C,G,T) is equally likely to appear at any position, and that
any base is equally likely to be replaced by any other (including
itself). To handle time, assume that if P(x,y) is the probability of
replacing base x with base y, then the probability of substitution
over time t is P(x,y)^t (i.e., P raised to the power t).
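The computation described above can be sketched by summing over the unknown bases at r and i for each sequence position and multiplying the per-position results. This is only one reading of the question: it assumes the tree has root r with children i and the leaf CG, that i has children CT and AT, and that the branch lengths are assigned as in the constants below (all four names are hypothetical). It also takes P(x,y) = 1/4 for every pair and applies the power t entry by entry, as the question words it.

```python
from itertools import product

BASES = "ACGT"
PRIOR = 0.25  # each base equally likely at the root

# Assumed assignment of the branch lengths shown in the diagram.
T_ROOT_TO_I, T_ROOT_TO_LEAF, T_I_TO_LEFT, T_I_TO_RIGHT = 1, 2, 2, 1


def p_sub(x: str, y: str, t: int) -> float:
    """P(x, y)^t with P(x, y) = 1/4 for every pair, including x == y."""
    return 0.25 ** t


def site_likelihood(root_leaf: str, left_leaf: str, right_leaf: str) -> float:
    """Sum over all bases at the root r and the internal node i."""
    total = 0.0
    for r, i in product(BASES, repeat=2):
        total += (PRIOR
                  * p_sub(r, i, T_ROOT_TO_I)
                  * p_sub(r, root_leaf, T_ROOT_TO_LEAF)
                  * p_sub(i, left_leaf, T_I_TO_LEFT)
                  * p_sub(i, right_leaf, T_I_TO_RIGHT))
    return total


def tree_likelihood(root_seq: str, left_seq: str, right_seq: str) -> float:
    """Positions are treated as independent, so per-site likelihoods multiply."""
    likelihood = 1.0
    for a, b, c in zip(root_seq, left_seq, right_seq):
        likelihood *= site_likelihood(a, b, c)
    return likelihood


print(tree_likelihood("CG", "CT", "AT"))
```

Because every entry of P is the same under the stated assumptions, the result depends only on the sum of the branch lengths, not on which branch carries which length.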
Q5. FASTA uses a lookup table as a rapid way to find common letters and
words in the same order and of approximately the same separation in two sequences.
Produce a lookup table for single amino acids (i.e., k-word size, k=1) in
the following two protein sequences, and then explain how this information
will be used to determine what the alignment should be.
query: ACNGTSCHQE
sequence: GCHCLSAGQD
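A minimal sketch of the k=1 lookup table for these two sequences is shown below. It records the 1-based positions of each residue in both sequences and then tallies the offset (query position minus sequence position) for every shared residue. The variable names are illustrative; the idea that the most frequent offset marks the best diagonal follows the description of FASTA given in the question.

```python
from collections import Counter, defaultdict


def positions(seq: str) -> dict:
    """Map each residue to the list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        table[residue].append(pos)
    return dict(table)


query = "ACNGTSCHQE"
target = "GCHCLSAGQD"
q_table, t_table = positions(query), positions(target)

# Offset = query position - target position, for every shared residue.
offsets = Counter()
for residue in sorted(set(q_table) & set(t_table)):
    for q_pos in q_table[residue]:
        for t_pos in t_table[residue]:
            offsets[q_pos - t_pos] += 1

for residue in sorted(set(q_table) | set(t_table)):
    print(residue, q_table.get(residue, []), t_table.get(residue, []))
print(offsets.most_common())
```

Residues that share the same offset lie on the same diagonal of a dot plot, so a frequently occurring offset suggests how the two sequences should be shifted relative to each other.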
Q6. For protein database searches, the BLASTP algorithm first makes a
list of three-letter words in the query sequence and then scores these words
for matches with themselves and with all other possible words using the BLOSUM62
scoring matrix. The 50 highest scoring matches are kept. Database sequences
are then scanned for matches to these high-scoring words, and if such are
found, then a local alignment is made with the query sequence by dynamic
programming.
- Suppose that the three-letter word HFA is in the query sequence, what
is the log odds score of a match of HFA with itself?
- Scan through the table and find the highest scoring match with H (say
amino acid X, where X is not equal to H). What would be the score for HFA
in our query sequence matching XFA in the database sequence?
- Scan again and find the worst match(es) with H (say amino acid Y).
What is the score for a match of HFA with YFA?
- Repeat the last two questions for the second and third letters in
HFA.
- How many possible matches are there with HFA? (BLASTP uses approximately
the best 50.)
- How many words will be used in a search starting with a query sequence
that is 300 amino acids long?
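The counting parts of this question reduce to a few lines of arithmetic, sketched below. Only the three BLOSUM62 diagonal entries needed for HFA are hard-coded; the remaining scores should be read from the full matrix. The final figure assumes roughly 50 retained matches per query word, as stated in the question, and the function names are illustrative.

```python
ALPHABET_SIZE = 20   # standard amino acids
WORD_LEN = 3
KEPT_PER_WORD = 50   # approximate number of high-scoring matches kept

# Diagonal BLOSUM62 entries for the residues in HFA only.
BLOSUM62_DIAGONAL = {"H": 8, "F": 6, "A": 4}


def self_score(word: str) -> int:
    """Score of a word aligned with itself: sum of the diagonal entries."""
    return sum(BLOSUM62_DIAGONAL[aa] for aa in word)


def num_possible_words(k: int) -> int:
    """All k-letter words over the amino acid alphabet."""
    return ALPHABET_SIZE ** k


def num_query_words(query_len: int, k: int) -> int:
    """Overlapping k-letter words in a query of the given length."""
    return query_len - k + 1


words = num_query_words(300, WORD_LEN)
print(self_score("HFA"), num_possible_words(WORD_LEN), words, words * KEPT_PER_WORD)
```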
Q7. Perform the following exercise with PSI-BLAST. Note that NCBI keeps
updating the BLAST server Web pages, so that the currently available server
may not match the examples given here and in the text. An introduction to
PSI-BLAST is available here.
PSI-BLAST is a version of the BLAST algorithm that uses the results from
an initial search for similar protein sequences to construct a type of scoring
matrix that can then be used for additional rounds of searches, called iterations.
The variability found in each column of the scoring matrix allows additional
sequences that have different combinations of amino acids in the sequence
positions to be found. The algorithm provides a rapid but less precise search
than other methods because the scoring matrix produced is only approximate
and includes most of the original query sequence. (Caution: The iterations
can lead to more sequences being added that do not share a region in common
with the original query sequence, but share a totally different region in
some of the added sequences; e.g., these new sequences are not true family
members but foreigners.) The process will stop when no more sequences are
found. The user can control the number of sequences to be included at each
iteration or else use the score cutoff recommended by the program. The method
is often used to perform a rapid and preliminary search for members of a
sequence family. The found sequences can then be multiply aligned by other
better-defined methods.
We provide a protein sequence of a recently found DNA polymerase called
iota that replicates past sites of DNA damage and makes mutations.
>gi|5739300|gb|AAD50424.1|AF151691_1 DNA polymerase iota [Mus musculus]
MEPSHARAAGSSRAVCSQGPPTQISSSRVIVHVDLDCFYAQVEMISNPELKDRPLGVQQKYLVVTCNYEA
RKLGVRKLMNVRDAKEKCPQLVLVNGEDLSRYREMSYKVTELLEEFSPAVERLGFDENFVDLTEMVEKRL
QQLPSEEVPSVTVFGHVYNNQSVNLHNIMHRRLVVGSQIAAEMREAMYNQLGLTGCAGVAPNKLLAKLVS
GVFKPNQQTVLLPESCQHLIHSLNHIKEIPGIGYKTAKRLEVLGINSVHDLQTFPIKTLEKELGIAIAQR
IQQLSFGEDKSPVTPSGPPQSFSEEDTFKKCSSEVEAKAKIEELLSSLLTRVCQDGRKPHTVRLVIRRYS
DKHCNRESRQCPIPSHVIQKLGTGNHDSMPPLIDILMKLFRNMVNVKMPFHLTLMSVCFCNLKALSSAKK
GPMDCYLTSLSTPAYTDKRAFKVKDTHTEDSHKEKEANWDCLPSRRIESTGTGESPLDATCFPKEKDTSD
LPLQALPEGVDQEVFKQLPADIQEEILSGKSRENLKGKGSLSCPLHASRGVLSFFSTKQMQASRLSPRDT
ALPSKRVSAASPCEPGTSGLSPRSTSHPSCGKDCSYYIDSQLKDEQTSQGPTESQGCQFSSTNPAVSGFH
SFPNLQTEQLFSTHRTVDSHKQTATASHQGLESHQGLESRELDSAEEKLPFPPDIDPQVFYELPEEVQKE
LMAEWERAGAARPSAHR
This is a mouse homolog of a yeast gene called RAD30. Submit the sequence
to PSI-BLAST, searching the nr (non-redundant) Genpro database. Use the given
(default) options of the program. Repeat the search for an additional iteration
using the cutoff scores recommended by the program.
- How many matches were found above the cutoff score after the initial
search?
- Using the Web links provided, identify some of the highest scoring
sequences. What classes of organisms do the matched genes originate from?
Is this sequence representative of a protein family found in just a few or
many organisms?
- How many additional matches were found after the first iteration,
and do most appear to have the same type of function (e.g., DNA repair or
replication)?
Q8. Assuming energy of -1 for an HH pair (non-neighbor, within distance
of 1):
P-----H-----H
H-----P
|
| |
H
H-----P P
P
|
|
| |
P
P-----H-----P P
|
|
P-----H-----H-----P-----P
A. What is the energy of the above conformation?
B. What is the conformation (in terms of UDLR) for this
peptide?
C. Can you give a different conformation for the same
peptide that has higher energy?
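Energies in this lattice model can be checked mechanically once a conformation is written as a UDLR string. The sketch below assumes a 2D square lattice, that the first residue sits at the origin, and that U, D, L, R mean steps of +y, -y, -x, +x (the direction convention is an assumption). It scores -1 for every pair of H residues that are not chain neighbors but sit at lattice distance 1. The example peptide HPPH is hypothetical and is not the one drawn above.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(conformation: str):
    """Turn a UDLR string into lattice coordinates, starting at the origin."""
    x, y = 0, 0
    coords = [(x, y)]
    for step in conformation:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
        coords.append((x, y))
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    return coords


def hp_energy(sequence: str, conformation: str) -> int:
    """-1 for each H-H pair at lattice distance 1 that is not adjacent in the chain."""
    coords = fold(conformation)
    if len(coords) != len(sequence):
        raise ValueError("need exactly len(sequence) - 1 moves")
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbors
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    energy -= 1
    return energy


# Hypothetical four-residue peptide in two conformations.
print(hp_energy("HPPH", "RUL"), hp_energy("HPPH", "RRR"))
```

Comparing two conformations of the same peptide this way shows directly which one has the higher (less negative) energy.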
Q9. Write a Perl script/C++ program to parse PDB file 2igd, and from
the 3D atomic coordinates of the alpha-carbon atoms, construct a distance
matrix. Using a threshold of 7 Å, construct a contact map and show it.
Label by hand the main secondary structural elements (show the alpha, beta
and loop regions), along with the sequence positions where they begin and
end.
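The pipeline in this question (parse, distance matrix, threshold, display) can be prototyped in a few lines before writing the Perl or C++ version. The sketch below is in Python and relies on the fixed-column layout of PDB ATOM records (atom name in columns 13-16, x, y, z in columns 31-54). It is deliberately simplified: it does not handle alternate locations, multiple chains, or multiple models, and the file name in the usage note is an assumption.

```python
import math


def read_ca_coords(pdb_lines):
    """Collect (x, y, z) for every alpha-carbon ATOM record."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def distance_matrix(coords):
    """Pairwise Euclidean distances between alpha-carbons."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dist, threshold=7.0):
    """1 where two residues are within the threshold distance, else 0."""
    return [[1 if d <= threshold else 0 for d in row] for row in dist]


def show(cmap):
    """Print the contact map as a text grid: '#' for contact, '.' otherwise."""
    for row in cmap:
        print("".join("#" if cell else "." for cell in row))
```

Assuming the structure has been saved locally as `2igd.pdb`, the functions chain together as `show(contact_map(distance_matrix(read_ca_coords(open("2igd.pdb")))))`. In the resulting grid, helices tend to appear as a thickened band along the main diagonal and paired beta strands as stripes parallel or perpendicular to it.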
Q10 (Bonus). Prove that if a tree is ultrametric, that is, it is additive
and all the distances from the root to any leaf are equal, then the 3-point
condition holds for any three leaves A, B, and C. The 3-point condition
states that two of the three possible pairwise distances are equal and not
less than the third. That is, one of the following must be true:
1) d(A,B) = d(A,C) and d(A,B) >= d(B,C), or
2) d(A,B) = d(B,C) and d(A,B) >= d(A,C), or
3) d(A,C) = d(B,C) and d(A,C) >= d(A,B)
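Before attempting the proof, it can be useful to test the condition numerically on small examples. The sketch below checks whether the two largest of the three pairwise distances are equal, which is equivalent to the three cases listed above. The distance matrices are hypothetical and the function names are illustrative.

```python
from itertools import combinations


def three_point_holds(d_ab: float, d_ac: float, d_bc: float, tol: float = 1e-9) -> bool:
    """True if the two largest distances are equal (within tol)."""
    _, middle, largest = sorted((d_ab, d_ac, d_bc))
    return abs(largest - middle) <= tol


def all_triples_hold(dist, labels) -> bool:
    """Check the 3-point condition for every triple of leaves."""
    return all(
        three_point_holds(dist[a][b], dist[a][c], dist[b][c])
        for a, b, c in combinations(labels, 3)
    )


# Hypothetical distances: ((A,B),C) with every leaf at depth 3 from the root.
ultrametric = {
    "A": {"B": 2, "C": 6},
    "B": {"A": 2, "C": 6},
    "C": {"A": 6, "B": 6},
}
# Same tree shape, but with B's branch lengthened by 1.
not_ultrametric = {
    "A": {"B": 3, "C": 6},
    "B": {"A": 3, "C": 7},
    "C": {"A": 6, "B": 7},
}

print(all_triples_hold(ultrametric, "ABC"), all_triples_hold(not_ultrametric, "ABC"))
```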