HomeWork #2

Q1. Determine the alignment score for each of the following:

Q2. Calculate the log odds and odds scores of a sequence alignment using the BLOSUM scoring matrix approach.

In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the number of times this change is expected based on the number of occurrences of D and E in the column is 0.05.

Use the BLOSUM scores from above to calculate the log odds and odds score of a simple sequence alignment.


Q3. Assume that you are given the following block (local alignment) of 12 sequences:

WWYIR
WFYVR
WYYVR
WYFIR
WYYTR
WFYKR
WFYKR
WYYVR
WYYVR
WFYTR
WFYTR
WWYVR
Assume we cluster any sequence that occurs two or more times (i.e., cluster threshold 2/12 = 1/6). Compute the BLOSUM matrix from the clustered block. Express it in half bits.

ANSWER: after clustering we obtain the following sequences (along with cluster counts):
1- WWYIR
1- WFYVR
3- WYYVR
1- WYFIR
1- WYYTR
2- WFYKR
2- WFYTR
1- WWYVR
This is the same as 100% similarity as the clustering threshold. Since all sequences in a gropu are identical, the net effect is to treat each group as a single sequence.

We first compute the probability of single residues:
p(W) = 10/40, p(Y) = 10/40, p(I) = 2/40, p(R) = 8/40, p(F) =4/40, p(V) = 3/40, p(T) = 2/40, p(K) = 1/40

Now compute the pairwise probabilities:

First let's count the number of pairs from each column:
col 1: WW = 28
Col2, WW = 1, WY = 6, WF = 6, YY = 3, YF = 9, FF = 3
Col 3: YY = 21, YF = 7
Col 4: II = 1, IV = 6, IT = 4, IK = 2, VV = 3, VT = 6, VK = 3, TT = 1, TK = 2
Col 5: RR = 28

The raw counts are (out of a total of 140)

W
Y
I
R
F
V
T
K
W
29
6
0
0
0
0
0
0
Y

24
0
0
16
0
0
0
I


1
0
0
6
4
2
R



28
0
0
0
0
F




3
0
0
0
V





3
6
3
T






1
2
K







0

 Since there are 0 counts and since we are computing log odds ratio, we must add pseduo counts. Let's use the laplace rule and add a 1 to each piar. The new counts are as follows (out of a total of 140+36 = 176):

W
Y
I
R
F
V
T
K
W
30
7
1
1
1
1
1
1
Y

25
1
1
17
1
1
1
I


2
1
1
7
5
3
R



29
1
1
1
1
F




4
1
1
1
V





4
7
4
T






2
3
K







1


Compute random pairwise probabilities and take the log odds ratios (LOR):
for example for WW, random probability is 10/40*10/40 = 1/16
LOR(WW) = log(30/176 / 1/16) = log(2.727) = 1.45 bits = 2x1.45 =  2.9 half bits

On the other hand, random probability of WY = 2 x P(W) x P(Y) = 2*10/40*10/40 = 1/8
LOR(WY) = log (7/176  /  1/8) = log (0.3182) = -1.65 bits = -3.3 half bits

Repeat the same exercise for all pairs, and then round to closest integer to get for example,
BLOSUM (WW) = 3 and BLOSUM(WY) = -3, etc.

Q4. Compare the alignment scores obtained with small and large gap penalties in the following example.
For this question, use the program LALIGN on the University of Virginia FASTA server. This program aligns sequences by a local dynamic programming algorithm and includes end gap penalties. LALIGN produces as many different alignments as specified, with no two alignments including a match of the same two sequence positions.

Two sequences are provided in FASTA format: RECA from the bacterium E. coli and RAD51 from yeast. These proteins have the same function; i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.


Q5. Assume that you are given the following two groups of aligned sequences. Use a global pairwise dynamic programming method to align these two groups using the sum of pairs scoring method (use match=1, mismatch=0, gap = -1):

Group 1: AC--TCG
         ACAGTAG
Group 2: AGACGTG
         --ACGT-

ANSWER:


-
-
A
A
C
C
-
A
-
G
T
T
C
A
G
G
-
-
0
-3
-6
-9
-12
-15
-19
-22
A
-
-3
0
-3
-6
-9
-12
-16
-19
G
-
-6
-3
-2
-5
-8
-11
-15
-17
A
A
-9
0
-1
-2
-5
-6
-8
-11
C
C
-12
-3
6
3
0
-3
-3
-6
G
G
-15
-6
3
4
3
2
-2
3
T
T
-18
-9
0
1
2
9
5
1
G
-
-21
-12
-3
-2
-1
6
6
5

One possible alignment is as follows: \-\\-\\|| (\ is an alignment, | is gap in sequence 1, - is gap in sequence 2)

	--AC--TCG
--ACAGTAG
AGAC-GT-G
        --AC-GT--

Total Score = 5

Q6. Assume we are using the tunneling method to search only within a specified region for a multiple sequence alignment. Let there be three sequences of lengths 3, 4 and 5. Assuming a tunnel of width 2 around the main diagonal, use the projection approach to calculate if the cell (1,3,4) is within the tunnel. Show all calculation.

ANSWER:

Given v= (1, 3, 4) and w = (3,4,5). Compute projection of v on w to get
v' = (v.w/w.w) w = (3+12+20/9+16+25) w = 35/50 (3,4,5) = (2.1, 2.8, 3.5)

now d = v-v' = (1,3,4) - (2.1,2.8,3.5) = (-1.1, 0.2, 0.5)
then |d| = sqrt (v'.v') = sqrt (1.21+0.04+0.25) = sqrt (1.5) = 1.225

Thus the cell is within the tunnel.

Q7. Using the CLUSTALW program, align the provided set of proteins in the RAD51-RECA group. These proteins
promote homologous DNA strand interactions during genetic recombination between DNA molecules.
CLUSTALW is available for PCs and also on a Web site at EBI. Copy and paste the sequence file into the CLUSTALW data window (sequence is in FASTA format). Just use the default conditions provided by the program.

Note the two kinds of multiple sequence alignment output formats. One is the ALN format with numbers, and the
second is the FASTA format with the aligned sequences joined end to end in FASTA format, with gaps in each sequence corresponding to the alignmnent.

 Answer:

>RAD51
MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG
SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR
ESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMG-FVTAAD
FHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLA
VTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAY
ARAYNADHQLRLLDAAAQMMSESR-----FSLIVVDSVMALYRTDFSGR-
GELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFN---PD
PKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYED
GVGDPREEDE-----
>DMC1_YEAST
--------------------------------------------------
--------------MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLK
SGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVG-FIPATV
QLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLC
VTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSY
ARALNSEHQMELVEQLGEELSSGD-----YRLIVVDSIMANFRVDYCGR-
GELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFAS--AD
GRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEK
GITDSSD--------
>RECA_ECOLI
-----------------------------MAIDENKQKALAAALGQIEKQ
FGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGRIVEIYGPESS
GKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDN-LLCSQP
DTGEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAAR
MMSQAMRKLAGNLKQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYA
SVRLDIRRIGAVKEGENVVGSETR-----VKVVKNKIAAPFKQAEFQILY
GEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKANATAWLKDNPE
TAKEIEKKVRELLLS-------NPNSTPDFSVDDSEGVAETNEDF-----
---------------
>RECA_STRVL
------------------------------MAGTDREKALDAALAQIERQ
FGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESS
GKTTLTLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDN-LILSQP
DNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQAR
LMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYA
SVRLDIRRIETLKDGTDAVGNRTR-----VKVVKNKVAPPFKQAEFDILY
GQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPD
LADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPAS
KTAKATKATAVKS--
>RADA_ARCFU
--------------------------------------------------
---------------------MSEESNEETKIIELEDIPGVGPETARKLR
EAGYSTIEAVAVASPSELANVGGITEGNAVKIIQAARKLANIGGFESGDK
VLERRRSVKKITTGSKDLDELLGGGVETQAITEFFGEFGSGKTQICHQLA
VNVQLPEDEGGLEGSVIIIDTENTFRPERIIQMAEAKGLDGNEVLKNIYV
AQAYNSNHQMLLVDNAKELAEKLKKEGRPVRLIIVDSLMSHFRAEYVGR-
GTLADRQQKLNRHLHDLMKFGELYNAAIVVTNQVMAR--PDVLFG----D
PTKPVGGHIVAHTATFRIYLKKGKDDLRIARLIDSPHLPEGEAIFRVTER
GIEDAEEKDKKKRKK


Q8. When a multiple sequence alignment can be made, then we can pick out the most conserved regions (motifs), make a scoring matrix, and search for other sequences that have this same motif. The matrix will take into account the variation found in the sequences. We will make a position-specific scoring matrix (also called a PSSM or weight matrix) to a part of a given multiple sequence alignment and using the matrix to scan a sequence.

Here is a table showing the frequency of each base in an alignment that is 4 bases long.

Q9. Analyze the following 10 DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle 3 positions are a motif. The size of the motif is an informed guess based on a molecular model, and the alignment of the sequences is also an informed guess.
sequence 1   C CAG A
sequence 2   G TTA A
sequence 3   G TAC C
sequence 4   T TAT T
sequence 5   C AGA T
sequence 6   T TTT G
sequence 7   A TAC T
sequence 8   C TAT G
sequence 9   A GCT C
sequence 10  G TAG A
Q10. MEME is a server that will take as input a set of sequences and find alignment by the expectation maximization method. Paste the same unaligned RECA-RAD51 sequences (NOT the aligned ones) into the window and use the defaults provided by the program. Try with other options like "optimal" number of motifs, etc. Summarize your output. List all motifs (do not cut an paste all the output, just the relevant parts).

Q11. Analyze the following 10 DNA sequences for a conserved pattern by the Gibbs sampling algorithm.

sequence 1   C CAG A
sequence 2   G TTA A
sequence 3   G TAC C
sequence 4   T TAT T
sequence 5   C AGA T
sequence 6   T TTT G
sequence 7   A TAC T
sequence 8   C TAT G
sequence 9   A GCT C
sequence 10  G TAG A
Q12. Consider the hidden Markov Model (HMM) below:
Red square, match state; green diamond, insert state; blue circle, delete state. Arrows indicate probability of going from one state to the next.