Q1. Determine the alignment score for each of the following:
--TCTGTACGCGATCATGT
TAGC-GTCCGATAT-A---
AGATAGAAACTGATATATA
AG---AAAACAGAGT----
AGATAGAAACTGATATATA
AG---AAAACAGAGT----
- ANSWER (AGATAGAAACTGATATATA / AG---AAAACAGAGT----): end gaps not counted; 8 matches, 4 mismatches, 1 gap opening, 2 gap extensions. Score: 0
Q2. Calculate the log odds and odds scores of a sequence alignment using the BLOSUM scoring matrix approach.
In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the number of times this change is expected based on the number of occurrences of D and E in the column is 0.05.
DEDEDEDE
DDDDDDDD
- Odds = (frequency of matches/chance of getting a random match) = (0.1/0.05) = 2.
- Log odds = log2(odds score) = log2(2) = 1 bit.
- To enter the log odds score into the BLOSUM62 matrix, it would first need to be converted to half-bit units. Half-bit units are obtained by taking 2 [log2(odds score)] = 2 (log odds score). BLOSUM entry = 2 (log odds score) = 2 (1) = 2 half-bits.
- Odds = (observed frequency of D-D matches/expected frequency of D-D matches) = 0.8/0.1 = 8.
Log odds = log2(odds score) = log2(8) = 3 bits.
BLOSUM entry = 2 (log odds score) = 2 (3) = 6 half-bits.
- Part II.
- D-D match is scored as 3 bits.
D-E mismatch is scored as 1 bit.
(4 matches x 3 bits) + (4 mismatches x 1 bit) = 16 bits.
- Odds score = 2^(log odds score) = 2^16 = 65536.
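The arithmetic above can be checked with a short Python sketch. The function name `log_odds` and the variable names are illustrative choices, not part of the original answer; the frequencies are the ones given in the question, and half-bit entries are assumed to be rounded to the nearest integer.

```python
import math

def log_odds(observed, expected):
    """Return (odds, log odds in bits, log odds in half-bits rounded to an integer)."""
    odds = observed / expected
    bits = math.log2(odds)
    return odds, bits, round(2 * bits)

# D-E change: observed 0.10, expected 0.05 (Part I)
de_odds, de_bits, de_half_bits = log_odds(0.10, 0.05)
# D-D match: observed 0.8, expected 0.1
dd_odds, dd_bits, dd_half_bits = log_odds(0.8, 0.1)

# Part II: score the alignment column by column, in bits
top, bottom = "DEDEDEDE", "DDDDDDDD"
total_bits = sum(dd_bits if a == b else de_bits for a, b in zip(top, bottom))
alignment_odds = 2 ** total_bits

print(de_half_bits, dd_half_bits, round(total_bits, 2), round(alignment_odds))
```

Summing log odds in bits and then exponentiating is equivalent to multiplying the per-column odds scores, which is why the alignment odds come out as a power of 2.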
Q3. Assume that you are given the following block (local alignment) of
12 sequences:
Assume we cluster any sequence that occurs two or more times (i.e., cluster threshold 2/12 = 1/6). Compute the BLOSUM matrix from the clustered block. Express it in half bits.
WWYIR
WFYVR
WYYVR
WYFIR
WYYTR
WFYKR
WFYKR
WYYVR
WYYVR
WFYTR
WFYTR
WWYVR
This is the same as using 100% similarity as the clustering threshold. Since all sequences in a group are identical, the net effect is to treat each group as a single sequence.
1- WWYIR
1- WFYVR
3- WYYVR
1- WYFIR
1- WYYTR
2- WFYKR
2- WFYTR
1- WWYVR
Observed pair counts from the clustered block:

| | W | Y | I | R | F | V | T | K |
|---|---|---|---|---|---|---|---|---|
| W | 29 | 6 | 0 | 0 | 6 | 0 | 0 | 0 |
| Y | | 24 | 0 | 0 | 16 | 0 | 0 | 0 |
| I | | | 1 | 0 | 0 | 6 | 4 | 2 |
| R | | | | 28 | 0 | 0 | 0 | 0 |
| F | | | | | 3 | 0 | 0 | 0 |
| V | | | | | | 3 | 6 | 3 |
| T | | | | | | | 1 | 2 |
| K | | | | | | | | 0 |
Pair counts after adding 1 to every entry:

| | W | Y | I | R | F | V | T | K |
|---|---|---|---|---|---|---|---|---|
| W | 30 | 7 | 1 | 1 | 7 | 1 | 1 | 1 |
| Y | | 25 | 1 | 1 | 17 | 1 | 1 | 1 |
| I | | | 2 | 1 | 1 | 7 | 5 | 3 |
| R | | | | 29 | 1 | 1 | 1 | 1 |
| F | | | | | 4 | 1 | 1 | 1 |
| V | | | | | | 4 | 7 | 4 |
| T | | | | | | | 2 | 3 |
| K | | | | | | | | 1 |
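The pair counts for the clustered block can be tallied with a short Python sketch. It assumes each of the eight clusters contributes one representative sequence with weight 1, as described above; the names `block`, `pair_counts` and `count` are illustrative.

```python
from collections import Counter
from itertools import combinations

# One representative per cluster (8 sequences after clustering)
block = ["WWYIR", "WFYVR", "WYYVR", "WYFIR",
         "WYYTR", "WFYKR", "WFYTR", "WWYVR"]

def pair_counts(seqs):
    """Count unordered residue pairs, column by column, over all sequence pairs."""
    counts = Counter()
    for column in zip(*seqs):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

counts = pair_counts(block)

def count(a, b):
    """Look up a pair regardless of the order of the two residues."""
    return counts[tuple(sorted((a, b)))]

total_pairs = sum(counts.values())
print(count("W", "W"), count("Y", "F"), total_pairs)
```

With 8 sequences and 5 columns there are 5 x C(8,2) pairs in total, which gives a quick consistency check on any hand-built count table.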
Q4. Compare the alignment scores obtained with small and large gap penalties
in the following example.
For this question, use the program LALIGN on the University of Virginia
FASTA server.
This program aligns sequences by a local dynamic programming algorithm and
includes end gap penalties. LALIGN produces as many different alignments
as specified, with no two alignments including a match of the same two sequence
positions.
Two sequences are provided in FASTA format: RECA from the bacterium E. coli and RAD51 from yeast. These proteins have the same function; i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.
Q5. Assume that you are given the following two groups of aligned sequences.
Use a global pairwise dynamic programming method to align these two groups
using the sum of pairs scoring method (use match=1, mismatch=0, gap = -1):
ANSWER:
Group 1:
AC--TCG
ACAGTAG
Group 2:
AGACGTG
--ACGT-
Dynamic programming matrix (columns: alignment columns of Group 1; rows: alignment columns of Group 2):

| | - - | A A | C C | - A | - G | T T | C A | G G |
|---|---|---|---|---|---|---|---|---|
| - - | 0 | -3 | -6 | -9 | -12 | -15 | -19 | -22 |
| A - | -3 | 0 | -3 | -6 | -9 | -12 | -16 | -19 |
| G - | -6 | -3 | -2 | -5 | -8 | -11 | -15 | -17 |
| A A | -9 | 0 | -1 | -2 | -5 | -6 | -8 | -11 |
| C C | -12 | -3 | 6 | 3 | 0 | -3 | -3 | -6 |
| G G | -15 | -6 | 3 | 4 | 3 | 2 | -2 | 3 |
| T T | -18 | -9 | 0 | 1 | 2 | 9 | 5 | 1 |
| G - | -21 | -12 | -3 | -2 | -1 | 6 | 6 | 5 |
One possible alignment is as follows, written in traceback order from the last column back to the first: \-\\-\\|| (\ is an aligned column pair, |
is a gap in group 1, - is a gap in group 2)
--AC--TCG
--ACAGTAG
AGAC-GT-G
--AC-GT--
Total Score = 5
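The recurrence behind this kind of group-to-group alignment can be sketched in Python. This is a minimal sketch under stated assumptions, not necessarily the exact convention used to fill the matrix by hand: every pair of rows in a merged column is scored (match = 1, mismatch = 0, residue against gap = -1), and a gap paired with a gap is assumed to score 0. The function names are illustrative.

```python
from itertools import combinations

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0          # assumption: gap-gap pairs cost nothing
    if a == "-" or b == "-":
        return -1         # residue against gap
    return 1 if a == b else 0

def column_score(column):
    """Sum-of-pairs score over all rows of one merged alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def align_groups(group1, group2):
    """Fill the global DP matrix; rows follow group2 columns, columns follow group1."""
    cols1 = list(zip(*group1))
    cols2 = list(zip(*group2))
    gap1 = ("-",) * len(group1)
    gap2 = ("-",) * len(group2)
    n, m = len(cols2), len(cols1)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] + column_score(cols1[j - 1] + gap2)
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] + column_score(gap1 + cols2[i - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + column_score(cols1[j - 1] + cols2[i - 1]),
                H[i][j - 1] + column_score(cols1[j - 1] + gap2),
                H[i - 1][j] + column_score(gap1 + cols2[i - 1]),
            )
    return H

H = align_groups(["AC--TCG", "ACAGTAG"], ["AGACGTG", "--ACGT-"])
final_score = H[-1][-1]
print(final_score)
```

Treating each group as a sequence of fixed columns keeps the gaps already present inside a group intact; only whole gap columns are inserted during the alignment.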
Q6. Assume we are using the tunneling method to search only within a
specified region for a multiple sequence alignment. Let there be three sequences
of lengths 3, 4 and 5. Assuming a tunnel of width 2 around the main diagonal,
use the projection approach to calculate if the cell (1,3,4) is within the
tunnel. Show all calculations.
ANSWER:
Given v = (1, 3, 4) and w = (3, 4, 5). Compute the projection of v on w to get
v' = (v.w/w.w) w = ((3+12+20)/(9+16+25)) w = (35/50) (3,4,5) = (2.1, 2.8, 3.5)
now d = v - v' = (1,3,4) - (2.1,2.8,3.5) = (-1.1, 0.2, 0.5)
then |d| = sqrt (d.d) = sqrt (1.21+0.04+0.25) = sqrt (1.5) = 1.225
Since |d| = 1.225 is less than the tunnel width of 2, the cell is within the tunnel.
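The projection test can be written as a small Python sketch. The function name `distance_to_diagonal` is illustrative, and comparing the distance directly against the stated width of 2 is an assumption about how the tunnel is defined.

```python
import math

def distance_to_diagonal(cell, lengths):
    """Project cell onto the main diagonal (direction = lengths) and
    return the projection, the offset vector, and the offset length."""
    dot_vw = sum(v * w for v, w in zip(cell, lengths))
    dot_ww = sum(w * w for w in lengths)
    scale = dot_vw / dot_ww
    projection = [scale * w for w in lengths]
    offset = [v - p for v, p in zip(cell, projection)]
    return projection, offset, math.sqrt(sum(x * x for x in offset))

projection, offset, distance = distance_to_diagonal((1, 3, 4), (3, 4, 5))
tunnel_width = 2  # taken from the question
in_tunnel = distance <= tunnel_width
print([round(p, 2) for p in projection], round(distance, 3), in_tunnel)
```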
Q7. Using the CLUSTALW program, align the provided set of proteins in the
RAD51-RECA group. These proteins
promote homologous DNA strand interactions during genetic recombination
between DNA molecules.
CLUSTALW is available for PCs and also on a Web site at EBI. Copy and paste the sequence file into the CLUSTALW
data window (sequence is in FASTA format). Just use the default conditions
provided by the program.
Note the two kinds of multiple sequence alignment output formats. One is
the ALN format with numbers, and the
second is the FASTA format with the aligned sequences joined end to end
in FASTA format, with gaps in each sequence corresponding to the alignment.
Answer:
>RAD51
MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG
SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR
ESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMG-FVTAAD
FHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLA
VTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAY
ARAYNADHQLRLLDAAAQMMSESR-----FSLIVVDSVMALYRTDFSGR-
GELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFN---PD
PKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYED
GVGDPREEDE-----
>DMC1_YEAST
--------------------------------------------------
--------------MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLK
SGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVG-FIPATV
QLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLC
VTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSY
ARALNSEHQMELVEQLGEELSSGD-----YRLIVVDSIMANFRVDYCGR-
GELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFAS--AD
GRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEK
GITDSSD--------
>RECA_ECOLI
-----------------------------MAIDENKQKALAAALGQIEKQ
FGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGRIVEIYGPESS
GKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDN-LLCSQP
DTGEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAAR
MMSQAMRKLAGNLKQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYA
SVRLDIRRIGAVKEGENVVGSETR-----VKVVKNKIAAPFKQAEFQILY
GEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKANATAWLKDNPE
TAKEIEKKVRELLLS-------NPNSTPDFSVDDSEGVAETNEDF-----
---------------
>RECA_STRVL
------------------------------MAGTDREKALDAALAQIERQ
FGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESS
GKTTLTLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDN-LILSQP
DNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQAR
LMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYA
SVRLDIRRIETLKDGTDAVGNRTR-----VKVVKNKVAPPFKQAEFDILY
GQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPD
LADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPAS
KTAKATKATAVKS--
>RADA_ARCFU
--------------------------------------------------
---------------------MSEESNEETKIIELEDIPGVGPETARKLR
EAGYSTIEAVAVASPSELANVGGITEGNAVKIIQAARKLANIGGFESGDK
VLERRRSVKKITTGSKDLDELLGGGVETQAITEFFGEFGSGKTQICHQLA
VNVQLPEDEGGLEGSVIIIDTENTFRPERIIQMAEAKGLDGNEVLKNIYV
AQAYNSNHQMLLVDNAKELAEKLKKEGRPVRLIIVDSLMSHFRAEYVGR-
GTLADRQQKLNRHLHDLMKFGELYNAAIVVTNQVMAR--PDVLFG----D
PTKPVGGHIVAHTATFRIYLKKGKDDLRIARLIDSPHLPEGEAIFRVTER
GIEDAEEKDKKKRKK
Q8. When a multiple sequence alignment can be made, we can pick out the most conserved regions (motifs), make a scoring matrix, and search for other sequences that have the same motif. The matrix takes into account the variation found in the sequences. We will make a position-specific scoring matrix (also called a PSSM or weight matrix) for part of a given multiple sequence alignment and use the matrix to scan a sequence.
Here is a table showing the frequency of each base in an alignment that is 4 bases long.
log odds score = log2(odds score) = log2(2.4) = 1.26 bits
The complete table is shown below:
site 1 = TGAG -0.32 + 1.49 - 0.32 - 1.32 = -0.47
site 2 = GAGC -1.32 - 1.32 - 1.32 - 1.32 = -5.28
site 3 = AGCT 1.26 + 1.49 + 1.26 + 1.49 = 5.50
site 4 = GCTA -1.32 - 1.32 - 1.32 - 1.32 = -5.28
site 5 = CTAA -1.32 - 1.32 - 0.32 - 1.32 = -4.28
site 1 = TGAG odds score = 0.721; probability = 0.016
site 2 = GAGC odds score = 0.026; probability = 0.001
site 3 = AGCT odds score = 45.255; probability = 0.982
site 4 = GCTA odds score = 0.026; probability = 0.001
site 5 = CTAA odds score = 0.051; probability = 0.001
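The conversion from log odds scores to odds scores and site probabilities can be sketched in Python. The log odds values are copied from the site scores listed above; the variable names are illustrative, and small differences in the last digit are expected because the listed scores are already rounded.

```python
# (site, log odds score in bits), as listed above
site_scores = [
    ("TGAG", -0.47),
    ("GAGC", -5.28),
    ("AGCT", 5.50),
    ("GCTA", -5.28),
    ("CTAA", -4.28),
]

# Odds score = 2 ** (log odds in bits)
odds = [(site, 2 ** bits) for site, bits in site_scores]
total = sum(value for _, value in odds)

# Probability of each site = its odds score divided by the sum over all sites
probabilities = [(site, value / total) for site, value in odds]

for (site, value), (_, prob) in zip(odds, probabilities):
    print(site, round(value, 3), round(prob, 3))
```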
sequence 1 C CAG A
sequence 2 G TTA A
sequence 3 G TAC C
sequence 4 T TAT T
sequence 5 C AGA T
sequence 6 T TTT G
sequence 7 A TAC T
sequence 8 C TAT G
sequence 9 A GCT C
sequence 10 G TAG A
site 1 = CAGXX = 0.1 x 0.6 x 0.2 x 0.25 x 0.25 = 0.000750
site 2 = XAGAX = 0.25 x 0.1 x 0.1 x 0.2 x 0.25 = 0.000125
site 3 = XXGAT = 0.25 x 0.25 x 0.1 x 0.6 x 0.4 = 0.001500
site 1 = CAGXX = 0.000750/0.002375 = 0.315
site 2 = XAGAX = 0.000125/0.002375 = 0.053
site 3 = XXGAT = 0.001500/0.002375 = 0.632
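The site probabilities above follow from multiplying the per-position probabilities and normalizing over the candidate sites. A minimal Python sketch, using the factors exactly as listed (the dictionary names are illustrative):

```python
import math

# Per-position probabilities for each candidate site, copied from the products above
site_factors = {
    "CAGXX": [0.1, 0.6, 0.2, 0.25, 0.25],
    "XAGAX": [0.25, 0.1, 0.1, 0.2, 0.25],
    "XXGAT": [0.25, 0.25, 0.1, 0.6, 0.4],
}

raw = {site: math.prod(factors) for site, factors in site_factors.items()}
total = sum(raw.values())
normalized = {site: value / total for site, value in raw.items()}

for site in site_factors:
    print(site, round(raw[site], 6), round(normalized[site], 3))
```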
Q11. Analyze the following 10 DNA sequences for a conserved pattern by the Gibbs sampling algorithm.
sequence 1 C CAG A
sequence 2 G TTA A
sequence 3 G TAC C
sequence 4 T TAT T
sequence 5 C AGA T
sequence 6 T TTT G
sequence 7 A TAC T
sequence 8 C TAT G
sequence 9 A GCT C
sequence 10 G TAG A
site 1, GTT = -1.32 - 0.32 + 0.68 = -0.96
site 2, TTT = 1.49 - 0.32 + 0.68 = 1.85
site 3, TTG = 1.49 - 0.32 - 0.32 = 0.85
site 1, GTT: log odds = -0.96, odds = 0.514, probability = 0.087
site 2, TTT: log odds = 1.85, odds = 3.605, probability = 0.609
site 3, TTG: log odds = 0.85, odds = 1.803, probability = 0.304
Red square, match state; green diamond, insert state; blue circle, delete state. Arrows indicate probability of going from one state to the next.
begin > match 1 = 0.7
match 1 (T) = 0.5
match 1 > match 2 = 0.7
match 2 (A) = 0.4
match 2 > match 3 = 0.7
match 3 (G) = 0.4
match 3 > end = 0.9
----------------------------
product = 0.0247
- Repeat part A for insert-match-delete-match path.
begin > insert 1 = 0.1
insert 1 = 0.25
insert 1 > match 1 = 1.0
match 1 (A) = 0.1
match 1 > delete 2 = 0.2
delete 2 = 1.0
delete 2 > match 3 = 1.0
match 3 (G) = 0.4
match 3 > end = 0.9
----------------------------
product = 0.00018
- The path from part A is more probable, since 0.0247 >> 0.00018. Path A is 0.0247/0.00018 = 137 times more probable than path B.
- The log odds scores for each position in the model are shown below:
Log odds score for alignment in part A
begin > match 1 = 1.09 bits
match 1 (T) = 1.00 bits
match 1 > match 2 = 1.09 bits
match 2 (A) = 0.68 bits
match 2 > match 3 = 1.09 bits
match 3 (G) = 0.68 bits
match 3 > end = 0.85 bits
-------------------------------
sum = 6.48 bits
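The path probabilities and the log odds sum can be checked with a short Python sketch. The transition and emission values are copied from the two paths listed above, and the per-step log odds are taken as given rather than recomputed, since the background probabilities behind them are not stated here. The variable names are illustrative.

```python
import math

# Part A: match-match-match path (transitions and emissions in order)
path_a = [0.7, 0.5, 0.7, 0.4, 0.7, 0.4, 0.9]
# Part B: insert-match-delete-match path
path_b = [0.1, 0.25, 1.0, 0.1, 0.2, 1.0, 1.0, 0.4, 0.9]

prob_a = math.prod(path_a)
prob_b = math.prod(path_b)
ratio = prob_a / prob_b

# Per-step log odds scores (bits) for the part A alignment, as listed
log_odds_a = [1.09, 1.00, 1.09, 0.68, 1.09, 0.68, 0.85]
total_bits = sum(log_odds_a)

print(round(prob_a, 4), round(prob_b, 5), round(ratio), round(total_bits, 2))
```

Working in log odds turns the product along a path into a sum, which avoids the very small numbers that appear when many probabilities are multiplied.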