Q1. Determine the alignment score for each of the following:
--TCTGTACGCGATCATGT
TAGC-GTCCGATAT-A---
AGATAGAAACTGATATATA
AG---AAAACAGAGT----
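Q1 does not state a scoring scheme, so the following is only a minimal sketch of column-by-column scoring. The values match=1, mismatch=0 and gap=-1 are an assumption borrowed from Q5, with a simple linear penalty for every gapped column; the function name `alignment_score` is hypothetical.

```python
def alignment_score(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a pairwise alignment column by column.

    Assumed scheme (not given in Q1): +match for identical residues,
    +mismatch for different residues, +gap for any column with a '-'.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # linear gap penalty, one per gapped column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Usage: the first alignment from Q1 under the assumed scheme.
print(alignment_score("--TCTGTACGCGATCATGT", "TAGC-GTCCGATAT-A---"))
```

An affine scheme (separate gap opening and extension penalties) would need the loop to track whether the previous column was also gapped.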
Q2. In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the frequency of this change expected from the number of occurrences of D and E in the column is 0.05.
What are the odds of finding a D to E substitution in an alignment?
What is the log odds score for the D to E substitution in bits? (Note: log2(x) = ln(x) / 0.693.)
What would be the entry in the BLOSUM amino acid scoring matrix for this substitution?
In the same column, D does not change at all at a frequency of 0.80, and the expected frequency of D not changing is 0.10. Calculate the corresponding log odds score and the BLOSUM entry for D not changing.
Use the BLOSUM scores from above to calculate the log odds and odds score of a simple sequence alignment. What is the log odds score of the following alignment in bits?
DEDEDEDE
DDDDDDDD
What is the odds score of the above alignment?
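The arithmetic behind these questions can be sketched as follows. The odds are the observed frequency divided by the expected frequency, the log odds score in bits is the base-2 logarithm of that ratio, and a matrix entry in half bits (the unit requested in Q3) is twice the bit score rounded to the nearest integer. The function names are hypothetical, and the sketch assumes that the columns of an alignment are scored independently, so that log odds add and odds multiply.

```python
import math


def odds(observed, expected):
    """Odds ratio: observed frequency over the frequency expected by chance."""
    return observed / expected


def log_odds_bits(observed, expected):
    """Log odds score in bits (log base 2 of the odds ratio)."""
    return math.log2(odds(observed, expected))


def matrix_entry(observed, expected):
    """Half-bit matrix entry: twice the bit score, rounded to an integer."""
    return round(2 * log_odds_bits(observed, expected))


def alignment_log_odds(top, bottom, pair_bits):
    """Sum per-column bit scores; keys are alphabetically sorted residue pairs."""
    return sum(pair_bits[tuple(sorted(pair))] for pair in zip(top, bottom))


# Bit scores for the two pair types described in the question.
pair_bits = {
    ("D", "E"): log_odds_bits(0.10, 0.05),
    ("D", "D"): log_odds_bits(0.80, 0.10),
}

total_bits = alignment_log_odds("DEDEDEDE", "DDDDDDDD", pair_bits)
total_odds = 2 ** total_bits  # adding log odds corresponds to multiplying odds
print(total_bits, total_odds)
```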
Q3. Assume that you are given the following block (local alignment) of 12
sequences:
WWYIR
WFYVR
WYYVR
WYFIR
WYYTR
WFYKR
WFYKR
WYYVR
WYYVR
WFYTR
WFYTR
WWYVR
Assume we cluster any sequence that occurs two or more times (i.e., cluster threshold 2/12 = 1/6). Compute the BLOSUM matrix from the clustered block. Express it in half bits.
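One way to organise the Q3 computation is sketched below. It reads the clustering rule as "identical sequences form one cluster", so each distinct sequence is counted once. It then follows the usual BLOSUM recipe: count residue pairs per column, convert the counts to observed pair frequencies q, derive residue frequencies p, and score each pair as 2*log2(q/e) rounded to an integer. The function name `blosum_half_bits` and this reading of the clustering rule are assumptions, and pairs that are never observed get no entry because their log odds are undefined.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def blosum_half_bits(block):
    """Half-bit log odds scores from a gap-free block of equal-length sequences."""
    # Assumed clustering: identical sequences collapse to one representative.
    seqs = sorted(set(block))
    width = len(seqs[0])

    # Count unordered residue pairs within every column.
    pair_counts = Counter()
    for col in range(width):
        column = [s[col] for s in seqs]
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequency: p_i = q_ii + sum over j != i of q_ij / 2.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


block = ["WWYIR", "WFYVR", "WYYVR", "WYFIR", "WYYTR", "WFYKR",
         "WFYKR", "WYYVR", "WYYVR", "WFYTR", "WFYTR", "WWYVR"]
for pair, score in sorted(blosum_half_bits(block).items()):
    print(pair, score)
```

If clusters were built from similar rather than identical sequences, each member would instead carry a weight of 1 divided by its cluster size, and pairs would be counted only between different clusters.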
Q4. Compare the alignment scores obtained with small and large gap penalties
in the following example.
For this question, use the program LALIGN on the University of Virginia FASTA server.
This program aligns sequences by a local dynamic programming algorithm and
includes end gap penalties. LALIGN produces as many different alignments
as specified, with no two alignments including a match of the same two sequence
positions.
Two sequences are provided in FASTA format: RECA from the bacterium E. coli and RAD51 from yeast. These proteins have the same function; i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.
Q5. Assume that you are given the following two groups of aligned sequences.
Use a global pairwise dynamic programming method to align these two groups
using the sum of pairs scoring method (use match=1, mismatch=0, gap = -1):
Group 1: AC--TCG
         ACAGTAG
Group 2: AGACGTG
         --ACGT-
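The group-to-group alignment can be sketched as ordinary global dynamic programming in which each cell compares whole columns instead of single residues. The sketch below makes two assumptions that the question does not state: only pairs that cross the two groups are scored (pairs within a group are fixed and add a constant), and a gap aligned with a gap scores 0. The function names are hypothetical, and only the optimal score is returned; recovering the alignment itself would need a traceback.

```python
def column_score(col1, col2, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score over all cross-group pairs of two columns."""
    total = 0
    for a in col1:
        for b in col2:
            if a == "-" and b == "-":
                continue            # assumption: gap versus gap scores 0
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def align_groups(group1, group2):
    """Optimal global sum-of-pairs score for aligning two aligned groups."""
    cols1 = list(zip(*group1))      # columns of group 1
    cols2 = list(zip(*group2))      # columns of group 2
    gap1 = ("-",) * len(group1)     # all-gap column inserted into group 1
    gap2 = ("-",) * len(group2)     # all-gap column inserted into group 2
    n, m = len(cols1), len(cols2)

    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + column_score(cols1[i - 1], gap2)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + column_score(gap1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                dp[i - 1][j] + column_score(cols1[i - 1], gap2),
                dp[i][j - 1] + column_score(gap1, cols2[j - 1]),
            )
    return dp[n][m]


print(align_groups(["AC--TCG", "ACAGTAG"], ["AGACGTG", "--ACGT-"]))
```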
Q6. Assume we are using the tunneling method to search only within a specified
region for a multiple sequence alignment. Let there be three sequences of
lengths 3, 4 and 5. Assuming a tunnel of width 2 around the main diagonal,
use the projection approach to calculate if the cell (1,3,4) is within the
tunnel. Show all calculations.
Q7. Using the CLUSTALW program, align the provided set of proteins in the
RAD51-RECA group. These proteins
promote homologous DNA strand interactions during genetic recombination between
DNA molecules.
CLUSTALW is available for PCs and also on a Web site at EBI. Copy and paste the sequence file into the CLUSTALW
data window (sequence is in FASTA format). Just use the default conditions
provided by the program.
Note the two kinds of multiple sequence alignment output formats. One
is the ALN format with numbers, and the
second is the FASTA format with the aligned sequences joined end to end in
FASTA format, with gaps in each sequence corresponding to the alignment.
Q8. When a multiple sequence alignment can be made, we can pick out the most conserved regions (motifs), make a scoring matrix, and search for other sequences that have the same motif. The matrix takes into account the variation found in the sequences. Here we make a position-specific scoring matrix (also called a PSSM or weight matrix) from part of a given multiple sequence alignment and use the matrix to scan a sequence.
Here is a table showing the frequency of each base in an alignment that is 4 bases long.
sequence 1 C CAG A
sequence 2 G TTA A
sequence 3 G TAC C
sequence 4 T TAT T
sequence 5 C AGA T
sequence 6 T TTT G
sequence 7 A TAC T
sequence 8 C TAT G
sequence 9 A GCT C
sequence 10 G TAG A
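The procedure outlined in Q8 (build a PSSM from aligned sites, then slide it along a sequence) can be sketched as below. The four aligned sites and the scanned sequence are made-up toy data, not taken from the table above. The uniform background frequency of 0.25, the single pseudocount shared across the four bases, and the function names are likewise assumptions.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Return one {base: log2 odds} dict per column of the aligned sites."""
    n = len(sites)
    width = len(sites[0])
    pssm = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        column = {}
        for base in BASES:
            # The pseudocount keeps unseen bases from giving log(0).
            freq = (counts[base] + pseudocount * background) / (n + pseudocount)
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) tuples."""
    width = len(pssm)
    return [
        (start, sum(pssm[k][sequence[start + k]] for k in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical toy alignment of four 4-base sites.
toy_sites = ["TATA", "TATT", "TACA", "AATA"]
toy_pssm = build_pssm(toy_sites)
hits = scan("GGTATAGG", toy_pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Windows with positive scores resemble the motif more than the background does; a real search would normally apply a score threshold instead of keeping only the best window.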
Red square, match state; green diamond, insert state; blue circle, delete state. Arrows indicate the probability of going from one state to the next.