Homework #2

Q1. Determine the alignment score for each of the following:

Q2. Calculate the log odds and odds scores of a sequence alignment using the BLOSUM scoring matrix approach.

In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the number of times this change is expected based on the number of occurrences of D and E in the column is 0.05.

  • What are the odds of finding a D to E substitution in an alignment?
  • What is the log odds score for the D to E substitution in bits? (Note: log2 = natural log / 0.693.)
  • What would be the entry in the BLOSUM amino acid scoring matrix for this substitution?
  • In the same column, D does not change at all at a frequency of 0.80, and the expected frequency of D not changing is 0.10. Calculate the corresponding log odds score and the BLOSUM entry for D not changing.
  • Use the BLOSUM scores from above to calculate the log odds and odds score of a simple sequence alignment.
  • What is the log odds score of the following alignment in bits?
      DEDEDEDE
      DDDDDDDD
  • What is the odds score of the above alignment?
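The arithmetic behind Q2 can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the BLOSUM entry is assumed to be the log odds score in half bits (2 x log2 of the odds), rounded to the nearest integer.

```python
import math

# (observed frequency, expected frequency) for each pair, taken from Q2.
PAIR_FREQS = {
    ("D", "E"): (0.10, 0.05),
    ("D", "D"): (0.80, 0.10),
}

def odds(observed, expected):
    """Odds ratio: observed frequency divided by expected frequency."""
    return observed / expected

def log_odds_bits(observed, expected):
    """Log odds score in bits (log base 2 of the odds ratio)."""
    return math.log2(odds(observed, expected))

def blosum_entry(observed, expected):
    """Assumption: matrix entries are half-bit scores rounded to integers."""
    return round(2 * log_odds_bits(observed, expected))

def alignment_log_odds(seq_a, seq_b, pair_freqs=PAIR_FREQS):
    """Sum the per-column log odds scores (bits) of an ungapped alignment."""
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        # Pairs are unordered, so look the pair up in either orientation.
        key = (a, b) if (a, b) in pair_freqs else (b, a)
        total += log_odds_bits(*pair_freqs[key])
    return total

if __name__ == "__main__":
    bits = alignment_log_odds("DEDEDEDE", "DDDDDDDD")
    print("log odds (bits):", bits)
    print("odds:", 2 ** bits)
```

Because log odds scores add across columns, the odds score of the whole alignment is 2 raised to the total number of bits, which equals the product of the per-column odds.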

Q3. Assume that you are given the following block (local alignment) of 12 sequences:

    WWYIR
    WFYVR
    WYYVR
    WYFIR
    WYYTR
    WFYKR
    WFYKR
    WYYVR
    WYYVR
    WFYTR
    WFYTR
    WWYVR
Assume we cluster any sequence that occurs two or more times (i.e., cluster threshold 2/12 = 1/6). Compute the BLOSUM matrix from the clustered block. Express it in half bits.
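One possible way to organize the Q3 calculation is sketched below. It rests on assumptions that the assignment does not spell out: identical sequences are taken to form one cluster, each cluster carries a total weight of 1 (so a cluster of identical sequences behaves like a single representative), pairs are counted only between different clusters, and scores are 2 x log2(observed/expected) rounded to the nearest integer. The function name is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

BLOCK = [
    "WWYIR", "WFYVR", "WYYVR", "WYFIR", "WYYTR", "WFYKR",
    "WFYKR", "WYYVR", "WYYVR", "WFYTR", "WFYTR", "WWYVR",
]

def blosum_from_block(block):
    """Half-bit log odds scores for every residue pair observed in the block."""
    # Assumption: identical sequences form one cluster of total weight 1,
    # so one representative per cluster is enough.
    reps = list(dict.fromkeys(block))

    # Count residue pairs column by column across different clusters.
    pair_counts = defaultdict(float)
    for s, t in combinations(reps, 2):
        for a, b in zip(s, t):
            pair_counts[tuple(sorted((a, b)))] += 1.0

    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Marginal residue frequencies implied by the pair frequencies.
    p = defaultdict(float)
    for (a, b), value in q.items():
        if a == b:
            p[a] += value
        else:
            p[a] += value / 2
            p[b] += value / 2

    # Pairs that never occur have no finite log odds score and are omitted.
    scores = {}
    for (a, b), value in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(value / expected))
    return scores

if __name__ == "__main__":
    for pair, score in sorted(blosum_from_block(BLOCK).items()):
        print(pair, score)
```

Keys are stored as alphabetically sorted pairs, so the score for a pair such as (Y, F) is found under ("F", "Y").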

Q4. Compare the alignment scores obtained with small and large gap penalties in the following example.
For this question, use the program LALIGN on the University of Virginia FASTA server. This program aligns sequences by a local dynamic programming algorithm and includes end gap penalties. LALIGN produces as many different alignments as specified, with no two alignments including a match of the same two sequence positions.

Two sequences are provided in FASTA format: RECA from the bacterium E. coli and RAD51 from yeast. These proteins have the same function; i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.


Q5. Assume that you are given the following two groups of aligned sequences. Use a global pairwise dynamic programming method to align these two groups using the sum of pairs scoring method (use match=1, mismatch=0, gap = -1):

    Group 1: AC--TCG
             ACAGTAG
    Group 2: AGACGTG
             --ACGT-
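The group-to-group alignment in Q5 can be sketched as ordinary global dynamic programming in which each "letter" is a whole alignment column. The code below is a minimal sketch under stated assumptions: only pairs that cross the two groups are scored, a gap facing a gap scores 0, and a column skipped in one group is scored against an all-gap column of the other group. The function names are illustrative, and only the optimal score is returned (no traceback).

```python
def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # assumption: gap against gap is free
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(col_a, col_b):
    """Sum of pairs over all cross-group symbol pairs of two columns."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)

def align_groups(group1, group2):
    """Best global sum-of-pairs score for aligning two fixed alignments."""
    cols1 = list(zip(*group1))  # columns of the first alignment
    cols2 = list(zip(*group2))  # columns of the second alignment
    gap1 = ("-",) * len(group1)
    gap2 = ("-",) * len(group2)
    n, m = len(cols1), len(cols2)

    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + column_score(cols1[i - 1], gap2)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + column_score(gap1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                dp[i - 1][j] + column_score(cols1[i - 1], gap2),
                dp[i][j - 1] + column_score(gap1, cols2[j - 1]),
            )
    return dp[n][m]

if __name__ == "__main__":
    print(align_groups(["AC--TCG", "ACAGTAG"], ["AGACGTG", "--ACGT-"]))
```

Pairs inside a group are left out because, with gap-gap pairs scoring 0, their contribution does not depend on how the two groups are aligned to each other.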


Q6. Assume we are using the tunneling method to search only within a specified region for a multiple sequence alignment. Let there be three sequences of lengths 3, 4 and 5. Assuming a tunnel of width 2 around the main diagonal, use the projection approach to calculate whether the cell (1,3,4) is within the tunnel. Show all calculations.

Q7. Using the CLUSTALW program, align the provided set of proteins in the RAD51-RECA group. These proteins promote homologous DNA strand interactions during genetic recombination between DNA molecules.
CLUSTALW is available for PCs and also on a Web site at EBI. Copy and paste the sequence file into the CLUSTALW data window (the sequences are in FASTA format). Just use the default conditions provided by the program.

Note the two kinds of multiple sequence alignment output formats. One is the ALN format with numbers, and the second is the FASTA format, with the aligned sequences joined end to end and gaps in each sequence corresponding to the alignment.

Q8. When a multiple sequence alignment can be made, we can pick out the most conserved regions (motifs), make a scoring matrix, and search for other sequences that have the same motif. The matrix will take into account the variation found in the sequences. We will make a position-specific scoring matrix (also called a PSSM or weight matrix) for part of a given multiple sequence alignment and use the matrix to scan a sequence.

Here is a table showing the frequency of each base in an alignment that is 4 bases long.
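As a rough illustration of how a PSSM is built and used to scan a sequence, here is a small Python sketch. The frequencies in EXAMPLE_FREQS are hypothetical placeholders and are not the values from the table for this question; the uniform background of 0.25 per base, the example sequence, and the function names are likewise assumptions made for the sketch.

```python
import math

# Hypothetical base frequencies for a 4-position motif (NOT the assignment's
# table). One dictionary per motif position; all values must be nonzero.
EXAMPLE_FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

def build_pssm(freqs, background=0.25):
    """Convert per-position frequencies into log odds scores in bits."""
    return [
        {base: math.log2(f / background) for base, f in column.items()}
        for column in freqs
    ]

def scan(sequence, pssm):
    """Score every window of the motif width; returns (start, score) pairs."""
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[k][sequence[start + k]] for k in range(width))
        results.append((start, score))
    return results

if __name__ == "__main__":
    pssm = build_pssm(EXAMPLE_FREQS)
    for start, score in scan("TTACGTAA", pssm):
        print(start, round(score, 2))
```

The window with the highest score is the best candidate match to the motif. A frequency of zero would need a pseudocount before taking the logarithm.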

Q9. Analyze the following 10 DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle 3 positions are a motif. The size of the motif is an informed guess based on a molecular model, and the alignment of the sequences is also an informed guess.
    sequence 1   C CAG A
    sequence 2   G TTA A
    sequence 3   G TAC C
    sequence 4   T TAT T
    sequence 5   C AGA T
    sequence 6   T TTT G
    sequence 7   A TAC T
    sequence 8   C TAT G
    sequence 9   A GCT C
    sequence 10  G TAG A
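The first pass of the expectation maximization analysis can be sketched as follows. This is a minimal sketch, not a full EM implementation: it estimates the motif column frequencies from the guessed alignment (motif at the middle 3 positions, index 1 when counting from 0), then computes, for one sequence, the relative probability of each possible motif start using the stated background of 0.25. No pseudocounts are used, and the function names are illustrative.

```python
from collections import Counter

# The ten 5-base sequences from Q9, written without the spacing.
SEQS = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]

def motif_frequencies(seqs, start, width):
    """Base frequencies in each motif column for a fixed motif start."""
    columns = []
    for k in range(width):
        counts = Counter(s[start + k] for s in seqs)
        columns.append({base: counts[base] / len(seqs) for base in "ACGT"})
    return columns

def site_probabilities(seq, columns, background=0.25):
    """Expectation step for one sequence: probability of each motif start.

    Positions outside the motif contribute background/background = 1,
    so only the motif columns appear in the odds.
    """
    width = len(columns)
    odds = []
    for start in range(len(seq) - width + 1):
        value = 1.0
        for k in range(width):
            value *= columns[k][seq[start + k]] / background
        odds.append(value)
    total = sum(odds)
    return [value / total for value in odds]

if __name__ == "__main__":
    cols = motif_frequencies(SEQS, start=1, width=3)
    for position, col in enumerate(cols, start=1):
        print("motif column", position, col)
    print(site_probabilities(SEQS[0], cols))
```

In a full EM run, the site probabilities of all sequences would be used as weights to re-estimate the column frequencies, and the two steps would be repeated until the estimates stop changing.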
Q10. MEME is a server that takes a set of sequences as input and finds an alignment by the expectation maximization method. Paste the same unaligned RECA-RAD51 sequences (NOT the aligned ones) into the window and use the defaults provided by the program. Try other options, such as the "optimal" number of motifs. Summarize your output. List all motifs (do not cut and paste all the output, just the relevant parts).

Q11. Analyze the following 10 DNA sequences for a conserved pattern by the Gibbs sampling algorithm.
    sequence 1   C CAG A
    sequence 2   G TTA A
    sequence 3   G TAC C
    sequence 4   T TAT T
    sequence 5   C AGA T
    sequence 6   T TTT G
    sequence 7   A TAC T
    sequence 8   C TAT G
    sequence 9   A GCT C
    sequence 10  G TAG A
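A simplified Gibbs sampler for these sequences might look like the sketch below. Several choices are assumptions rather than part of the question: the motif width of 3 is carried over from Q9, the background is fixed at 0.25 per base instead of being estimated from non-motif positions, a pseudocount of 1 is added to every base count, and the iteration count and random seed are arbitrary.

```python
import random
from collections import Counter

SEQS = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]

def gibbs_motif(seqs, width, iterations=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Begin with a random motif start in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        # Hold one sequence out and build a profile from all the others.
        held = rng.randrange(len(seqs))
        others = [(s, st) for i, (s, st) in enumerate(zip(seqs, starts))
                  if i != held]
        total = len(others) + 4 * pseudocount
        profile = []
        for k in range(width):
            counts = Counter(s[st + k] for s, st in others)
            profile.append({b: (counts[b] + pseudocount) / total
                            for b in "ACGT"})

        # Weight each candidate start in the held-out sequence by its odds
        # against the (assumed uniform) background, then sample one.
        seq = seqs[held]
        weights = []
        for st in range(len(seq) - width + 1):
            w = 1.0
            for k in range(width):
                w *= profile[k][seq[st + k]] / 0.25
            weights.append(w)
        starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

if __name__ == "__main__":
    result = gibbs_motif(SEQS, width=3)
    for seq, st in zip(SEQS, result):
        print(seq, st, seq[st:st + 3])
```

Because the new position is sampled rather than chosen as the maximum, the sampler can escape poor starting alignments; different seeds may give different final positions on such a small data set.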
Q12. Consider the hidden Markov model (HMM) below:
Red square, match state; green diamond, insert state; blue circle, delete state. Arrows indicate the probability of going from one state to the next.