From Mohammed J. Zaki

CompbioCourse: Assign2

HMM/Viterbi

Due Date:1st March, 2013, 11:59:59pm

In this assignment you will implement the viterbi algorithm to find the best path or state-sequence for a given input DNA sequence. The HMM is specified in the file hmm.txt, where M is the number of symbols (4 for DNA), N is the number of states, A is the transition probability matrix between states, pi is the initial state probabilities, and B gives the symbol emission probabilities for each state. The state numbers begin at 0.

The HMM is specified in compact form where only non-zero transition, emission and initial probabilities are specified. For example the first line under "A:" is 0 20 0.404762 which means that the transition probability between state "0" and state "20" is 0.404762. Likewise, the first line under "B:" is 0 A 1 which means that the probability of emitting symbol 'A' in state 0 is 1.0. Finally, the first line under "pi:" is 0 0.304392 which means that the initial probability of state "0" is 0.304392.

You may assume that "A:", "B;" and "pi:" appear alone on a single line. First you must read in all the transition probabilities (until you hit a "B:"), then read the emission probabilities unil you hit the "pi:", and finally all lines that follow are for "pi".

Make sure that you convert all probabilities into log, and take the summation instead of products to compute the best path through the HMM for the sequence (in fasta format) in Attach:test.txt

What to submit

Write a program called assign2-RCSID.xx (xx is 'r', 'py'), that outputs the best state state sequence for the input DNA sequence above. Also print the probability of that path. Please read the input sequence filename from the command line so I can test your code on other sequences. Do not hard code path names, but assume that the hmm.txt file is in the local dir. Email your script and output file to bioinfo.rpi@gmail.com. The subject of the email should be assign2-rcsid.

HINT: Add pi as transitions from a new "begin" state. Make the begin state as state 0 and increment the id of the the rest of the states by one. Then follow the viterbi method given in the DEKM chapter (see readings page on HMMs).

Here is an example output on the following test sequence:

>test2
AGCTCAGTTGCTTATGCGACACCA

The output is:

Log-prob: -53.8207431931
State seq: [0, 1, 24, 2, 23, 25, 1, 24, 3, 23, 4, 22, 3, 23, 1, 23, 4, 22, 4, 21, 2, 21, 2, 22, 1]
Original State seq: [0, 23, 1, 22, 24, 0, 23, 2, 22, 3, 21, 2, 22, 0, 22, 3, 21, 3, 20, 1, 20, 1, 21, 0]

The log (base 2) of the probability is given. Please note that State seq is the path after adding an extra "0" begin state, as in the hint. In terms of the original states, the path sequence is given after Original State seq

Retrieved from http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/CompbioCourse/Assign2
Page last modified on February 18, 2013, at 01:10 PM