Search:

# Assign4

## HMM/Viterbi

Due Date:20th March, 2019, Before Midnight

In this assignment you will implement the viterbi algorithm to find the best path or state-sequence for a given input DNA sequence. The HMM is specified in a file, where $$M$$ is the number of symbols (4 for DNA), $$N$$ is the number of states, $$A$$ is the transition probability matrix between states, and $$E$$ gives the symbol emission probabilities for each state. The state numbers begin at 0, with 0 being the begin state. There is no explicit end state. You can add one if you want.

The HMM is specified in compact form where only non-zero transition, and emission probabilities are specified. For example, one of the lines under "A:" is 1 21 0.404762, which means that the transition probability between state "1" and state "21" is 0.404762. Likewise, for the line under "E:" that reads 1 A 1.0 it means that the probability of emitting symbol 'A' in state 1 is 1.0.

You may assume that "A:" and "E:" appear alone on a single line. First you must read in all the transition probabilities (until you hit a "E:"), then read the emission probabilities.

Make sure that you convert all probabilities into log, and take the summation instead of products to compute the best path through the HMM for a given sequence (in fasta format). Read sec 3.6 in R12 for how/why of log transformation.

## What to submit

Write a program called assign4.py, that outputs the best state state sequence for the input DNA sequence. Also print the probability of that path. Your code will be run as:
assign4.py HMMFILE SEQ
where HMMFILE is the HMM parameters file, and SEQ is the input sequence file.

Test your code on the HMM file hmm.txt, and the input sequence file Attach:test.txt.

Here is an example output on the following test sequence on Attach:hmm.txt file.

>test2
AGCTCAGTTGCTTATGCGACACCA


The output is:

Log-prob: -53.8207431931
State seq: [0, 1, 24, 2, 23, 25, 1, 24, 3, 23, 4, 22, 3, 23, 1, 23, 4, 22, 4, 21, 2, 21, 2, 22, 1]


The log (base 2) of the probability is given. Please note that State seq is the optimal path that begins with the begin state "0".