Recent Changes - Search:

Main Page

Syllabus

Readings

Papers

Piazza Site

Submitty

Assignments

edit SideBar

Assign4

HMM/Viterbi

Due Date:20th March, 2019, Before Midnight

In this assignment you will implement the viterbi algorithm to find the best path or state-sequence for a given input DNA sequence. The HMM is specified in a file, where \(M\) is the number of symbols (4 for DNA), \(N\) is the number of states, \(A\) is the transition probability matrix between states, and \(E\) gives the symbol emission probabilities for each state. The state numbers begin at 0, with 0 being the begin state. There is no explicit end state. You can add one if you want.

The HMM is specified in compact form where only non-zero transition, and emission probabilities are specified. For example, one of the lines under "A:" is 1 21 0.404762, which means that the transition probability between state "1" and state "21" is 0.404762. Likewise, for the line under "E:" that reads 1 A 1.0 it means that the probability of emitting symbol 'A' in state 1 is 1.0.

You may assume that "A:" and "E:" appear alone on a single line. First you must read in all the transition probabilities (until you hit a "E:"), then read the emission probabilities.

Make sure that you convert all probabilities into log, and take the summation instead of products to compute the best path through the HMM for a given sequence (in fasta format). Read sec 3.6 in R12 for how/why of log transformation.


What to submit

Write a program called assign4.py, that outputs the best state state sequence for the input DNA sequence. Also print the probability of that path. Your code will be run as:
assign4.py HMMFILE SEQ
where HMMFILE is the HMM parameters file, and SEQ is the input sequence file.

Test your code on the HMM file hmm.txt, and the input sequence file Attach:test.txt.

Here is an example output on the following test sequence on Attach:hmm.txt file.

>test2
AGCTCAGTTGCTTATGCGACACCA

The output is:

Log-prob: -53.8207431931
State seq: [0, 1, 24, 2, 23, 25, 1, 24, 3, 23, 4, 22, 3, 23, 1, 23, 4, 22, 4, 21, 2, 21, 2, 22, 1]

The log (base 2) of the probability is given. Please note that State seq is the optimal path that begins with the begin state "0".

Edit - History - Print - Recent Changes - Search
Page last modified on March 12, 2019, at 08:09 PM