# Assign2: Maximal Unique Matches (MUMs): Due Fri 15th Feb, before midnight

In this assignment you will find the maximal unique matches (MUMs) between two viral genomes: Bombyx mori nucleopolyhedrovirus (BmNPV; RefSeq ID NC_001962.1) and Autographa californica multiple nucleopolyhedrovirus (AcMNPV; RefSeq ID NC_001623.1). BmNPV is a baculovirus that selectively infects the domestic silkworm, and AcMNPV is also a baculovirus that infects butterflies and moths.

To find MUMs, you can use any public implementation of suffix trees in Python and traverse the tree as explained in the notes/reading material. Namely, first construct a generalized suffix tree that includes both genomes, and then look for each non-leaf node with exactly two leaves below it -- one from BmNPV and one from AcMNPV. If such a node is also left-diverse (the characters immediately preceding the two occurrences differ, or one occurrence starts at the very beginning of its sequence), then its path label must be a MUM. In addition, you must filter the MUMs based on a minlength parameter, reporting only the MUMs whose length is at least that threshold.
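The node test described above can be sketched independently of any particular suffix tree library. This is a minimal sketch, assuming you have already collected the leaves below a node as `(seq_index, start)` pairs, with `seq_index` 0 for BmNPV and 1 for AcMNPV; the function name and the leaf representation are hypothetical, and your library may expose leaves differently.

```python
def is_mum_node(leaves, seqs):
    """Decide whether an internal node's path label is a MUM.

    leaves: list of (seq_index, start) pairs, one per leaf below the node
            (hypothetical representation, not a library API).
    seqs:   the two sequences, seqs[0] and seqs[1].
    """
    # A MUM occurs exactly once in each sequence, so exactly two leaves.
    if len(leaves) != 2:
        return False
    (a, i), (b, j) = sorted(leaves)
    # One leaf must come from each sequence.
    if (a, b) != (0, 1):
        return False
    # Left-diverse: nothing to the left, or the preceding characters differ.
    if i == 0 or j == 0:
        return True
    return seqs[0][i - 1] != seqs[1][j - 1]
```

For instance, with seq1='ACAGATA' and seq2='GATACA', the node for 'CA' has one leaf from each sequence (starts 1 and 4), but both occurrences are preceded by 'A', so it is not left-diverse and is not a MUM.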

For example, if seq1='ACAGATA' and seq2='GATACA', and minlength=2, your code should output the MUMs in the following format:
'ACA', seq1[0:3], seq2[3:6]
'GATA', seq1[3:7], seq2[0:4]
number of MUMs=2

For this example there are only 2 MUMs, ACA and GATA. ACA appears as seq1[0:3], since the first three chars of seq1 are ACA, and as seq2[3:6], since ACA appears in seq2 starting at position 3 (counting from 0 as the first position).

So your code should output each MUM, followed by its start and end positions in each sequence, and then it should print the total number of MUMs found. Make minlength a parameter, so if I run the code with minlength=4, you would only report GATA for the example above.
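To check your output on small inputs, a brute-force reference can be handy. The sketch below is not the suffix tree method required for the assignment, and it is far too slow for the full genomes; it simply applies the MUM definition directly (maximal in both directions, unique in both sequences). All function names here are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def brute_force_mums(seq1, seq2, minlength):
    """Return a list of (match, start1, start2) for every MUM of length >= minlength."""
    mums = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            if seq1[i] != seq2[j]:
                continue
            # Skip pairs that can be extended to the left (not left-maximal).
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            n = 0
            while (i + n < len(seq1) and j + n < len(seq2)
                   and seq1[i + n] == seq2[j + n]):
                n += 1
            if n < minlength:
                continue
            match = seq1[i:i + n]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(seq1, match) == 1 and count_occurrences(seq2, match) == 1:
                mums.append((match, i, j))
    return mums


def format_mum(match, i, j):
    """Format one MUM in the output style shown in the example."""
    n = len(match)
    return "'{}', seq1[{}:{}], seq2[{}:{}]".format(match, i, i + n, j, j + n)
```

Calling `brute_force_mums('ACAGATA', 'GATACA', 2)` and passing each result through `format_mum` reproduces the two lines of the example; with minlength=4 only GATA remains.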

## What to submit

Write a python script called mum.py, which will be run as:
mum.py minlength

Your script should be run on the full genome sequences of BmNPV and AcMNPV, which you can get from NCBI. Submit the output file, output.txt, for minlength=100. Submit the script and the output file via submitty.
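As a starting point for the script's input handling, here is a minimal sketch that reads a genome from a FASTA file and parses the minlength argument. It assumes each genome was downloaded from NCBI as a single-record FASTA file; the helper names are hypothetical, and converting to uppercase is a choice made here, not a requirement.

```python
def read_fasta(path):
    """Read a single-record FASTA file and return the sequence as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the '>' header line.
            if not line or line.startswith(">"):
                continue
            parts.append(line.upper())
    return "".join(parts)


def parse_minlength(argv):
    """Extract minlength from an argument list shaped like sys.argv."""
    if len(argv) != 2:
        raise SystemExit("usage: mum.py minlength")
    return int(argv[1])
```

In mum.py you would call `parse_minlength(sys.argv)` and `read_fasta(...)` once per genome, using whatever file names you saved the two sequences under.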

As for the suffix tree implementation, do not try to implement your own; use an open-source implementation with a Python API. See below for a suggestion.

## Suffix Tree

There is an efficient suffix tree implementation available at http://www.daimi.au.dk/~mailund/suffix_tree.html. It is implemented in C and provides a Python API, as documented on the webpage and as you can see from the suffix_tree.py file. You can install it by running "python setup.py install" after extracting the tar file. The only drawback is that this implementation works only with Python 2.7, since the Python API it exports is for Python 2.7, so you will have to install Python 2.7 to use it.

Here is how this package works. It provides a GeneralisedSuffixTree call, to which you supply the 2 sequences. So if you call it on our example sequences, then GeneralisedSuffixTree(['ACAGATA','GATACA']) will build the suffix tree for the following concatenated string:
'ACAGATA1GATACA2$' Basically, it adds a terminal char '1' at the end of seq1 and '2' at the end of seq2, and also adds a '$' symbol automatically. It also provides functions to traverse the tree, check if a node is a leaf node, and so on.
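To make the concatenation concrete, the sketch below rebuilds the string described above in plain Python and maps a position in it back to a sequence and an offset. It does not use the suffix_tree package at all, and both helper names are hypothetical; check suffix_tree.py to see what position information the library's leaves actually expose before relying on a mapping like this.

```python
def concatenate(seqs):
    """Join sequences as described: terminator '1' after the first, '2' after the second, then '$'."""
    return "".join(s + str(k) for k, s in enumerate(seqs, 1)) + "$"


def locate(pos, seqs):
    """Map a position in the concatenated string to (seq_index, offset).

    seq_index is 0-based. offset is None when pos falls on a terminator,
    and the result is (None, None) for the final '$'.
    """
    start = 0
    for k, s in enumerate(seqs):
        if pos < start + len(s):
            return k, pos - start
        if pos == start + len(s):
            return k, None  # the terminator character of sequence k
        start += len(s) + 1
    return None, None
```

For the example, `concatenate(['ACAGATA', 'GATACA'])` gives 'ACAGATA1GATACA2$', and position 8 in that string maps to offset 0 of the second sequence, which is where GATA starts.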