Assign 3

Motif Discovery: Gibbs Sampling

Due Date: Mar 25th, before Midnight

In this assignment you will implement the Gibbs Sampling algorithm to discover the most prominent motif in the dataset hm20r.fasta. The dataset consists of real upstream regions from human genes.

The output should consist of the following information:

  1. The motif profile (a 4xL matrix)
  2. The motif score (the sum of the max probability at each position)
  3. The background profile (4x1 matrix)
  4. The consensus sequence for the motif (majority symbol at each position)
  5. The list of start positions of all the motifs in each of the sequences in the format (id,pos), one per line

Note that since a lot depends on the initial start positions, you should run the Gibbs algorithm multiple times and report the best motif found. The best motif can be found by the profile score. The higher the total sum the better the profile.

For more information on the Gibbs algorithm, consult the reading material R10.

Stopping Criteria For the stopping condition, just compare the the squared difference between the profile score for the old Motif and the new Motif after each round (after processing each sequences once), where the profile score for a motif is defined as the sum of the max probability at each position in the motif. If the difference is at most \epsilon=0.001 you may stop.

Example output

The following is the output on test.fasta

Motif Profile:
A [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  0.]
C [ 0.  1.  0.  0.  1.  1.  0.  0.  1.  0.]
G [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
T [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
Motif Score: 10.0
Background Profile:
A 0.244
C 0.2575
G 0.229
T 0.2695
Consensus:  A C T A C C A G C G
Start Positions:
(0, 2)
(1, 2)
(2, 6)
(3, 24)
(4, 11)
(5, 29)
(6, 18)
(7, 39)
(8, 25)
(9, 12)
(10, 20)
(11, 12)
(12, 27)
(13, 25)
(14, 15)
(15, 12)
(16, 28)
(17, 16)
(18, 23)
(19, 35)
(20, 33)
(21, 37)
(22, 33)
(23, 18)
(24, 10)
(25, 4)
(26, 27)
(27, 7)
(28, 1)
(29, 39)
(30, 22)
(31, 23)
(32, 24)
(33, 28)
(34, 23)
(35, 2)
(36, 0)
(37, 20)
(38, 22)
(39, 36)
(40, 5)
(41, 11)
(42, 22)
(43, 21)
(44, 37)
(45, 30)
(46, 34)
(47, 9)
(48, 6)
(49, 11)

What to submit

Write a program called assign3-RCSID.xx (xx is 'r', 'py'). Do not hard code any file names. Your program should accept as input from the command line the length L for the motif, and the input file name (e.g., hm20r.fasta). I need to be able to run your program as "assign3-rcsid.xx L FNAME". I will try other input files and L values. For your output use L=10.

Email your script and output file to bioinfo.rpi@gmail.com. The subject of the email should be assign3-rcsid.