Due Date: Mar 25th, before Midnight
In this assignment you will implement the Gibbs Sampling algorithm to discover the most prominent motif in the dataset hm20r.fasta. The dataset consists of real upstream regions from human genes.
The output should consist of the following information:
Note that since a lot depends on the initial start positions, you should run the Gibbs algorithm multiple times and report the best motif found. The best motif can be found by the profile score. The higher the total sum the better the profile.
For more information on the Gibbs algorithm, consult the reading material R10.
Stopping Criteria For the stopping condition, just compare the the squared difference between the profile score for the old Motif and the new Motif after each round (after processing each sequences once), where the profile score for a motif is defined as the sum of the max probability at each position in the motif. If the difference is at most \epsilon=0.001 you may stop.
The following is the output on test.fasta
Motif Profile: A [ 1. 0. 0. 1. 0. 0. 1. 0. 0. 0.] C [ 0. 1. 0. 0. 1. 1. 0. 0. 1. 0.] G [ 0. 0. 0. 0. 0. 0. 0. 1. 0. 1.] T [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] Motif Score: 10.0 Background Profile: A 0.244 C 0.2575 G 0.229 T 0.2695 Consensus: A C T A C C A G C G Start Positions: (0, 2) (1, 2) (2, 6) (3, 24) (4, 11) (5, 29) (6, 18) (7, 39) (8, 25) (9, 12) (10, 20) (11, 12) (12, 27) (13, 25) (14, 15) (15, 12) (16, 28) (17, 16) (18, 23) (19, 35) (20, 33) (21, 37) (22, 33) (23, 18) (24, 10) (25, 4) (26, 27) (27, 7) (28, 1) (29, 39) (30, 22) (31, 23) (32, 24) (33, 28) (34, 23) (35, 2) (36, 0) (37, 20) (38, 22) (39, 36) (40, 5) (41, 11) (42, 22) (43, 21) (44, 37) (45, 30) (46, 34) (47, 9) (48, 6) (49, 11)
Write a program called assign3-RCSID.xx (xx is 'r', 'py'). Do not hard code any file names. Your program should accept as input from the command line the length L for the motif, and the input file name (e.g., hm20r.fasta). I need to be able to run your program as "assign3-rcsid.xx L FNAME". I will try other input files and L values. For your output use L=10.
Email your script and output file to bioinfo.rpi@gmail.com. The subject of the email should be assign3-rcsid.