Search:

# Motif Discovery

Due Date: Feb 22th, before Midnight

The Escherichia coli cyclic AMP receptor protein (CRP) is an important transcription factor that regulates transcription initiation for more than 100 genes mainly involved in catabolism of carbon sources other than glucose. CRP regulates transcription initiation by binding to DNA, and its binding site motif is as shown in the sequence logo:

In this assignment you will implement the Gibbs Sampling based algorithm to discover the most prominent motif of length $$L$$ in a given dataset of DNA sequences. In particular, we are interested in Gibbs sampling to find the motif of length $$L=22$$ in the CRP promoter sequences Attach:CRP-ECRDB62A-intg.fasta

For more information on the Gibbbs sampling algorithm, consult Sec. 10.6.1, Figure 10.9, in the supplementary reading material R11.

Given the profile matrix for a motif, define the profile score of the motif as the sum of the max log(probability) at each position in the motif. More formally, given a $$4 \times L$$ profile M, where $$M[i,j]$$ records the probability of observing base $$i$$ as position $$j$$ in the motif, we define the profile score as follows:
$$score(M) = \sum_{j=1}^L \log ( \max_{i=1}^4 \{ M[i,j] \} )$$

Stopping Criteria: For the stopping condition, just compare the the squared difference between the profile score for the old Motif and the new Motif after each round (after processing each of the sequences once). If the difference is equal to or less than $$\epsilon$$ you may stop.

Note that since a lot depends on the initial start positions, you should run the program multiple times (using the parameter $$R$$) and report the best motif found. The output should consist of the following information for the best motif:

1. The motif profile (a $$4 \times L$$ matrix)
2. The motif profile score
3. The background profile ( $$4\times 1$$ matrix)
4. The consensus sequence for the motif
5. The list of sequence id, start position, and the score of the $$L$$-length match in the format (id,pos,score), one per line, where score is $$\log(P(M))$$, i.e., the log of the probability of that L-length subsequence at position pos according to the profile matrix

### Example Output

Gibbs-Sampling output on Attach:example.fasta using $$L=10$$, $$R=50$$ and $$\epsilon=0.001$$ is as follows:

Motif Profile:
A [ 0.84  0.    0.08  0.86  0.08  0.    0.8   0.    0.06  0.14]
C [ 0.04  0.96  0.12  0.1   0.72  0.84  0.08  0.06  0.76  0.  ]
G [ 0.12  0.    0.04  0.    0.12  0.1   0.    0.88  0.08  0.86]
T [ 0.    0.04  0.76  0.04  0.08  0.06  0.12  0.06  0.1   0.  ]
Motif Score: -1.91952922948
Background Profile:
A 0.252
C 0.252
G 0.253
T 0.243
Consensus:  A C T A C C A G C G
Start Positions:
(0, 11) -6.09943968003
(1, 34) -10.6763060899
(2, 13) -6.45570654398
(3, 35) -5.89358762583
(4, 6) -6.73333828058
(5, 12) -4.17082102809
(6, 3) -4.22211432247
(7, 20) -12.4461336824
(8, 37) -12.7447725533
(9, 38) -5.09758305983
(10, 5) -6.06794101297
(11, 32) -5.76255936342
(12, 19) -6.0138737917
(13, 1) -1.91952922948
(14, 8) -6.8098783577
(15, 9) -4.22211432247
(16, 6) -3.81664921436
(17, 27) -1.91952922948
(18, 4) -5.83952040456
(19, 17) -3.76535591998
(20, 35) -6.35034602832
(21, 3) -6.80233115207
(22, 11) -4.17082102809
(23, 39) -4.11675380681
(24, 21) -4.45850310054
(25, 18) -3.94767747677
(26, 35) -5.98611099472
(27, 22) -4.07129143274
(28, 20) -4.17082102809
(29, 18) -4.04776093533
(30, 27) -6.06266395587
(31, 35) -5.68072934517
(32, 36) -1.91952922948
(33, 6) -6.80287213125
(34, 13) -4.17082102809
(35, 25) -7.45423537409
(36, 30) -5.96258049731
(37, 12) -6.65572767787
(38, 1) -5.93204377345
(39, 40) -14.072863092
(40, 4) -1.91952922948
(41, 20) -1.91952922948
(42, 12) -6.58673480639
(43, 36) -7.04349320888
(44, 5) -15.3125668629
(45, 25) -1.91952922948
(46, 11) -3.73481919612
(47, 10) -4.07129143274
(48, 11) -5.88658139938
(49, 34) -4.11675380681


### What to submit

Use submitty to submit the script file named Assign3.py, and the output file (Assign3.txt).

Your code will be run as:
Assign3-RCSID.py FILE $$L$$ $$\epsilon$$ $$R$$

where FILE is the set of input sequences in FASTA format (each sequence has a header line beginning with '>' and some name/other info and the next line has the actual sequence; there are multiple such sequences in the same file), L is the motif length, $$\epsilon$$ is the stopping threshold, and $$R$$ is the number of runs to try.

Show your output on the CRP file Attach:CRP-ECRDB62A-intg.fasta using $$L=22$$.