
Assign5

Assign 5 (Due: 4/10/19) and Project (Due: 4/24/19, with an update due 4/17/19)

The goal of this project is to explore the efficacy of deep learning for inferring the universal genetic code. The aim is for the deep neural networks to automatically learn to translate DNA into proteins.

The data we will be using is the complete genome of yeast from NCBI. In particular, the yeast genome contains 5018 genes with translated proteins. The file yeast-genes-prots.txt contains the complete set of (gene, protein) pairs for yeast. Each line contains three space-separated entries:
geneid SPACE gene-sequence SPACE translated-protein-sequence
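A minimal sketch of loading this file, assuming the three-column space-separated layout described above (the function name is illustrative):

```python
def load_pairs(path="yeast-genes-prots.txt"):
    """Read (gene_id, gene_sequence, protein_sequence) triples."""
    pairs = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            gene_id, gene_seq, prot_seq = parts
            pairs.append((gene_id, gene_seq, prot_seq))
    return pairs
```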

For Assign5 you will implement a seq2seq model in Keras to translate the genes into their corresponding proteins. You should use Google Colab and execute the code on their K80 GPUs. See https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d for a tutorial on using GPUs on Colab. The great advantage is that all of the required Python packages (numpy, keras, tensorflow, etc.) are already available for import on Colab, so you do not have to expend effort provisioning a GPU machine.

A good place to start with the seq2seq model in Keras is https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html. The corresponding code on GitHub can be found at https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py.
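For orientation, here is a minimal character-level encoder-decoder in the style of that lstm_seq2seq.py example, adapted to DNA-to-protein translation. The token counts and latent dimension are illustrative assumptions, not tuned values (e.g., 23 protein tokens assumes 20 amino acids plus start/stop/pad symbols):

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_dna_tokens = 4      # one-hot A, C, G, T
num_prot_tokens = 23    # 20 amino acids + start/stop/pad (assumed)
latent_dim = 256        # illustrative size

# Encoder: read the one-hot DNA sequence, keep only the final states.
encoder_inputs = Input(shape=(None, num_dna_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: predict the protein one residue at a time (teacher forcing),
# initialized with the encoder's final states.
decoder_inputs = Input(shape=(None, num_prot_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_outputs = Dense(num_prot_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

At inference time you would follow the blog post's recipe: reuse the encoder to get states, then decode greedily one residue at a time.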

For Assign5, you need to modify the code to run on our data, and to add an evaluation on a test set. That is, split the above gene-protein pairs into a training and a test set. Use the first 90% of the pairs for training, and keep aside the last 10% for testing. Further, during training, you may choose to hold out 20% of the training data for validation, i.e., to see how well your model does on the validation (or development) set, and to tweak the model parameters to get the best model. Once you have a reasonable model, you will do a final evaluation on the test set.
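The split described above can be sketched as follows (the helper name is illustrative; the validation fraction is simply passed to Keras at fit time):

```python
def split_pairs(pairs, test_frac=0.1):
    """First 90% of pairs for training, last 10% held out for testing."""
    n_train = int(len(pairs) * (1.0 - test_frac))
    return pairs[:n_train], pairs[n_train:]

# During training, the 20% validation split can be requested directly:
# model.fit([enc_in, dec_in], dec_target, validation_split=0.2, ...)
```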

One challenge of applying seq2seq to our data is that the gene sequences can be very long (up to length 15000), which will make off-the-shelf learning very hard. There are several ways to address this:

  1. Keep only those gene-protein pairs where the genes are no longer than 2000 bases (or 3000 bases).
  2. Use only the first 60-99 bases of the genes (and the corresponding 20-33 residues of the proteins) to train your initial models and get a feel for the parameters.
  3. Chop each of the gene-protein pairs into shorter non-overlapping fragments of length 60-100, and then train the model on this new dataset of gene-protein fragments.
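The three strategies above can be sketched as follows. The fragmenting code assumes each gene is codon-aligned with its protein (3 DNA bases per residue); all function names are illustrative:

```python
def filter_by_length(pairs, max_gene_len=2000):
    """Strategy 1: drop pairs whose gene exceeds max_gene_len bases."""
    return [(g, p) for g, p in pairs if len(g) <= max_gene_len]

def truncate(pairs, gene_len=60):
    """Strategy 2: keep only the first gene_len bases (gene_len/3 residues)."""
    return [(g[:gene_len], p[:gene_len // 3]) for g, p in pairs]

def fragment(pairs, frag_len=60):
    """Strategy 3: chop into non-overlapping, codon-aligned fragments."""
    frags = []
    for g, p in pairs:
        for i in range(0, len(g) - frag_len + 1, frag_len):
            frags.append((g[i:i + frag_len], p[i // 3:(i + frag_len) // 3]))
    return frags
```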

For both the assignment and the project you can make use of any public code to implement the ideas (but document each source in your report).


Assign5: What is Due

The goal of Assign5 is to get you started with the Google Colab platform, implement the basic seq2seq model in Keras, and evaluate its performance on the test set. For the test set, report the loss and the accuracy of the predicted (translated) protein, i.e., how many positions match between the true and predicted protein sequences for a given gene sequence. You can also compute the alignment score using the BLOSUM62 matrix and report that.
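The per-residue match accuracy described above can be computed as a simple position-wise comparison; the optional BLOSUM62 score is sketched here assuming Biopython is available (both function names are illustrative):

```python
def protein_accuracy(true_seq, pred_seq):
    """Fraction of positions where the predicted residue matches the true one."""
    if not true_seq:
        return 0.0
    matches = sum(t == p for t, p in zip(true_seq, pred_seq))
    return matches / len(true_seq)

def blosum62_score(true_seq, pred_seq):
    """Position-wise BLOSUM62 score (assumes Biopython is installed)."""
    from Bio.Align import substitution_matrices
    blosum62 = substitution_matrices.load("BLOSUM62")
    return sum(blosum62[t, p] for t, p in zip(true_seq, pred_seq))
```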

Submit your script and training and testing performance on the yeast data.


Project: What is Due

For the project you need to try various enhancements to the basic seq2seq model to improve its performance. Some ideas to try include:

  1. Attention: add an attention mechanism to the seq2seq model. See https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/
  2. Stacking: Try multilayer encoders and decoders, e.g., 2 or more LSTM layers for the encoder, and 2 or more LSTM layers for the decoder.
  3. Bidirectionality: Consider using biLSTMs
  4. Embeddings: try better encoding of the DNA and protein sequences by adding embedding layers (e.g., using word2vec approach)
  5. Alternative models: Try CNNs (1D convolution layers with different window sizes, etc.)

Your goal is to try several of these ideas and evaluate which ones improve performance and by how much. There will be an update due on April 17th. Finally, you will write a 4-5 page report detailing your approach and findings, due on April 24th. You will also present your project in class on April 26th (the last day of classes).

Page last modified on April 03, 2019, at 04:08 PM