Assign3: Protein Embeddings
Due Date: Feb 22, before midnight (EST 11:59:59PM)
In this assignment, you will implement the ProtVec embedding method described in the paper Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
Your code should implement the negative sampling approach to train the embeddings. For training the model, you can first use a small set of 1000 proteins small_uniprot.txt. Once your model is finalized you should train it on the large set of 524532 protein sequences uniprot-reviewed-lim_sequences.txt. This data is from the paper Learned protein embeddings for machine learning.
You should compare with the default values used in the paper, namely embedding dimensionality $d=100$, window size $w=25$, and k-mer/n-gram size $k=3$, and the number of negative samples per positive example $q=5$.
Here is the pseudo code for the overall method:
create the vocab, probability distribution, and word to index (and reverse mappings) for each k-mer in each sequence at each of the offsets: 0, 1, 2. write a function to return a batch of positive and negative pairs from all of the sequences/offsets. for the negative sampling use the cumsum approach to sample random words define NN model: init function: define the two embeddings layers (T, C) forward function: input is a batch of target_words, and context_words look up their embeddings compute the dot product between corresponding pairs output should be the probability of that pair being a positive pair (via sigmoid), or leave it as logits Next is the boilerplate code for training: net = model(parameters) send net to GPU loss_func = BCE loss (with logits) optimizer = optim.Adam for e in epochs for target_words, context_words, labels in batches send batch data to the GPU net.zero_grad preds = net (target_words, contex_words) loss = loss_func(preds, labels) loss.backward optimizer.step print total loss per epoch save embeddings visualize embeddings via t-SNE
Submit your jupyter notebook (or python script) via submitty.
The output of your code should be a file that contain the embedding of each word. The first line should have only two values:
where V is the Vocab size, and d the EMBED_DIM
Next, each line should contain:
where WORD is a word from your vocab (not the index), and its embedding vector. For example, if there are only two words in your vocab (say AA and BB), and you are doing 3-dim embeddings, then the output file will be:
2 3 AA -1 -0.3 5 BB 2 0.5 -1
After learning the representations, we will use the trained vectors for some downstream tasks in future assignments. For this assignment you should visualize your embeddings using t-SNE.
BTW, you should acknowledge the source of any code you use from the web.