Due Date: Feb 4th, before midnight
In this assignment, you will implement the ProtVec embedding method described in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287 (Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics).
You will use http://www.pytorch.org (pytorch) for the implementation, and your code should implement the negative sampling approach to train the embeddings.
For training the model, you can first use a small set of 1000 proteins http://www.cs.rpi.edu/~zaki/MLIB/assignments/small_uniprot.txt. Once your model is finalized you should train it on the large set of 524532 protein sequences http://www.cs.rpi.edu/~zaki/MLIB/assignments/uniprot-reviewed-lim_sequences.txt. This data is from the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6061698/ (Learned protein embeddings for machine learning)
You will be asked to compare the embeddings for different values of the dimensionality $d$, different context size $w$, and different n-gram sizes $n$. For example $d=50, 100, 300$, $w=5, 7, 25$, and $n=1,2,3,4,5$.
You should compare with the default values used in the paper, namely $d=100$, $w=25$, and $n=3$. In your implementation these should be variables that take their values from the command line (see below; you should also make the number of negative samples a parameter).
Here is the pseudo code for the overall structure of the script:
Submit your code via submitty. Name your python script: assign1.py.
Your script will be run as follows:
assign1.py FILENAME EMBED_DIM CONTEXT_SZ NGRAM_SIZE NEG_SAMPLES
Here FILENAME is the name of the sequence file, EMBED_DIM the dimensionality to use for the embedding vectors, CONTEXT_SZ is the size of the context to consider, NGRAM_SZ is the size of the ngrams, and NEG_SAMPLES is the number of negative samples to consider for each positive pair.
Note that CONTEXT_SZ will always be an odd number greater than 1, so CONTEXT_SZ=3 means that you look at the center word and plus/minus one word, CONTEXT_SZ=25 means center word plus/minus 12 words, and so on.
The output of your code should be a file that contain the embbedding of each work. The first line should have only two values:
where V is the Vocab size, and d the EMBED_DIM
Next, each line should contain:
where WORD is a word from your vocab (not the index), and its embedding vector. For example, if there are only two words in your vocab (say AA and BB), and you are doing 3-dim embeddings, then the output file will be:
AA -1 -0.3 5
BB 2 0.5 -1
After learning the representations, you will be asked to use the trained vectors for a downstream task such as sequence classification or secondary structure prediction. The details of the tasks will be posted for Assign2
Note that you are allowed to use/modify existing implementations of word2vec in pytorch on the web, but you should understand what is being done, so that you are able to code more complex models later. You should acknowledge the source of any code you use.