Sequence Retrieval and Alignment:

Program 1 (GenSeq):
-------------------
1) Write a program to connect to the NCBI GenBank server and retrieve
   the DNA sequence for a given Sequence ID.

2) Parse the GenBank format and extract the DNA and protein sequences
   corresponding to a given protein ID.

3) Use the DNA sequence from above and translate it to a protein
   sequence using the universal genetic code. You will need access to
   a different translation table, available at the link under
   transl_table=11 (where 11 is a clickable entry on the GenBank page).
   The translation table is also available at the following link:
   http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=t#SG11

   Keep in mind that if the protein is on the "complementary strand"
   you have to first take its complement and read it in reverse during
   translation. Make sure that your translation matches the one shown
   in the protein "translation" part of the CDS for the gene.

Input format: genseq seqID protID
Output format:
   DNA sequence in 5' to 3' direction
   Translation: a list of Codon -> Amino Acid pairs (one per line)
   Protein Sequence

Program 2 (Align):
------------------
1) Write a program to align any two protein sequences. The program
   should be able to handle global, semi-global and local alignment
   options. It should use the BLOSUM62 matrix, which you can hard code
   inside your program (the BLOSUM62 matrix can be found on the web).
   The program should print the alignment and its score.

Input: align seq1 seq2 gap_penalty, plus an alignment option X,
       where X will be one of: local, semiglobal or global
Output: alignment of seq1 with seq2, followed by total score
e.g.
FAG_CS_IL
FG_KCS_IK
Score: 100

Actual Run and Suggestions:
---------------------------
1) Run Prog 1 on the following two sequences and their proteins:
   a) Bacteriophage Lambda, accession number NC_001416
      CDS with protein_id NP_040628.1
   b) Enterobacteria Phage P22, accession number NC_002371
      Gene "c2" with protein_id NP_059606.1
   so the first run will be:
      genseq NC_001416 NP_040628.1
   and the second run will be:
      genseq NC_002371 NP_059606.1

2) Run Prog 2 on the protein sequences from 1a) and 1b) with a gap
   penalty of -10 and try all three alignment options (local, global,
   semiglobal).

3) Before doing anything it would be best to go to NCBI and do some
   web searches by hand, and look at the specific regions of interest.

4) It would be very easy to code Program 1 in bioperl.

5) You can use any language for Program 2.

What to Submit:
---------------
1) Due date: Monday, 10th Feb 2003

2) Give me a hard copy of your output in class on Monday.

3) Zip or tar the entire directory with your source code and email it
   to me. Make sure you DO NOT send me the bioperl installation in
   case you use bioperl. I only want your programs. Email me the
   directory BEFORE class on Monday.

Collaboration/Web Resources:
----------------------------
1) This is an individual project. While you can discuss an approach
   with your classmates, all coding should be your own. No non-trivial
   similarity in code will be tolerated.

2) You may not use any resource/code from the web for the alignment
   program. It should be your own implementation. Do not use
   implementations provided by BTL or bioperl.

Questions:
----------
Send questions to zaki.AT.cs.rpi.edu (replace AT with @) if something
is unclear.
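
As a rough illustration of step 1 of Program 1 (retrieving a record by Sequence ID), here is a minimal Python sketch. It assumes the NCBI E-utilities efetch service with the nuccore database; the handout does not prescribe this route, and the function names genbank_url and fetch_genbank are made up for the example. Some records may need rettype "gbwithparts" rather than "gb" to include the full sequence.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the E-utilities efetch endpoint is used to reach GenBank.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(seq_id):
    """Build an efetch URL asking for a GenBank flat-file record as text."""
    params = {"db": "nuccore", "id": seq_id, "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_genbank(seq_id):
    """Download the record as a string (requires network access)."""
    with urlopen(genbank_url(seq_id)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_genbank("NC_001416") should return the flat-file text, which can then be parsed for the CDS with the requested protein_id.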
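
For step 3 of Program 1 (translation, including the complementary-strand case), the following Python sketch may help. It assumes the CDS has already been cut out of the full sequence, and it hard codes the standard amino acid assignments; as far as I know table 11 assigns the same amino acids and differs only in which codons may start a protein, but this should be checked against the linked table. The sketch does not handle an alternative start codon, which the "translation" field shows as M. The names reverse_complement and translate are illustrative only.

```python
from itertools import product

# Codons enumerated in TCAG order, paired with their amino acids
# (standard assignments; verify against transl_table=11 before relying on it).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse, giving the other strand 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(cds):
    """Return (codon, amino acid) pairs, stopping at the first stop codon."""
    pairs = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        pairs.append((codon, aa))
    return pairs


if __name__ == "__main__":
    # A gene on the complementary strand: reverse-complement first.
    cds = reverse_complement("TTAGGCCAT")
    for codon, aa in translate(cds):
        print(f"{codon} -> {aa}")
    print("".join(aa for _, aa in translate(cds)))
```

The printed codon list and protein string follow the output format requested for genseq.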
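
To make the difference between the three alignment options of Program 2 concrete, here is a small Python sketch that computes only the score, not the alignment itself. It is not a solution: it uses a toy +1/-1 match/mismatch score instead of BLOSUM62, a linear gap penalty, and no traceback. "Semiglobal" is taken here to mean that gaps at either end of either sequence are free, which is one common definition and an assumption on my part. The name align_score is hypothetical.

```python
def align_score(a, b, gap, mode, score=lambda x, y: 1 if x == y else -1):
    """Best alignment score of a and b; gap is a negative per-symbol penalty."""
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError("mode must be global, semiglobal or local")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]

    # Global alignment pays for leading gaps; the other two modes start at 0.
    if mode == "global":
        for i in range(1, n + 1):
            H[i][0] = i * gap
        for j in range(1, m + 1):
            H[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                       H[i - 1][j] + gap,                            # gap in b
                       H[i][j - 1] + gap)                            # gap in a
            # Local alignment may restart anywhere, so scores never go below 0.
            H[i][j] = max(best, 0) if mode == "local" else best

    if mode == "global":
        return H[n][m]                     # both sequences fully consumed
    if mode == "semiglobal":
        # Trailing gaps are free: best cell in the last row or last column.
        return max(max(H[n]), max(row[m] for row in H))
    return max(max(row) for row in H)      # local: best cell anywhere
```

For example, align_score("ACGT", "TTACGT", -2, "global") pays for the two unmatched leading symbols, while the semiglobal option does not.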