Assignment 1: Due Date: 25th Jan, 2019, just before midnight

The cytochrome c protein is an iron-containing protein involved in ATP production for energy within cells. In this assignment we will compare the cytochrome c oxidase subunit I gene found in the human mitochondrial (mtDNA) genome with the one found in the cyanobacterium Prochlorococcus marinus. There are two objectives for this assignment:

  • gain familiarity with Biopython sequence manipulation and search tools
  • learn how to program the dynamic programming algorithm for sequence alignment
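For reference, the global alignment recurrence with a linear gap penalty, which Part II asks you to implement, is usually written as below. The symbols are notation chosen here, not taken from the assignment: F(i, j) is the best score for aligning the first i symbols of sequence x with the first j symbols of sequence y, s(a, b) is the score for pairing symbols a and b, and g is the gap penalty, assumed to be given as a negative number (as in the example value gap = -2 used later).

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,g, \qquad F(0,j) = j\,g, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i,\,y_j) \\
F(i-1,\,j) + g \\
F(i,\,j-1) + g
\end{cases}
\end{aligned}
```

For sequences of lengths m and n, the optimal score is F(m, n), and the alignment itself is recovered by tracing back from that cell to F(0, 0).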

You will write a single Python script to accomplish the following tasks.


Part I: Sequence Feature Parsing

Use Biopython's Entrez querying functionality to download the complete genomes for human mtDNA and P. marinus. Their GenBank IDs are KC417443.1 and CP000576.1, respectively. For your reference, here are the links to the complete genomes:

Your goal is to extract both the gene sequence and the protein translation for cytochrome c oxidase subunit I from the Biopython features associated with each genome. This gene is named COX1 in humans and cyoB in the cyanobacterium. Pay close attention to whether the gene is on the main strand or the complementary strand.

Your code should retrieve each complete genome from Entrez GenBank and store it in a file named after its GenBank ID. If the file is already in the local directory, it should not be downloaded again. Next, parse the genome using Biopython's SeqIO routines to extract the gene and protein sequences for the two genes mentioned above. For the gene sequence, look at the gene features; for the protein sequence, look up the CDS features in the GenBank file.
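The download-once and feature-lookup steps described above might be sketched as follows. This is only one possible layout: the function names, the ".gb" file extension, the email placeholder, and the choice of rettype="gbwithparts" (to request the full sequence for the larger genome) are assumptions, not requirements of the assignment. Biopython is imported inside the functions so that the cache check itself needs no network access.

```python
import os


def fetch_genbank(accession, email, directory="."):
    """Return the path to <accession>.gb, downloading it only if missing."""
    path = os.path.join(directory, accession + ".gb")  # ".gb" is an assumed extension
    if os.path.exists(path):
        return path  # cached copy found, so skip the download
    from Bio import Entrez  # only needed when a download actually happens
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gbwithparts", retmode="text")
    with open(path, "w") as out:
        out.write(handle.read())
    handle.close()
    return path


def gene_and_protein(path, gene_name):
    """Return (gene sequence, protein translation) for the named gene."""
    from Bio import SeqIO
    record = SeqIO.read(path, "genbank")
    gene_seq = protein = None
    for feature in record.features:
        if gene_name not in feature.qualifiers.get("gene", []):
            continue
        if feature.type == "gene":
            # extract() reverse-complements features on the complementary strand
            gene_seq = feature.extract(record.seq)
        elif feature.type == "CDS":
            protein = feature.qualifiers["translation"][0]
    return gene_seq, protein
```

A call such as gene_and_protein(fetch_genbank("KC417443.1", "you@example.org"), "COX1") would then return the human gene and protein, assuming the record labels the feature with that gene name.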


Part II: Global Sequence Alignment

In your script, write a subroutine that takes as input two sequences, a scoring scheme, and a (linear) gap penalty. Implement the dynamic programming algorithm for global alignment. For DNA sequences, the scoring scheme will be a simple match score and mismatch score. For protein sequences, use the BLOSUM62 scoring matrix, which is already coded in Biopython as blosum62 in Bio.SubsMat.MatrixInfo. Your code should print the alignment information in the precise format shown below, for two example sequences:

  • seq1: CATAAGCTTCTGACTCTTACCTCCCTCTCTCCTACTCCTGCTCGCATCTGCTATAGTGGAGGCCGGAGCAGGAACAGGTTGAACAG
  • seq2: CGTAGCTTTTTGGTTAATTCCTCCTTCAGGTTTGATGTTGGTAGCAAGCTATTTTGTTGAGGGTGCTGCTCAGGCTGGATGGA

Assuming match score=1, mismatch score=0, gap = -2, the alignment should be printed as follows:

Best Global Alignment:  score= 35.0 len= 86 matches= 41

CATAAGCTTCTGACTCTTACCTCCCTCTCTCCTACTCCTGCTCGCATCTGCTATAGTGGAGGCCGGAGCA
| ||   || ||  |  | ||||| ||     |  |  || | |||   |||||  | |  |  || ||
CGTAGCTTTTTGGTTAATTCCTCCTTCAGGTTTGATGTTGGTAGCA--AGCTAT-TTTGTTGAGGGTGCT

GGAACAGGTTGAACAG
|     | | ||
GCTCAGGCTGGATGGA

As you can see, at most 70 characters/symbols are printed per line. Seq1 is on the top and Seq2 on the bottom. The '|' symbols in the middle denote exact matches. The optimal 'score' is shown, followed by the length 'len' of the alignment, and the number of exact 'matches'. You should code your own algorithm, but you can verify the results using Biopython's built-in Bio.pairwise2.align functions.
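The layout rules above can be sketched independently of the alignment algorithm itself. The helper below is hypothetical (its name and signature are not part of the assignment). It takes two already-aligned strings of equal length, with '-' for gaps, and returns the text in the format shown. Printing the score as a float and stripping trailing spaces from the '|' row are assumptions based on how the sample appears on this page.

```python
def format_alignment(top, bottom, score, width=70):
    """Lay out two gapped, equal-length strings in the assignment's format."""
    assert len(top) == len(bottom), "aligned strings must have equal length"
    # '|' marks a column where both rows hold the same symbol (never a gap).
    bars = "".join("|" if a == b and a != "-" else " "
                   for a, b in zip(top, bottom))
    header = "Best Global Alignment:  score= {} len= {} matches= {}".format(
        float(score), len(top), bars.count("|"))
    lines = [header, ""]
    for start in range(0, len(top), width):
        lines.append(top[start:start + width])
        lines.append(bars[start:start + width].rstrip())  # assumed: no trailing blanks
        lines.append(bottom[start:start + width])
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip("\n")
```

For example, print(format_alignment("ACGT-A", "AC-TTA", 2, width=4)) prints the header line followed by two blocks, the first holding four columns and the second the remaining two.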


What to turn in

Submit your assignment via the computational biology Submitty page.

You should turn in a single Python script named assign1.py. The script should accept as input parameters the match score MATCH, the mismatch score MISMATCH, the DNA gap penalty DGAP, and the protein gap penalty PGAP. The script will be run as:

  • RCSID-hw1.py MATCH MISMATCH DGAP PGAP
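One way to read the four positional parameters is with the standard argparse module, as in the minimal sketch below. The function name parse_args and the choice to convert every value to float are assumptions (the sample output prints the score as 35.0, which suggests floating-point scores).

```python
import argparse


def parse_args(argv=None):
    """Parse MATCH MISMATCH DGAP PGAP from the command line."""
    parser = argparse.ArgumentParser(
        description="Global alignment of the COX1 and cyoB genes and proteins")
    parser.add_argument("match", type=float, metavar="MATCH",
                        help="DNA match score")
    parser.add_argument("mismatch", type=float, metavar="MISMATCH",
                        help="DNA mismatch score")
    parser.add_argument("dgap", type=float, metavar="DGAP",
                        help="gap penalty for the DNA alignment")
    parser.add_argument("pgap", type=float, metavar="PGAP",
                        help="gap penalty for the protein alignment")
    # argv=None makes argparse read sys.argv[1:]
    return parser.parse_args(argv)
```

Because this parser defines no options that look like negative numbers, argparse accepts values such as -2 as positional arguments, so a command line of "1 0 -2 -4" parses without extra quoting.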

Note that the match/mismatch scores and DGAP apply to the DNA sequence alignment, whereas for the protein sequences you have to use the BLOSUM62 scoring matrix and PGAP. You can hard code the two genome IDs (and the gene names). Your script should check whether the genome files are found in the local directory and, if not, download them from Entrez. The script should then extract the gene and protein sequences for the named genes, and print the global alignment between the gene sequences, followed by the global alignment between the protein sequences. Try different values for the match/mismatch and gap scores for the DNA/protein sequences. Show your output on what you consider to be a good scoring scheme and gap penalty. What can you conclude about the DNA and protein similarity?
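Since the same alignment subroutine has to work with both scoring schemes, one option is to hide each scheme behind a function score(a, b). The sketch below shows that idea only. The helper names are hypothetical, and the matrix scorer assumes a dictionary keyed by symbol pairs in which only one of (a, b) and (b, a) may be present, which is how the blosum62 dictionary in Bio.SubsMat.MatrixInfo is laid out.

```python
def make_dna_scorer(match, mismatch):
    """Return score(a, b) for a simple match/mismatch scheme."""
    def score(a, b):
        return match if a == b else mismatch
    return score


def make_matrix_scorer(matrix):
    """Return score(a, b) backed by a dict of (symbol, symbol) -> score.

    Only one ordering of each pair needs to be stored in the dict.
    """
    def score(a, b):
        if (a, b) in matrix:
            return matrix[(a, b)]
        return matrix[(b, a)]  # fall back to the mirrored entry
    return score
```

With the module named in Part II, make_matrix_scorer(MatrixInfo.blosum62) after "from Bio.SubsMat import MatrixInfo" would give the protein scorer. If your Biopython version no longer ships Bio.SubsMat, the same matrix can be loaded with Bio.Align.substitution_matrices.load("BLOSUM62") instead.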

Do not submit the genome files; only the script and output file.


Useful resources

You may use any web tutorials and suggestions to help you parse the genomes and implement the code. However, DO NOT copy the DP code from anywhere on the web. All work must be your own.

Page last modified on January 18, 2019, at 11:39 PM