CMSC 828N Laboratory 3, due Friday, Nov 5.

This lab will involve writing your own, simple, bacterial gene finding
program.  Submit your answers as a SINGLE tarfile, attached to your
email, which should be submitted to the TA before midnight on the due date.  
Please include a file named README explaining what the files are and
how to run your code.   

INPUT

Your target genome is Staphylococcus aureus strain COL from NCBI at:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Staphylococcus_aureus_COL

The first file you need is NC_002951.fna, which is the fasta file of
the genome itself.  You will also need NC_002951.ptt, which is the
"protein translation table," a file listing all the annotated protein-coding
genes.  Staph is a relatively small bacterial genome.  This particular
strain, S. aureus COL, is one of the early methocillin-resistant
(MRSA) strains, a deadly variety of S. aureus that is endemic in hospitals
and many communities around the world.

1. Write a program to find all open reading frames (ORFs) in the
genome.  Remember that this is a CIRCULAR chromosome, so ORFs can span
the endpoint of the string.  Also note that you need to find ORFs on
both the forward and the reverse strand.  The program should accept a
minimum ORF length as a parameter; thus a minimum of 200 means that it
will only identify ORFs that are 200bp or longer.  Output should
include the ORF as a 4-tuple: a unique ID, two coordinates (start
codon position coordinate first), and either "+" or "-" to indicate
whether the ORF is on the forward or reverse strand.
	
For this assignment, an ORF is a maximal-length substring that begins at a
START codon and ends at a STOP, with no STOPs occurring in-frame prior
to the final STOP.  Maximal here means that no ORF is properly
contained in another on the same DNA strand: in other words, use the
start codon that makes the longest ORF.  START codons must be one of
(ATG,GTG) and STOP codons must be one of (TAA,TAG,TGA).  NOTE:
coordinates of the ORFs should INCLUDE the stop codon.  Thus if the
start codon (ATG) occurs at positions 101,102,103, and the STOP is at
161,162,163, then the ORF is the closed interval [101,163], and has
length 63.  

Note: use 1-based coordinates for your genomes.  Thus the first base in
the genome is position 1, not zero.

What to turn in:
(a) Your program
(b) A file showing all ORFs of 300bp or longer in the target genome. 
Each line should look like this:
ORF0001 45123 46123 +
This file should be sorted by column 2, with one ORF per line.  Note
that because column 2 indicates the START codon and column 3 the STOP,
column 2 will be larger than column 3 for genes on the reverse strand.

2. Codon usage is a measurement of how frequently each of the 61
possible codons are used.  Compute the "expected" frequencies for 
all 61 codons as follows.  First, collect all ORFs from part 1
that are 750 base pairs in length or longer.  (Ideally we would use a
set of "known" genes to compute these values, but instead we will use
these "long" ORFs as a surrogate.) 

Next, count how many times each codon is used in the entire set of 
long ORFs.  In order to avoid division by zero later, add a pseudo-count
of 1 to each of these codon counts.  Thus all your codon counts
will be at least 1.  Then, to determine the expected frequency of
each codon, divide these counts by the total number of codons 
in these ORFs.  You will now have 61 probabilities, one for each codon.

Next, using the ORF list from part 1, count the occurrences of the 61 
codons for each ORF.  Thus if a codon occurs one time in
an ORF with 200 codons, the count will be 1.  (Don't use pseudo-counts
for this part.)  Then compare each ORF's codon usage to the expected 
frequencies as follows, by computing the chi-squared statistic of 
the difference between the two distributions. The chi-squared statistic is:

	SUM[(observed-expected)^2/expected]

where the SUM is computed over all 61 codons.  The "expected" number
of occurrences for any codon in an ORF is the expected frequency of 
that codon (computed above) times the number of codons in the ORF.
For example, if the expected frequency of CCC is 2%, then for an ORF
with 200 codons you would expect 4 of its codons to be CCC.

Finally, note that a "typical" ORF will have a low chi-squared statistic.

What to turn in:
(a) your program
(b) a file with the 61 expected frequencies for codons.  Each line should have
two values, a codon and its frequency.  Sort this alphabetically by codon.
(c) A file with all the chi-squared scores for the ORFs output in part 1.

For (c), create a file of all ORFs as in part 1, but this time output five
columns: the ORF ID, coordinates, orientation, and the chi-square
statistic for that ORF.  Note that the first four columns are
identical to those for part 1.

3. (Simple gene finder).  For each ORF in the ORF list from parts 1 and 2, 
determine whether or not it is a gene by the following very simple test:
(a) if the ORF doesn't overlap another ORF on either strand, then call 
    it a gene.
(b) if the ORF overlaps other ORFs, and if it has a lower chi-squared 
    statistic than any of its overlapping ORFs, then call it a gene.
Note that you should only need the output of part 2 as input to this program.

Finally, compare these predictions to the annotation in GenBank for the target
genome.  For each ORF in your list, consider it correct if it has the same STOP 
codon as the gene in the ptt file.  Don't worry about the start codons.  
Compute the number of genes from your list that match the annotation.

What to turn in:
(a) your program
(b) A file with all the "genes", in the same format as the output from part 2.
(c) Report the number of genes in your predictions that match annotation as three
values: 
  (c.1) the total count of correctly predicted genes, where "correct" is defined as above
  (c.2) the sensitivity (total count in c.1 divided by number of genes listed 
  in the ptt file)
  (c.3) the precision (total count in c.1 divided by the number of genes you predicted in (b).