CBCB Seminar Series
 
Summer 2005
 
  
10:30 a.m. Tuesday June 7, 2005
Title: Using Text Analysis to Identify Functionally Coherent Gene Groups
 
By: Woei-Jyh (Adam) Lee
 
Venue: A.V. Williams Building Room 3258
 
Abstract:
 
 
The analysis of large-scale genomic information (such as sequence data or 
expression patterns) frequently involves grouping genes on the basis of 
common experimental features. Often, as with gene expression clustering, 
there are too many groups to easily identify the functionally relevant 
ones. One valuable source of information about gene function is the 
published literature. We present a method, neighbor divergence, for 
assessing whether the genes within a group share a common biological 
function based on their associated scientific literature. The method uses 
statistical natural language processing techniques to interpret biological 
text. It requires only a corpus of documents relevant to the genes being 
studied (e.g., all genes in an organism) and an index connecting the 
documents to appropriate genes. Given a group of genes, neighbor 
divergence assigns a numerical score indicating how functionally coherent 
the gene group is from the perspective of the published literature. We 
evaluate our method by testing its ability to distinguish 19 known 
functional gene groups from 1900 randomly assembled groups. Neighbor 
divergence achieves 79% sensitivity at 100% specificity, comparing 
favorably to other tested methods. We also apply neighbor divergence to 
previously published gene expression clusters to assess its ability to 
recognize gene groups that had been manually identified as representative 
of a common function.
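
The evaluation figure above (sensitivity at 100% specificity) can be illustrated with a minimal sketch. It assumes each gene group has already been given a coherence score; the function name and the toy scores are hypothetical and are not taken from the referenced papers. The cutoff is set at the highest score of any random group, so that no random group is called coherent, and sensitivity is the fraction of known functional groups scoring above that cutoff.

```python
def sensitivity_at_full_specificity(functional_scores, random_scores):
    """Fraction of functional groups that score above every random group."""
    # Cutoff at the best-scoring random group: no random group passes,
    # which is what 100% specificity means here.
    cutoff = max(random_scores)
    hits = sum(1 for score in functional_scores if score > cutoff)
    return hits / len(functional_scores)


# Hypothetical coherence scores, for illustration only.
functional = [9.1, 7.4, 0.8, 5.2, 0.3]
random_groups = [0.2, 1.1, 0.9, 0.5]

print(sensitivity_at_full_specificity(functional, random_groups))
```

With 19 functional groups, a reported sensitivity of 79% would correspond to 15 groups scoring above all 1900 random groups (15/19 is about 78.9%), assuming the figure is rounded.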
 
 
References:
 
 
Raychaudhuri S, Chang JT, Imam F, Altman RB, "The 
computational analysis of scientific literature to define and recognize 
gene expression clusters", Nucleic Acids Res. 2003 Aug 
1;31(15):4553-60.
Raychaudhuri S, Altman RB, "A 
literature-based method for assessing the functional coherence of a gene 
group", Bioinformatics. 2003 Feb 12;19(3):396-401.
Raychaudhuri S, Schutze H, Altman RB, "Using text 
analysis to identify functionally coherent gene groups", Genome Res. 
2002 Oct;12(10):1582-90.
Raychaudhuri S, Chang JT, Sutphin PD, Altman RB, "Associating genes 
with gene ontology codes using a maximum entropy analysis of biomedical 
literature", Genome Res. 2002 Jan;12(1):203-14.
 
 
  
10:30 a.m. Tuesday June 21, 2005
Title: Experimental Validation of Exonic Splicing Enhancers in Arabidopsis
 
By: Stephen M. Mount, Ph.D.
 
Venue: Computer Science Instructional Building Room 3118
 
Abstract:
 
 
Exonic splicing enhancers are important signals lying within exons that 
contribute to the accurate selection of splice sites. ESEs are much better 
characterized in animals than in plants. We have developed an in 
planta ESE assay system for the systematic assessment of candidate 
exonic splicing enhancers. Candidate ESEs were generated computationally 
(in collaboration with Salzberg and Pertea at The Institute for Genomic 
Research, ref. 2) using a combination of methods, or derived from genes of 
interest (ref. 1).
 
 
Our system is based on a choice between exon inclusion and exon skipping. 
We have observed that exon inclusion in this assay is completely 
ESE-dependent. Lines differing only in the 9-nt candidate ESE insertion site 
can show either complete skipping or complete inclusion. Not all ESE 
candidates prove to have ESE activity and at least one candidate ESE 
(which was designed based on biochemical results) has activity without 
being computationally predicted. However, in every case, a point mutation 
predicted by a Gibbs sampler analysis (ELPH, ref. 3) to reduce ESE 
activity did so. This was even true for one case in which a barely 
detectable level of inclusion was eliminated by the single nucleotide 
change. Independent T-DNA insertion lines with the same construct show 
consistent results. ESE-dependence can be overcome by improvement of core 
splicing signals.
 
 
References: 
 
 
1. Lewandowska et al. 2004. Determinants of Plant U12-dependent Intron Splicing Efficiency. The Plant Cell 16:1340-1352.
2. Web-based Sequence Evaluator for ESE's
3. The ELPH Home Page
 
 
  
10:30 a.m. Tuesday June 28, 2005
Title: Aggressive Enumeration of Peptide Sequences for Peptide Identification by Tandem Mass Spectrometry
 
By: Nathan Edwards, Ph.D.
 
Venue: Computer Science Instructional Building Room 3118
 
Abstract:
 
 
Peptide identification from tandem mass spectra is a critical part of 
comprehensive proteome analyses. The search engines that analyze these 
spectra, such as Mascot, SEQUEST, X!Tandem, or OMSSA, use amino-acid 
sequence databases, such as SwissProt, to provide putative peptides to 
compare against each spectrum. This approach fails to identify peptides 
missing from the sequence database. We argue that amino-acid sequence 
databases used for peptide identification should be aggressively inclusive 
of potential peptide sequences, rather than conservative, and show that 
this need not increase search engine running times significantly.
 
 
We begin with a whirlwind tour of methods used to construct the popular 
protein sequence databases to understand why peptide sequences might be 
left out. We show how different types of peptide sequence evidence might 
be aggressively integrated into an inclusive peptide sequence database. 
Further, we demonstrate that, when efficiently represented, an inclusive 
amino-acid sequence database of peptides can, in some cases, be smaller 
than sequence databases in common use.
 
 
Ongoing research with Chau-Wen Tseng and Xue Wu.
 
 
  
10:30 a.m. Tuesday July 5, 2005
Title: Feature Generation for Sequences with Application to Splice-Site Prediction
 
By: Rezarta Islamaj
 
Venue: Computer Science Instructional Building Room 3118
 
Abstract:
 
 
In this talk I will present a new approach to feature selection for 
sequence data.
 
 
We identify general feature categories and give construction algorithms 
for them. We show how they can be integrated in a system that tightly 
couples feature construction and feature selection. This integrated 
process, which we refer to as feature generation, allows us to 
systematically search a large space of potential feature sets. We 
demonstrate the effectiveness of our approach for an important component 
of the gene finding problem, splice-site prediction. We show that 
predictive models built using our feature generation algorithm achieve a 
significant improvement in accuracy over existing, state-of-the-art 
approaches.
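
As a rough illustration of coupling feature construction with feature selection, the sketch below grows k-mer features one base at a time and extends only the k-mers that survive selection at the previous length. This is not the speaker's algorithm: the k-mer feature category, the scoring rule (difference in the fraction of positive and negative sequences containing the k-mer), the function name, and the toy sequences are all assumptions made for the example.

```python
def generate_features(pos, neg, max_k=3, keep=4, alphabet="ACGT"):
    """Alternate construction and selection of k-mer presence features."""

    def frac(seqs, kmer):
        # Fraction of sequences that contain the k-mer at least once.
        return sum(kmer in s for s in seqs) / len(seqs)

    def score(kmer):
        # Assumed selection criterion: how differently the two classes
        # contain this k-mer.
        return abs(frac(pos, kmer) - frac(neg, kmer))

    candidates = list(alphabet)  # construction starts from single bases
    selected = []
    for _ in range(max_k):
        # Selection: keep the best-scoring candidates (ties broken by name).
        kept = sorted(candidates, key=lambda f: (-score(f), f))[:keep]
        selected.extend(kept)
        # Construction: extend only the kept features by one base, so the
        # search never enumerates the full space of longer k-mers.
        candidates = [f + base for f in kept for base in alphabet]
    return selected


# Hypothetical toy sequences, for illustration only.
positives = ["AGGTAAGT", "CAGGTGAG", "TGGTAAGA"]
negatives = ["ACCTTACA", "CATCATCC", "TTCACCAT"]

features = generate_features(positives, negatives)
print(features)
```

Because each round builds candidates only from the survivors of the previous round, the number of features scored grows linearly with `max_k` rather than exponentially, which is one simple way a coupled search can cover a large feature space.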
 
 
  
10:30 a.m. Tuesday July 19, 2005
Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part I)
 
By: Jonathan Allen
 
Venue: Computer Science Instructional Building Room 3118
 
Abstract:
 
 
In this talk I will discuss a new approach to gene finding using a gene 
structure annotation database. The computational method is implemented in 
the freely available open-source software package JIGSAW. I will show how 
JIGSAW is an extension to the computational gene finding problem and 
review our current prediction performance in human. In the second half of 
the talk I will discuss the application of gene finding principles to 
develop new computational models of alternative splicing using Hidden 
Markov Models informed by cross-species sequence conservation.
 
 
  
10:30 a.m. Tuesday August 9, 2005
Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part II)
 
By: Jonathan Allen
 
Venue: Computer Science Instructional Building Room 3118
 
Abstract:
 
 
In this talk I will discuss a new approach to gene finding using a gene 
structure annotation database. The computational method is implemented in 
the freely available open-source software package JIGSAW. I will show how 
JIGSAW is an extension to the computational gene finding problem and 
review our current prediction performance in human. In the second half of 
the talk I will discuss the application of gene finding principles to 
develop new computational models of alternative splicing using Hidden 
Markov Models informed by cross-species sequence conservation.
 