CBCB Seminar Series
Summer 2005
10:30 a.m. Tuesday June 7, 2005
Title: Using Text Analysis to Identify Functionally Coherent Gene Groups
By: Woei-Jyh (Adam) Lee
Venue: A.V. Williams Building Room 3258
Abstract:
The analysis of large-scale genomic information (such as sequence data or
expression patterns) frequently involves grouping genes on the basis of
common experimental features. Often, as with gene expression clustering,
there are too many groups to easily identify the functionally relevant
ones. One valuable source of information about gene function is the
published literature. We present a method, neighbor divergence, for
assessing whether the genes within a group share a common biological
function based on their associated scientific literature. The method uses
statistical natural language processing techniques to interpret biological
text. It requires only a corpus of documents relevant to the genes being
studied (e.g., all genes in an organism) and an index connecting the
documents to appropriate genes. Given a group of genes, neighbor
divergence assigns a numerical score indicating how functionally coherent
the gene group is from the perspective of the published literature. We
evaluate our method by testing its ability to distinguish 19 known
functional gene groups from 1900 randomly assembled groups. Neighbor
divergence achieves 79% sensitivity at 100% specificity, comparing
favorably to other tested methods. We also apply neighbor divergence to
previously published gene expression clusters to assess its ability to
recognize gene groups that had been manually identified as representative
of a common function.
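The evaluation figure quoted above (sensitivity at 100% specificity) can be read as follows: set the score threshold at the highest score obtained by any random group, then count the fraction of known functional groups that score above it. The sketch below illustrates only that calculation. The function name and the toy scores are hypothetical and are not taken from the published data.

```python
def sensitivity_at_full_specificity(functional_scores, random_scores):
    """Fraction of functional groups scoring strictly above every random group.

    Placing the threshold at the best random score means no random group is
    called coherent, which is what 100% specificity requires.
    """
    threshold = max(random_scores)
    hits = sum(1 for score in functional_scores if score > threshold)
    return hits / len(functional_scores)


# Toy scores, invented for illustration only (not the published results).
functional = [9.1, 7.4, 6.8, 0.9]
random_groups = [0.2, 1.1, 0.5, 0.7, 1.0]

result = sensitivity_at_full_specificity(functional, random_groups)
print(result)  # 0.75: three of the four toy functional groups exceed 1.1
```

With 19 functional groups, a reported sensitivity of 79% would be consistent with 15 of the 19 groups scoring above all random groups (15/19 ≈ 0.789).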
References:
Raychaudhuri S, Chang JT, Imam F, Altman RB, "The computational analysis
of scientific literature to define and recognize gene expression
clusters", Nucleic Acids Res. 2003 Aug 1;31(15):4553-60.
Raychaudhuri S, Altman RB, "A literature-based method for assessing the
functional coherence of a gene group", Bioinformatics. 2003 Feb
12;19(3):396-401.
Raychaudhuri S, Schutze H, Altman RB, "Using text analysis to identify
functionally coherent gene groups", Genome Res. 2002 Oct;12(10):1582-90.
Raychaudhuri S, Chang JT, Sutphin PD, Altman RB, "Associating genes with
gene ontology codes using a maximum entropy analysis of biomedical
literature", Genome Res. 2002 Jan;12(1):203-14.
10:30 a.m. Tuesday June 21, 2005
Title: Experimental Validation of Exonic Splicing Enhancers in Arabidopsis
By: Stephen M. Mount, Ph.D.
Venue: Computer Science Instructional Building Room 3118
Abstract:
Exonic splicing enhancers are important signals lying within exons that
contribute to the accurate selection of splice sites. ESEs are much better
characterized in animals than in plants. We have developed an in
planta ESE assay system for the systematic assessment of candidate
exonic splicing enhancers. Candidate ESEs were generated computationally
(in collaboration with Salzberg and Pertea at The Institute for Genomic
Research, ref. 2) using a combination of methods or derived from genes of
interest (ref. 1).
Our system is based on a choice between exon inclusion and exon skipping.
We have observed that exon inclusion in this assay is completely
ESE-dependent. Lines differing only in the 9-nt candidate ESE insertion
site can show either complete skipping or complete inclusion. Not all ESE
candidates prove to have ESE activity, and at least one candidate ESE
(which was designed based on biochemical results) has activity without
being computationally predicted. However, in every case, a point mutation
predicted by a Gibbs sampler analysis (ELPH, ref. 3) to reduce ESE
activity did so. This was true even for one case in which a barely
detectable level of inclusion was eliminated by the single-nucleotide
change. Independent T-DNA insertion lines with the same construct show
consistent results. ESE-dependence can be overcome by improvement of core
splicing signals.
References:
1. Lewandowska et al. 2004. Determinants of Plant U12-dependent Intron
Splicing Efficiency. The Plant Cell 16:1340-1352.
2. Web-based Sequence Evaluator for ESEs
3. The ELPH Home Page
10:30 a.m. Tuesday June 28, 2005
Title: Aggressive Enumeration of Peptide Sequences for Peptide Identification by Tandem Mass Spectrometry
By: Nathan Edwards, Ph.D.
Venue: Computer Science Instructional Building Room 3118
Abstract:
Peptide identification from tandem mass spectra is a critical part of
comprehensive proteome analyses. The search engines that analyze these
spectra, such as Mascot, SEQUEST, X!Tandem, or OMSSA, use amino-acid
sequence databases, such as SwissProt, to provide putative peptides to
compare against each spectrum. This approach fails to identify peptides
missing from the sequence database. We argue that amino-acid sequence
databases used for peptide identification should be aggressively inclusive
of potential peptide sequences, rather than conservative, and show that
this need not increase search engine running times significantly.
We begin with a whirlwind tour of methods used to construct the popular
protein sequence databases to understand why peptide sequences might be
left out. We show how different types of peptide sequence evidence might
be aggressively integrated into an inclusive peptide sequence database.
Further, we demonstrate that, when efficiently represented, an inclusive
amino-acid sequence database of peptides can, in some cases, be smaller
than sequence databases in common usage.
Ongoing research with Chau-Wen Tseng and Xue Wu.
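One way to see why a peptide-level database can be compact is that closely related protein sequences share most of their peptides. The sketch below is an illustration under stated assumptions, not the representation used by the speaker: it applies the standard trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) and stores each distinct peptide once. The function name and the toy sequences are hypothetical.

```python
def tryptic_peptides(protein, min_len=4):
    """In silico tryptic digest: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_len]


# Two toy sequences differing in a single peptide (invented for illustration).
variants = ["AAAAKCCCCRDDDDKEEEE", "AAAAKCCCCRFFFFKEEEE"]

unique_peptides = set()
for sequence in variants:
    unique_peptides.update(tryptic_peptides(sequence))

protein_residues = sum(len(s) for s in variants)          # 38
peptide_residues = sum(len(p) for p in unique_peptides)   # 24
print(len(unique_peptides), protein_residues, peptide_residues)
```

In this toy case the two sequences contribute five distinct peptides totalling 24 residues, against 38 residues when both sequences are stored in full.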
10:30 a.m. Tuesday July 5, 2005
Title: Feature Generation for Sequences with Application to Splice-Site Prediction
By: Rezarta Islamaj
Venue: Computer Science Instructional Building Room 3118
Abstract:
In this talk I will present a new approach to feature selection for
sequence data.
We identify general feature categories and give construction algorithms
for them. We show how they can be integrated in a system that tightly
couples feature construction and feature selection. This integrated
process, which we refer to as feature generation, allows us to
systematically search a large space of potential feature sets. We
demonstrate the effectiveness of our approach for an important component
of the gene finding problem, splice-site prediction. We show that
predictive models built using our feature generation algorithm achieve a
significant improvement in accuracy over existing, state-of-the-art
approaches.
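The coupling of feature construction and feature selection described above can be sketched generically. The example below is a minimal illustration, not the speaker's algorithm: it constructs two hypothetical feature categories (position-specific bases and k-mers) and, as a stand-in selection criterion, ranks features by the difference in their frequency between positive and negative examples. All names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def construct_features(seq, k=2):
    """Binary features for one sequence: position-specific bases and k-mers."""
    features = {f"pos{i}={base}" for i, base in enumerate(seq)}
    features |= {f"kmer={seq[i:i + k]}" for i in range(len(seq) - k + 1)}
    return features


def select_features(positives, negatives, top_n=3, k=2):
    """Keep the top_n features whose frequency differs most between classes."""
    pos_counts = Counter(f for s in positives for f in construct_features(s, k))
    neg_counts = Counter(f for s in negatives for f in construct_features(s, k))

    def score(feature):
        return abs(pos_counts[feature] / len(positives)
                   - neg_counts[feature] / len(negatives))

    candidates = sorted(set(pos_counts) | set(neg_counts))
    return sorted(candidates, key=score, reverse=True)[:top_n]


# Toy windows (invented): positives end in the donor dinucleotide GT.
positives = ["AGGT", "CAGT", "TGGT"]
negatives = ["AGCA", "CATC", "TGAG"]

selected = select_features(positives, negatives)
print(selected)
```

On this toy data the three selected features are the GT dinucleotide and the two positions it occupies, since each is present in every positive window and in no negative window.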
10:30 a.m. Tuesday July 19, 2005
Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part I)
By: Jonathan Allen
Venue: Computer Science Instructional Building Room 3118
Abstract:
In this talk I will discuss a new approach to gene finding using a gene
structure annotation database. The computational method is implemented in
the freely available open-source software package JIGSAW. I will show how
JIGSAW extends the computational gene finding problem and review our
current prediction performance on human sequence. In the second half of
the talk I will discuss the application of gene finding principles to
develop new computational models of alternative splicing using hidden
Markov models informed by cross-species sequence conservation.
10:30 a.m. Tuesday August 9, 2005
Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part II)
By: Jonathan Allen
Venue: Computer Science Instructional Building Room 3118
Abstract:
In this talk I will discuss a new approach to gene finding using a gene
structure annotation database. The computational method is implemented in
the freely available open-source software package JIGSAW. I will show how
JIGSAW extends the computational gene finding problem and review our
current prediction performance on human sequence. In the second half of
the talk I will discuss the application of gene finding principles to
develop new computational models of alternative splicing using hidden
Markov models informed by cross-species sequence conservation.