CBCB Seminar Series


Summer 2005



10:30 a.m. Tuesday June 7, 2005

Title: Using Text Analysis to Identify Functionally Coherent Gene Groups
By: Woei-Jyh (Adam) Lee
Venue: A.V. Williams Building Room 3258
Abstract:

The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how functionally coherent the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
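
The evaluation figure quoted above (sensitivity at 100% specificity) can be illustrated with a minimal sketch. It assumes that a higher score means a more coherent group, and the function name and scores are made up for illustration; it does not implement neighbor divergence itself. With 19 functional groups, 15 groups outscoring every random group would correspond to roughly 79%, although the abstract does not state the count.

```python
def sensitivity_at_full_specificity(functional_scores, random_scores):
    """Fraction of functional groups scoring above every random group."""
    if not functional_scores or not random_scores:
        raise ValueError("both score lists must be non-empty")
    # 100% specificity: the cutoff must reject every random group,
    # so it sits at the highest random-group score.
    cutoff = max(random_scores)
    detected = sum(1 for score in functional_scores if score > cutoff)
    return detected / len(functional_scores)


# Made-up scores for illustration only (assumption, not data from the paper):
# 19 functional groups, 15 of which outscore all 1900 random groups.
functional = [10.0] * 15 + [1.0] * 4
random_groups = [i / 1900 * 5.0 for i in range(1900)]
print(round(sensitivity_at_full_specificity(functional, random_groups), 2))  # 0.79
```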

References:

  • Raychaudhuri S, Chang JT, Imam F, Altman RB, "The computational analysis of scientific literature to define and recognize gene expression clusters", Nucleic Acids Res. 2003 Aug 1;31(15):4553-60.
  • Raychaudhuri S, Altman RB, "A literature-based method for assessing the functional coherence of a gene group", Bioinformatics. 2003 Feb 12;19(3):396-401.
  • Raychaudhuri S, Schutze H, Altman RB, "Using text analysis to identify functionally coherent gene groups", Genome Res. 2002 Oct;12(10):1582-90.
  • Raychaudhuri S, Chang JT, Sutphin PD, Altman RB, "Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature", Genome Res. 2002 Jan;12(1):203-14.



    10:30 a.m. Tuesday June 21, 2005

    Title: Experimental Validation of Exonic Splicing Enhancers in Arabidopsis
    By: Stephen M. Mount, Ph.D.
    Venue: Computer Science Instructional Building Room 3118
    Abstract:

    Exonic splicing enhancers (ESEs) are important signals lying within exons that contribute to the accurate selection of splice sites. ESEs are much better characterized in animals than in plants. We have developed an in planta ESE assay system for the systematic assessment of candidate exonic splicing enhancers. Candidate ESEs were generated computationally (in collaboration with Salzberg and Pertea at The Institute for Genomic Research, ref. 2) using a combination of methods, or were derived from genes of interest (ref. 1).

    Our system is based on a choice between exon inclusion and exon skipping. We have observed that exon inclusion in this assay is completely ESE-dependent. Lines differing only in the 9 nt. candidate ESE insertion site can show either complete skipping or complete inclusion. Not all ESE candidates prove to have ESE activity, and at least one candidate ESE (which was designed based on biochemical results) has activity without being computationally predicted. However, in every case, a point mutation predicted by a Gibbs sampler analysis (ELPH, ref. 3) to reduce ESE activity did so. This was true even for one case in which a barely detectable level of inclusion was eliminated by the single-nucleotide change. Independent T-DNA insertion lines with the same construct show consistent results. ESE dependence can be overcome by improvement of core splicing signals.

    References:

  1. Lewandowska et al. 2004. Determinants of Plant U12-dependent Intron Splicing Efficiency. The Plant Cell 16:1340-1352.
  2. Web-based Sequence Evaluator for ESEs
  3. The ELPH Home Page



    10:30 a.m. Tuesday June 28, 2005

    Title: Aggressive Enumeration of Peptide Sequences for Peptide Identification by Tandem Mass Spectrometry
    By: Nathan Edwards, Ph.D.
    Venue: Computer Science Instructional Building Room 3118
    Abstract:

    Peptide identification from tandem mass spectra is a critical part of comprehensive proteome analyses. The search engines that analyze these spectra, such as Mascot, SEQUEST, X!Tandem, or OMSSA, use amino-acid sequence databases, such as SwissProt, to provide putative peptides to compare against each spectrum. This approach fails to identify peptides missing from the sequence database. We argue that amino-acid sequence databases used for peptide identification should be aggressively inclusive of potential peptide sequences, rather than conservative, and show that this need not increase search engine running times significantly.

    We begin with a whirlwind tour of the methods used to construct the popular protein sequence databases, to understand why peptide sequences might be left out. We show how different types of peptide sequence evidence might be aggressively integrated into an inclusive peptide sequence database. Further, we demonstrate that, when efficiently represented, an inclusive amino-acid sequence database of peptides can, in some cases, be smaller than sequence databases in common usage.
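
    One simple way to see how an inclusive peptide collection can stay compact is sketched below. The abstract does not describe the representation used in the talk, so this is only an assumed reading: each distinct peptide is stored once, even when several protein variants contain it. The cleavage rule (cut after K or R unless the next residue is P), the function name, and both sequences are hypothetical and chosen for illustration.

```python
def tryptic_peptides(protein, min_length=6):
    """Cut after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the trailing piece that does not end in K or R.
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_length]


# Two made-up variants of one protein that differ in a single peptide.
isoform_a = "MAGTLLSEEKVNDWQAFIRSPLTGHEYDKAAVLLNGTK"
isoform_b = "MAGTLLSEEKVNDWQAFIRTTPEGWLSAAKAAVLLNGTK"

# A set stores each shared peptide only once.
unique_peptides = set(tryptic_peptides(isoform_a)) | set(tryptic_peptides(isoform_b))
full_length = len(isoform_a) + len(isoform_b)
compact_length = sum(len(p) for p in unique_peptides)
print(len(unique_peptides), full_length, compact_length)
```

    In this toy case the peptides shared by both variants are counted once, so the deduplicated collection holds fewer residues than the two full-length sequences together.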

    Ongoing research with Chau-Wen Tseng and Xue Wu.


    10:30 a.m. Tuesday July 5, 2005

    Title: Feature Generation for Sequences with Application to Splice-Site Prediction
    By: Rezarta Islamaj
    Venue: Computer Science Instructional Building Room 3118
    Abstract:

    In this talk I will present a new approach to feature selection for sequence data.

    We identify general feature categories and give construction algorithms for them. We show how they can be integrated in a system that tightly couples feature construction and feature selection. This integrated process, which we refer to as feature generation, allows us to systematically search a large space of potential feature sets. We demonstrate the effectiveness of our approach for an important component of the gene finding problem, splice-site prediction. We show that predictive models built using our feature generation algorithm achieve a significant improvement in accuracy over existing, state-of-the-art approaches.


    10:30 a.m. Tuesday July 19, 2005

    Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part I)
    By: Jonathan Allen
    Venue: Computer Science Instructional Building Room 3118
    Abstract:

    In this talk I will discuss a new approach to gene finding using a gene structure annotation database. The computational method is implemented in the freely available open-source software package JIGSAW. I will show how JIGSAW is an extension to the computational gene finding problem and review our current prediction performance in human. In the second half of the talk I will discuss the application of gene finding principles to develop new computational models of alternative splicing using Hidden Markov Models informed by cross-species sequence conservation.


    10:30 a.m. Tuesday August 9, 2005

    Title: Assigning biological function to genomic sequence: adventures in computational gene finding and alternative splicing (Part II)
    By: Jonathan Allen
    Venue: Computer Science Instructional Building Room 3118
    Abstract:

    In this talk I will discuss a new approach to gene finding using a gene structure annotation database. The computational method is implemented in the freely available open-source software package JIGSAW. I will show how JIGSAW is an extension to the computational gene finding problem and review our current prediction performance in human. In the second half of the talk I will discuss the application of gene finding principles to develop new computational models of alternative splicing using Hidden Markov Models informed by cross-species sequence conservation.