CBCB Seminar Series

Spring 2006

2 p.m. Thursday January 26, 2006

Title: Organizational Meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: To discuss the schedule for Spring 2006.

2 p.m. Thursday February 16, 2006

Title: High-throughput Biology: Genome Assembly and Beyond
Speaker: Mihai Pop, Ph.D.
Venue: Computer Science Instructional Center Room 1115
Abstract:

Computers have become indispensable tools in biological research. The increasing use of high-throughput laboratory experiments has yielded large amounts of data that cannot be managed, let alone analyzed, without the help of specialized software. The integration of computational methods and mathematical analyses into biological research has led to our ability to sequence the DNA of organisms, recognize genes, and begin to unravel the complex interactions that define life itself.

In this presentation I will describe several examples of this close integration between computational science and biology, primarily from my recent work in the field of genome sequencing and assembly. The talk will provide an overview of the biological questions being addressed and will highlight the computational challenges underlying each specific genome analysis task. I will then present the techniques I used to analyze the bacteria present in the human gastrointestinal tract and will conclude with an overview of several exciting ongoing research projects.

(This is a candidate talk.)

2 p.m. Thursday February 23, 2006

Title: Relational life science databases: Lessons from Cognia and NIAID
Speaker: Christopher Larsen, Ph.D. (NIAID Bioinformatics Resource Center)
Venue: Biomolecular Sciences Building #296 Room 3118
Abstract:

Life science databases store and relate millions of bits of information. The data ranges in scale from DNA sequence to protein, organelle, cell, and even tissue and epidemiology. Creating them is a necessary consequence of both the genomics revolution and the long history of research publication.

Dr. Larsen's work has been aimed at integrating all sources of life science data. It has focused on building relational structures to house and query that information. The talk will focus on the successes and pitfalls of the last two efforts, and will also draw guidance from other sources peripherally involved in his work, such as GenBank, Wiley InterScience, GO (the Gene Ontology), SwissProt, BioPerl, and others. The emphasis will be on the problems to be overcome in storing and relating data, and on potential paths for the field to take in the future.

2 p.m. Thursday March 9, 2006

Title: Towards an RNA Splicing Code
Speaker: Christopher Burge, Ph.D. (MIT)
Venue: Computer Science Instructional Center Room 2117
Abstract:

I will describe my lab's progress toward understanding the rules for exon recognition by the RNA splicing machinery in mammals. Current efforts are focused on systematic identification and characterization of sequences that function as exonic and intronic splicing silencers (ESS, ISS) and enhancers (ESE, ISE), using a combination of cell-based and computational screens. The identified splicing regulatory elements are being integrated with statistical models of the core splice site motifs into computer algorithms that simulate RNA splicing specificity. Recently, we have shown that ESS sequences play general roles in splice site definition at both the 5' and 3' splice sites, and we are investigating the mechanisms of this activity. We have also obtained evidence that ESS sequences are likely to control alternative 5' and 3' splice site usage in many exons, a common type of alternative splicing in mammals.

12:30 p.m. Thursday March 16, 2006

Title: Understanding Protein Function on a Genome-scale using Networks
Speaker: Mark B. Gerstein, Ph.D. (MB&B Dept. Yale University)
Venue: Computer Science Instructional Center Room 1115
Abstract:

My talk will be concerned with topics in proteomics, in particular predicting protein function on a genomic scale. We approach this through the prediction and analysis of biological networks -- both of protein-protein interactions and transcription-factor-target relationships. I will describe how these networks can be determined through Bayesian integration of many genomic features and how they can be analyzed in terms of various simple topological statistics.
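As a rough illustration of what Bayesian integration of several features can look like, the sketch below combines per-feature likelihood ratios under a naive independence assumption: posterior odds equal prior odds times the product of the likelihood ratios. The function names and the independence assumption are illustrative choices made here, not details taken from the talk.

```python
import math

def posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds x product of per-feature likelihood ratios.

    Assumes the features are conditionally independent (a naive Bayes
    simplification chosen for this sketch).
    """
    # Summing logs avoids underflow when many small ratios are multiplied.
    log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)
    return math.exp(log_odds)

def odds_to_probability(odds):
    """Convert odds in favor of an interaction into a probability."""
    return odds / (1.0 + odds)
```

Each likelihood ratio here would express how much more often a feature value (for example, co-expression) is seen among known interacting pairs than among non-interacting pairs; a ratio above 1 raises the odds and a ratio below 1 lowers them.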

http://bioinfo.mbb.yale.edu

http://topnet.gersteinlab.org

References:

A Bayesian networks approach for predicting protein-protein interactions from genomic data. R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder, JF Greenblatt, M Gerstein (2003) Science 302: 449-53.

ExpressYourself: A modular platform for processing and visualizing microarray data. NM Luscombe, TE Royce, P Bertone, N Echols, CE Horak, JT Chang, M Snyder, M Gerstein (2003) Nucleic Acids Res 31: 3477-82.

TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics. H Yu, X Zhu, D Greenbaum, J Karro, M Gerstein (2004) Nucleic Acids Res 32: 328-37.

Genomic analysis of regulatory network dynamics reveals large topological changes. NM Luscombe, MM Babu, H Yu, M Snyder, SA Teichmann, M Gerstein (2004) Nature 431: 308-12.

Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. H Yu, NM Luscombe, HX Lu, X Zhu, Y Xia, JD Han, N Bertin, S Chung, M Vidal, M Gerstein (2004) Genome Res 14: 1107-18.

2 p.m. Thursday March 30, 2006

Title: Genome Explorations: Bizarre Bacteria, Exotic Environments, and How They Interact
Speaker: Naomi Ward, Ph.D. (TIGR)
Venue: Biomolecular Science Building Room 3118
Abstract:

Genomics, which explores the biology of organisms through their genetic blueprints, has led us to revise our definitions of microbial entities, reconsider their capabilities, and re-evaluate the microbiological toolbox of methods and approaches. In the breadth of its influence on various subdisciplines of microbiology (e.g., physiology, ecology, host-pathogen relationships), and its interaction with other disciplines (e.g., human and veterinary medicine, agriculture, evolutionary biology, structural biology), the impact of genomics on microbiology has been enormous. Some of the most recently emerging disciplinary interactions (those occurring between genomics, ecology, and taxonomy) will be presented, illustrated by examples from recent projects. These include the predicted marine "opportunitroph" Silicibacter pomeroyi, the morphologically bizarre Hyphomonas neptunium, and Acidobacterium capsulatum, a member of a ubiquitous but poorly understood bacterial phylum. Recent work on the deep-sea microbial communities associated with Alaskan corals and giant tubeworms of the Galapagos Rift will also be presented.

Some papers that may be of interest:

Moran, M. A., A. Buchan, J.M. Gonzalez, J.F. Heidelberg, J. Henriksen, W.B. Whitman, R.P. Kiene, L. Brinkac, M. Lewis, S. Johri, B. Weaver, G. Pai, J.A. Eisen, G. King, M.R. Belas, C. Fuqua, E. Rahe, W. Sheldon, W. Ye, J.M. Carlton, D.A. Rasko, I.T. Paulsen, Q. Ren, S.C. Daugherty, R.T. Deboy, R.J. Dodson, A.S. Durkin, R. Madupu, W.C. Nelson, S.A. Sullivan, M. J. Rosovitz, D.H. Haft, J. Selengut, and N. Ward. 2004. Genome Sequence of Silicibacter pomeroyi reveals adaptations to the marine environment. Nature 432:910-913.

Badger, J.H., J.A. Eisen, and N. Ward. 2005. Genomic analysis of Hyphomonas neptunium contradicts 16S rRNA-based phylogenetic analysis; implications for the taxonomy of the orders Rhodobacterales and Caulobacterales. International Journal of Systematic and Evolutionary Microbiology 55:1021-6.

Ward, N., and C.M. Fraser. 2005. How genomics has affected the concept of microbiology. Current Opinion in Microbiology 8(5):564-71.

Ward, N. 2006. New directions and interactions in metagenomics research. FEMS Microbiology Ecology 55:331-8.

Penn, K., D. Wu, and N. Ward. 2006. Characterization of bacterial communities associated with deep-sea corals on Gulf of Alaska seamounts. Applied and Environmental Microbiology 72(2):1680-3.

2 p.m. Thursday April 6, 2006

Title: Decomposition of overlapping protein complexes: a graph theoretical method for analyzing static and dynamic protein associations
Speaker: Elena Zotenko
Venue: Biomolecular Science Building Room 3118
Abstract:

(joint work with Katia Guimaraes, Raja Jothi, and Teresa Przytycka)

The complexity in biological systems arises not only from various individual protein molecules but also from their organization into systems with numerous interacting partners. In fact, most cellular processes are carried out by groups of proteins that associate together to perform a specific task. Recent advances in high-throughput determination of protein interactions have resulted not only in complete protein interaction maps for several model organisms, such as yeast and fruit fly, but also in more specialized protein interaction maps that include proteins involved in a particular cellular process, such as the NF-κB signaling pathway and the cell cycle.

Protein interactions are routinely represented as graphs or protein interaction networks, with proteins as nodes and interactions as edges. Even though these networks may contain inaccuracies due to experimental errors and may not capture all the complexity of protein interactions in an underlying biological process, the study of their topological properties has become an important tool in searching for general principles that govern the organization of molecular networks. In 1999, Hartwell et al. introduced the notion of a functional module, a group of cellular components and their interactions that can be attributed a specific biological function. The authors also suggested the modular organization of molecular interaction networks, where each functional module involves a small number of cellular components and is autonomous, i.e., its interaction with other modules is limited to a few cellular components.
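The graph representation described above can be sketched in a few lines. This is a minimal illustration, assuming an undirected network given as a list of protein pairs; the function names are hypothetical, and the two statistics shown (degree and local clustering coefficient) are common examples of topological properties rather than the specific measures used in the talk.

```python
from collections import defaultdict

def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

def degree(adjacency, protein):
    """Number of distinct interaction partners of a protein."""
    return len(adjacency.get(protein, ()))

def clustering_coefficient(adjacency, protein):
    """Fraction of pairs of a protein's partners that also interact with each other."""
    neighbors = adjacency.get(protein, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Each edge between two neighbors is seen from both ends, hence the // 2.
    links = sum(1 for u in neighbors for v in adjacency[u] if v in neighbors) // 2
    return 2.0 * links / (k * (k - 1))
```

A protein whose partners all interact with one another has a clustering coefficient of 1.0, which is one simple signal that it may sit inside a tightly connected group.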

I will start my talk with an overview of computational techniques proposed for identification and analysis of functional modules within a protein interaction network. In the second part of my talk I will describe our recent work on identification and representation of functional groups within a functional module. Intuitively, if a functional module performs a function that requires a sequence of steps (as in the case of a signaling pathway), then functional groups are snapshots of protein associations at these steps. The proposed representation helps in understanding the transitions between functional groups and, depending on the nature of the network, is capable of elucidating temporal relations between functional groups. I will conclude my talk by showing the results of applying our method to several protein interaction networks that underlie well-studied cellular processes.

2 p.m. Thursday April 13, 2006

Title: The truly horrific tale of the generation and analysis of the Trichomonas vaginalis genome sequence, a sexually transmitted pathogen of humans
Speaker: Jane Carlton, Ph.D. (TIGR)
Venue: Biomolecular Science Building Room 3118
Abstract:

Trichomonas vaginalis, a human extracellular parasite of the urogenital tract, is the most prevalent non-viral, sexually transmitted parasite found in North America, where it is responsible for approximately 5 million cases of trichomoniasis annually. In addition to its prevalence, infection with T. vaginalis is emerging as one of the most important cofactors in amplifying HIV transmission, and in contributing to low birth weight, stillbirth and neonatal death. A project to sequence the genome of T. vaginalis at TIGR was funded in 2002 by the NIAID, NIH. At 7.2-fold coverage the genome sequence is providing insights into the parasite's extraordinary biology. More than one third of the ~160 megabase genome consists of highly similar copies of transposable elements and repeats, indicative of a recent genome expansion that may have occurred during the transition of the parasite from an enteric to a urogenital environment. Selected amplification of many gene families has occurred, including massive amplification of genes coding for cell surface molecules predicted to be involved in pathogenesis. An unusual pathway for cysteine biosynthesis has been identified. Genes coding for trichopores, lytic pore-forming proteins, have been identified. Finally, lateral gene transfer of bacterial genes, also predicted to have been transferred in another lumenal parasite, has helped to shape the unique metabolism of the parasite.

2 p.m. Thursday April 20, 2006

Title: Chromosomal abnormalities underlying mental retardation
Speaker: Jonathan Pevsner, Ph.D. (JHU & KKI)
Venue: Computer Science Instructional Center Room 2117
Abstract:

Mental retardation affects 2-3% of the U.S. population. It is defined by broad criteria including significantly subaverage intelligence, onset by age 18, and impaired function in a group of adaptive skills. Down syndrome (DS), caused by a trisomy of chromosome 21, is the most common genetic cause of mental retardation. We have measured the effects of trisomy 21 on transcription and translation, based on studies of gene and protein expression in the developing brain and heart. In a parallel approach, we have analyzed chromosomal abnormalities underlying mental retardation and other disorders. In particular we have identified chromosomal anomalies such as microdeletions and microduplications in Down syndrome and other mental retardation cases through the analysis of single nucleotide polymorphisms (SNPs). We developed SNPscan, a web-accessible tool to analyze and visualize chromosomal abnormalities from SNP data.

2 p.m. Thursday April 27, 2006

Title: Sequence Polymorphism Detection and Analysis
Speaker: Jim C. Mullikin, Ph.D. (NIH/NHGRI)
Venue: Biomolecular Science Building Room 3118
Abstract:

Most single nucleotide polymorphism (SNP) discovery across the human genome, available through dbSNP, has been accomplished by random shotgun sequencing of additional individuals and comparing those sequences to the reference genome using my software package called ssahaSNP. The ssahaSNP package combines a very fast "Sequence Search and Alignment by Hashing Algorithm" (SSAHA) with SNP detection based on the Neighborhood Quality Standard (NQS). Understanding the SNP discovery process is important in many downstream analyses; therefore, I will describe the various phases of the SNP discovery process. The International Haplotype Map (HapMap) Project drew from this increasing pool of publicly available SNPs, and now provides a dataset of nearly 4 million SNPs successfully genotyped across 270 individuals, i.e., over one billion genotypes. This combination of SNP discovery and genotyping provides an amazing resource for further analysis. I will show some examples of how to access these data and some analyses I have performed.
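The two stages named above, hash-based sequence search followed by quality-aware SNP detection, can be illustrated with a toy sketch. This is not the ssahaSNP implementation: the function names are hypothetical, the k-mer index is a generic seed lookup, and the quality thresholds (20 for the candidate base, 15 for five flanking bases on each side) are assumed values chosen for illustration. The neighborhood check here looks at base quality only and leaves out any further criteria the real method applies.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map each non-overlapping k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(reference) - k + 1, k):
        index[reference[start:start + k]].append(start)
    return index

def seed_offsets(index, read, k):
    """Count, per candidate offset (reference start minus read start), the k-mer hits."""
    votes = defaultdict(int)
    for pos in range(len(read) - k + 1):
        for ref_start in index.get(read[pos:pos + k], ()):
            votes[ref_start - pos] += 1
    return dict(votes)

def passes_neighborhood(quals, pos, min_center=20, min_flank=15, flank=5):
    """Require a high-quality candidate base with high-quality flanks (assumed thresholds)."""
    lo, hi = pos - flank, pos + flank
    if lo < 0 or hi >= len(quals):
        return False  # too close to the read end to have a full neighborhood
    if quals[pos] < min_center:
        return False
    return all(q >= min_flank for i, q in enumerate(quals[lo:hi + 1], lo) if i != pos)

def call_snps(reference, read, quals, offset):
    """Report (reference position, reference base, read base) for trusted mismatches."""
    snps = []
    for pos, base in enumerate(read):
        ref_pos = pos + offset
        if (0 <= ref_pos < len(reference)
                and base != reference[ref_pos]
                and passes_neighborhood(quals, pos)):
            snps.append((ref_pos, reference[ref_pos], base))
    return snps
```

In use, one would pick the offset with the most votes from `seed_offsets` and pass it to `call_snps`; a mismatch surrounded by low-quality bases is then discarded as a likely sequencing error rather than reported as a polymorphism.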

2 p.m. Thursday May 4, 2006

Title: lslink: Enhancing the Semantics of Links in Life Science Data Resources
Speaker: Woei-Jyh (Adam) Lee
Venue: Biomolecular Science Building Room 3118
Abstract:

Web accessible data resources contain an abundance of data on scientific objects such as genes, proteins, sequences, citations, etc. Biologists typically explore these resources by navigating links between entries in data resources (objects) as well as paths (informally, concatenations of links). While these links capture a rich semantics that is often well understood by the scientist, the link itself does not explicitly capture or represent meaning. Consequently, scientists spend significant time following links only to reject many data entries that are reached. The lack of explicit meaning also limits the sharing of this knowledge among groups of scientists who are not in the same specialization. Finally, automated tools such as scripts or mediators that may be used for data gathering and data integration are limited since they have no knowledge of the implicit semantics.

Links between entries in the resources are created for many different reasons. Biologists capture new discoveries of an experiment or study using links, whereas data curators add links to augment, to complete or to make consistent, the knowledge captured among multiple resources. For example, a result reported in a paper in PubMed may lead a curator to insert a link from a data entry in, say, OMIM to this publication in PubMed. Algorithms insert links automatically when discovering similarities between two data items, e.g., to represent sequence similarity following a BLAST search. Manually curated links added by record originators or curators are generally inserted into the database record itself, whereas algorithmically generated links are generally kept in a separate linking table. Thus, the simple unlabeled physical links that are in use today are insufficient to represent such subtle and diverse relationships.

We have addressed this problem by developing a methodology of lslinks between entries in resources. The lslink enhances existing links with a label (meaning). We further develop a data model and a query language that can exploit lslinks while traversing paths through the data resources. Contributions of this research include the following: (1) A methodology that includes information extraction, link generation and link labeling to enhance the semantics of lslinks. (2) An extended example of lslink extraction and labeling where we enhance the link from PubMed entries to markers in the human genome. (3) A proof-of-concept prototype comprising the extraction protocol, a hierarchy of link labels, and an experiment on machine assisted labeling of links.
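To make the idea of a labeled link concrete, the sketch below models a link as a (source, target, label) triple and follows a path by matching a sequence of labels. It is only an illustration of the general idea: the class name, the function name, and the example labels are hypothetical and are not the lslink data model or query language themselves.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledLink:
    source: str  # identifier of the entry the link starts from (hypothetical format)
    target: str  # identifier of the entry the link points to
    label: str   # explicit meaning attached to the link

def follow(links, start, label_path):
    """Return the entries reached from start by following links whose labels match label_path in order."""
    frontier = {start}
    for label in label_path:
        frontier = {
            link.target
            for link in links
            if link.source in frontier and link.label == label
        }
    return frontier
```

With labels attached, a script can ask for only those paths whose meaning it needs (for instance, a disease entry to a publication that reports it, then to a marker the publication mentions) instead of following every physical link and rejecting most of the entries it reaches.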