CBCB Seminar Series

Fall 2005

2 p.m. Wednesday August 31, 2005

Title: Organizational Meeting
By: Stephen M. Mount, Ph.D.
Venue: A.V. Williams Building Room 3258
Abstract: To discuss the schedule for Fall 2005.

2 p.m. Wednesday September 14, 2005

Title: A modular assembly package and applications
By: Mihai Pop, Ph.D.
Venue: Biomolecular Sciences Building #296 Room 3118

Recent developments in assembly algorithms have been a key element in our ability to sequence the genomes of organisms. Most notably, advances in genome assembly algorithms have led to an accelerated completion of the human genome project. Existing assembly programs, such as Celera Assembler and Arachne, are, however, difficult to use and often require user intervention to obtain the best assembly. Furthermore, they are ill-suited to specialized assembly tasks such as the assembly of highly polymorphic genomes (e.g. the sea squirt Ciona savignyi) or the assembly of uncultured organisms directly from the environment (e.g. metagenomic analysis of the bacterial populations in the Sargasso Sea). To address these issues we set out to implement a flexible framework for the development of assembly algorithms. This project, AMOS, provides scientists with well-documented interfaces, a uniform representation of assembly data, and numerous utilities for manipulating and analyzing genome assemblies.

During my talk I will provide you with an overview of the AMOS project. I will also describe examples of the applications of AMOS to comparative genome assembly, metagenomic analyses and heterozygous SNP detection.

2 p.m. Wednesday September 21, 2005

Title: An Introduction to Probabilistic Relational Models for Biological and Clinical Applications
By: Lise Getoor, Ph.D.
Venue: Biomolecular Sciences Building #296 Room 3118

A large portion of real-world biological data is stored in commercial relational database systems. In contrast, most statistical learning methods work only with "flat" data representations. Thus, to apply these methods, we are forced to convert the data into a flat form, thereby losing much of the relational structure present in the data and potentially introducing statistical skew. These drawbacks severely limit the ability of current methods to mine relational databases.

In this talk I will review recent work on probabilistic models, including Bayesian Networks (BNs) and Markov Networks (MNs) and their relational counterparts, Probabilistic Relational Models (PRMs) and Relational Markov Networks (RMNs). I'll briefly describe the development of techniques for automatically inducing PRMs directly from structured data stored in a relational or object-oriented database. These algorithms provide the necessary tools to discover patterns in structured data, and provide new techniques for mining relational data. As we go along, I'll present experimental results in several domains, including a biological domain describing tuberculosis epidemiology, a database of scientific paper author and citation information, and Web data.

2 p.m. Wednesday September 28, 2005

Title: Improving Genome Assemblies without Sequencing
By: Michael Schatz
Venue: Biomolecular Sciences Building #296 Room 3118

Genome assembly is the problem of reconstructing the genome sequence of an organism from a collection of short sequenced reads. An assembly takes the form of contiguous stretches of DNA sequence (contigs) linked together in scaffolds by mate-pair and other information. Genome assembly is scientifically one of the most important areas of bioinformatics research, as an accurate genome sequence is needed for addressing several fundamental biological questions. Unfortunately, it is also one of the most computationally complex: it has been proved NP-hard under various formalisms, and a typical problem has thousands or millions of inputs.

During my talk, I will discuss some of the algorithmic challenges and trade-offs in genome assembly. I will also discuss some computational methods for improving an assembly, which can be applied generally and do not require additional laboratory results. One method was implemented in AutoEditor, which acts as a second-generation base caller to find and correct base-calling errors in reads using the original chromatogram trace and the multiple alignment of reads. A second was implemented in AutoJoiner, which attempts to automatically close gaps between linked contigs, and generally enhance contig quality, by extending the usable portion of reads within an assembly.

2 p.m. Wednesday October 5, 2005

Title: Aggressive Enumeration of Peptide Sequences for Peptide Identification by Tandem Mass Spectrometry
By: Nathan Edwards, Ph.D.
Venue: Biomolecular Sciences Building #296 Room 3118

Peptide identification from tandem mass spectra is a critical part of comprehensive proteome analyses. The search engines that analyze these spectra, such as Mascot, SEQUEST, X!Tandem, or OMSSA, use amino-acid sequence databases, such as SwissProt, to provide putative peptides to compare against each spectrum. This approach fails to identify peptides missing from the sequence database. We argue that amino-acid sequence databases used for peptide identification should be aggressively inclusive of potential peptide sequences, rather than conservative, and show that this need not increase search engine running times significantly.

We begin with a whirlwind tour of peptide identification from tandem mass spectra and methods used to analyze them. We'll quickly cover the traditional methods used to populate protein sequence databases to understand why peptide sequences might be left out. We show how different types of peptide sequence evidence might be aggressively integrated into an inclusive peptide sequence database. Further, we demonstrate that, when efficiently represented, an inclusive amino-acid sequence database of peptides can, in some cases, be smaller than sequence databases in common usage.

Ongoing research with Chau-Wen Tseng and Xue Wu.

2 p.m. Wednesday October 12, 2005

Title: A Methodology to Enhance the Semantics of Links Between PubMed Publications and Markers in the Human Genome
By: Woei-Jyh (Adam) Lee
Venue: Biomolecular Sciences Building #296 Room 3118

Links in life science sources capture important biological knowledge. However, current simple physical link implementations do not explicitly represent this knowledge so that it can be easily shared among scientists. We develop a methodology for link extraction, generation, and labeling to produce an enhanced e-link or an enhanced set of e-links. The e-link associates each existing link with a link label that captures the semantics of the link. We develop a machine-assisted tool for curators to produce e-links, and a search interface for biologists to discover interesting e-links.

Ongoing research with Prof. Louiqa Raschid and Dr. Alex Lash.

2 p.m. Wednesday October 19, 2005

Title: Peptide Identification by Tandem Mass Spectrometry
By: Xue Wu
Venue: Biomolecular Sciences Building #296 Room 3118

Tandem mass spectra and associated database search algorithms are essential proteomic tools for identifying peptides. Even though the algorithms are effective and extensively used for identification, some spectra remain unidentified by these searches. Since the database search approach compares theoretical spectra constructed from database peptides with experimental spectra and assigns scores to the matches, the score function is the key factor that determines the sensitivity and specificity of the search algorithm.

In this talk, I will present my recent research on the sensitivity and specificity of database search score functions. I will briefly describe the most popular database search algorithms (Mascot, X!Tandem and OMSSA), then discuss their differences using experimental results and theoretical analysis.
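
As a rough illustration of what a score function does, the sketch below computes a shared peak count: the number of theoretical peaks that have an experimental peak within a mass tolerance. This is a generic textbook-style score, not the score function of Mascot, X!Tandem or OMSSA; the function name, the tolerance value and the peak lists are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def shared_peak_count(theoretical, experimental, tolerance=0.5):
    """Count theoretical m/z peaks matched by an experimental peak.

    A theoretical peak counts as matched when at least one experimental
    peak lies within +/- tolerance of it. The tolerance is an assumed value.
    """
    experimental = sorted(experimental)
    count = 0
    for mz in theoretical:
        # Index of the first experimental peak >= mz - tolerance.
        i = bisect_left(experimental, mz - tolerance)
        if i < len(experimental) and experimental[i] <= mz + tolerance:
            count += 1
    return count


# Hypothetical peak lists (m/z values), for illustration only.
theoretical_peaks = [100.0, 200.0, 300.0]
experimental_peaks = [100.2, 250.0, 299.7]
print(shared_peak_count(theoretical_peaks, experimental_peaks))
```

A higher count suggests a better match between a candidate peptide and the spectrum; real search engines use considerably more elaborate, often probabilistic, score functions.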

2 p.m. Wednesday October 26, 2005

Title: New Perspectives on Protein Sequences and Structures
By: William R. Pearson, Ph.D. (University of Virginia)
Venue: Biomolecular Sciences Building #296 Room 2118

High-throughput DNA sequencing, fast computers, and effective sequence analysis algorithms have produced a new approach to biological problems - genome biology. Comprehensive sequence databases and effective sequence programs like BLAST and FASTA are now routinely used to annotate bacterial and eukaryotic genomes. While complete genome sequencing is a cost-effective strategy for characterizing an organism, the sequences also allow us to address fundamental questions about protein folding, and the nature of protein space.

In this talk, I will discuss why protein sequence comparison "works" -- why it is an effective strategy for finding homologous proteins with similar structures. I will then extend that observation to ask what the success of protein sequence comparison may tell us about the constraints that allow proteins to fold. Specifically, I will consider several widely held beliefs about protein sequences and structures: (1) a small fraction of the protein's sequence determines the structure; (2) structure comparison programs are much more sensitive for finding distant homologs than sequence comparison; (3) there are strong constraints on protein sequences producing "fold-able" motifs. These beliefs reflect the view that protein folding is very, very hard, and that the "space" of proteins that can fold has been largely explored by nature. I will argue that current, comprehensive sequence and structure information supports a different model of the protein universe, in which known proteins represent a small fraction of the possible proteins.

2 p.m. Wednesday November 2, 2005

Title: Structural views of disease-causing mutations in PAH gene and using genetic evolution to understand diseases' history
By: Zhen Shi
Venue: Biomolecular Sciences Building #296 Room 3118

In the first part, I'll discuss the correlation of phenylketonuria phenotypes with the structural defects caused by mutations in the corresponding protein, phenylalanine hydroxylase (PAH). We propose destabilization of PAH as a major effect of most disease-causing mutations. In the second part, which will be more like a journal club, I want to discuss how evolutionary and population genetics can help us uncover the history of diseases such as cystic fibrosis (CFTR gene), HIV (CCR5) and others.

2 p.m. Wednesday November 9, 2005

Title: Power Laws for Repeat Strings in DNA
By: Suzanne Sindi
Venue: Biomolecular Sciences Building #296 Room 3118

While it is often remarked that genomes contain significant amounts of repetitive sequence, what is meant by repetitive sequence is left ambiguous and generally includes cases where two segments of the genome are very similar but not necessarily identical. To avoid this ambiguity we define a representation of repetitive DNA we call a repeat string.

We observe and discuss power law distributions in the size and frequency of repeat strings in the genomes of C. elegans, A. thaliana and Human chromosome 21. We have developed an iterative evolutionary model that may explain one of the observed power laws.

While the cause of the power laws remains unknown, these results suggest that there may be more statistical structure than previously thought in repetitive DNA. This structure may suggest ways to more efficiently assemble repetitive regions of DNA.

2:30 p.m. Thursday November 17, 2005

Title: Trends in North American Butterfly Populations using the 4th of July Butterfly Counts
By: Leslie Ries
Venue: Biomolecular Sciences Building #296 Room 3118

The 4th of July butterfly counts are a nationwide program to collect data on the distribution and abundance of butterflies. The program started in 1975 with 34 counts across the country, and almost 500 counts now occur each year. In each survey, volunteers count all the butterflies they see in a single day (usually near the 4th of July) in a set location. Although over 6000 surveys have been conducted over the past 30 years, no rigorous analysis of the data has ever been carried out. I recently acquired this data set, and am now analyzing it to discover species that are showing any shifts in populations (either increasing, declining or shifting their range). I will present some of the challenges of working with this large data set, and also present some preliminary results on trends in monarch populations. This analysis focuses on comparing summer breeding abundances of the migrating monarch butterfly with population densities on their overwintering grounds in Mexico. Our results suggest that summer reproduction, rather than winter mortality, drives population trends.

2 p.m. Thursday December 1, 2005

Title: Trimming Vector from Reads
By: Michael Roberts, Ph.D.
Venue: Biomolecular Sciences Building #296 Room 3118

Whole genome shotgun assembly (WGSA) is a method of determining the sequence of a genome. The genome is broken into overlapping pieces, called reads, for which the sequence can be determined. Unfortunately, there is often a contaminant, called vector, on the end of a read. Trimming vector from reads is an important part of WGSA, and has a large impact on the quality of the final assembly. Unfortunately, accurate techniques for vector trimming are not publicly available. We will present a new, accurate method of vector trimming.

2 p.m. Thursday December 8, 2005

Title: Gene Name Normalization using Text Match with Automatically Extracted Synonym Dictionaries
By: Haw-ren Fang
Venue: Biomolecular Sciences Building #296 Room 3118

Gene normalization, a relatively new and unexplored problem, has gradually received attention in recent years. The procedure typically consists of two stages: identifying gene mentions and normalizing gene names. In addition, there are usually a pre-processed synonym dictionary and a final disambiguation stage.

Compared with identifying gene mentions, gene normalization is easier because identification of the textual boundaries of each mention is not required. However, gene normalization requires that the actual gene be detected and reported in its unique normal form. From this point of view, it is harder than identifying gene mentions.

We have built a robot that automatically extracts human gene synonyms from online databases to build our synonym dictionaries (300,000+ entries). In the first stage, a CRF tagger is used to automatically annotate the given abstracts or documents. For the second stage, we compiled various string transformations that can be applied and chained in flexible order, followed by exact or approximate string matching. Our system achieved an F-measure of 0.648 (0.597 precision and 0.709 recall).

Gene normalization has several potential applications, such as biomedical information extraction, database curation, and further text mining. Our first application is relevance search and ranking in the Fable project. Given a proper synonym dictionary, our normalization program readily generalizes to other organisms and name normalization tasks (e.g., malignancy, variation, etc.).

This is joint work with Peter S. White (mentor) et al. at the Children's Hospital of Philadelphia.
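
The reported F-measure is consistent with the balanced F1 score, the harmonic mean of precision and recall. The sketch below checks that arithmetic; it assumes the balanced (beta = 1) form, which the abstract does not state explicitly, and the function name is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract; beta = 1 is an assumption.
print(round(f_measure(0.597, 0.709), 3))  # prints 0.648
```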

11 a.m. Tuesday December 20, 2005

Title: Comparative and functional genomic approaches to the analysis of gene function in three human parasites
By: Najib M. El-Sayed, Ph.D. (TIGR)
Venue: HJ Patterson Hall Room 2242

PS: This is a candidate talk.