CBCB Seminar Series

Spring 2008

2 p.m. Friday January 18, 2008

Title: Algorithms for Gene Finding, Network Alignment, and Ancestral Population Inference
Speaker: Serafim Batzoglou, Ph.D. (Stanford University)
Venue: Biomolecular Science Building Room 3118

Genomics is rich with computational problems where algorithms and statistical methods can have a big impact on data analysis and biological discovery. Here, I will present three such problems.

1. Gene Finding. Given a sequenced genome, the first task is to find the genes. This core bioinformatics problem is still largely open. The set of human genes, for example, has not been finalized. Here, I will present CONTRAST, a gene finder based on a CRF/SVM approach, which is the first tool to show significant improvement in human gene finding by using multiple sequence alignments as informants.

2. Network Alignment. Protein association networks summarize our knowledge of which proteins work together in modules and networks to accomplish complex biological processes. Many global protein interaction networks have been predicted for organisms ranging from bacteria to human. Here, I will present Graemlin, a system for comparing networks across organisms and finding conserved modules - subgraphs of conserved proteins and their associations.

3. Ancestral Population Inference. Projects like HapMap provide whole-genome genotypes for diverse populations. Given a genotyped individual, using such datasets we may attempt to predict the allele-specific population source of the individual's chromosomes. I will present HAPAA, a tool for accomplishing this task. Then, I will show that ancestry inference can accurately extract the source populations of admixtures that happened as far as 20 generations ago, covering much of the modern history of population movements.

2 p.m. Thursday January 31, 2008

Title: Organizational Meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: To discuss the schedule in Spring 2008.

1 p.m. Wednesday February 6, 2008

Title: A New Approach to Protein Structure Prediction
Speaker: Ming Li, Ph.D. (University of Waterloo)
Venue: A.V. Williams Building Room 3258

Protein structure prediction has been a heuristic science. From homology modeling and threading to Monte Carlo fragment assembly, decoy clustering, selection, refinement, and consensus, there is no unified model or theory governing the complete process.

We believe the protein structure prediction problem will only be solved by a simple computational model. We wish to find a single, simple mathematical model that encompasses all of the above paradigms, takes a sequence as input, and converges to a near-native protein structure. We propose such a theory, integrating ideas from fragment assembly, hidden Markov model sampling, and Ramachandran basins. Our initial implementation of this theory, FALCON, converges to near-native structures on short benchmark protein sequences and produces significantly better protein structures than the best programs in this field.

Joint work with: Shuaicheng Li, Dongbo Bu, and Jinbo Xu.

2 p.m. Thursday February 7, 2008

Title: Systems Analysis of Pollen Function: How can bioinformatic and computational approaches reveal insights?
Speaker: Heven Sze, Ph.D.
Venue: Biomolecular Science Building Room 3118

Male fertility depends on the proper development of the male gametophyte, successful pollen germination, tube growth, and delivery of the sperm cells to the ovule. The dynamic interactions between the growing pollen tube and the pistil provide an exemplary network that is amenable to high-throughput approaches for determining gene function at the systems level. Approximately 10% of the Arabidopsis genome is preferentially or specifically expressed in pollen; however, a significant number of these genes have yet to be assigned any function. Furthermore, a comprehensive understanding of genes important for successful fertilization could enhance seed production and enable improved control over plant reproduction.

Several plant biologists are interested in using a multi-pronged and integrated approach to functionally analyze Arabidopsis genes that are specifically or preferentially expressed in pollen. The genes include those of unknown function (1/4) and those representing functional classes likely to mediate pollen tube growth and guidance (signaling, transporters, cytoskeleton, cell wall modification). Proposed approaches will: 1) Identify and analyze genes specifically expressed in pollen; 2) Identify mutants and employ assays to determine the function of each gene in pollen tube growth and guidance; 3) Determine the sub-cellular localization of each protein; 4) Identify protein interaction partners in pollen; and 5) Elucidate networks of co-functional genes using computational approaches that integrate phenotypic, localization, transcriptome, and protein-protein interaction data.

These studies will constitute important steps toward a comprehensive understanding of pollen tube growth and guidance and are essential to reach the goal of understanding the function of all Arabidopsis genes.


  • Honys D, Twell D (2004) Transcriptome analysis of haploid male gametophyte development in Arabidopsis. Genome Biol 5: R85
  • Bock KW, Honys D, Ward JM, Padmanaban S, Nawrocki EP, Hirschi KD, Twell D, Sze H (2006) Integrating membrane transport with male gametophyte development and function through transcriptomics. Plant Physiol 140: 1151-1168

    2 p.m. Thursday February 14, 2008

    Title: A Framework for Discovering Associations from the Annotated Biological Web
    Speaker: Woei-Jyh (Adam) Lee
    Venue: Biomolecular Science Building Room 3118

    During the last decade, biomedical researchers gained access to the entire human genome, reliable high-throughput biotechnologies, and affordable computational resources and network access. In combination, these new tools created a new model for biomedical research that no longer uses computational tools merely to monitor research, but instead exploits these tools to acquire knowledge and make discoveries. We have developed a tool to discover meaningful patterns across resources and ontologies. These patterns, corresponding to associations of pairs of CV (controlled vocabulary) terms, may yield actionable nuggets of previously unknown knowledge. Moreover, the bridge of associations across CV terms will reflect the practice of how scientists annotate data.

    We execute a protocol to follow hyperlinks, extract annotations, and generate background LSLink (Life Sciences Link) datasets of termlinks. We then mine the termlinks to find potentially meaningful associations. We use two classes of metrics to identify significant associations of pairs of CV terms. The first class is based on the LOD (logarithm of the odds) ratio and is a measure of the extent to which a specific association of CV terms deviates from one resulting from chance alone (a random association). The second class of metric is based on the hypergeometric distribution; it gives a quantification of the level of one's surprise at finding over-representation for a particular pair of CV terms in a user dataset, in comparison to the background dataset.
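
    As a rough illustration of the two classes of metrics described above, the following Python sketch computes a log-odds score and a hypergeometric tail probability for a single pair of CV terms. The function names, the count-based inputs, and the base-2 logarithm are assumptions made for illustration only; they are not taken from the LSLink tool itself.

```python
from math import comb, log2

def lod_score(n_pair, n_a, n_b, n_total):
    """Log-odds of seeing terms A and B together versus by chance.

    n_pair:  termlinks that contain both A and B
    n_a:     termlinks that contain A
    n_b:     termlinks that contain B
    n_total: all termlinks in the dataset
    (All four counts are hypothetical inputs for this sketch.)
    """
    observed = n_pair / n_total
    expected = (n_a / n_total) * (n_b / n_total)  # assumes independence
    return log2(observed / expected)

def hypergeom_pvalue(k, n_user, K, N):
    """P(X >= k): chance of at least k occurrences of the pair when
    n_user termlinks are drawn from a background of N termlinks,
    K of which contain the pair."""
    upper = min(n_user, K)
    tail = sum(comb(K, i) * comb(N - K, n_user - i)
               for i in range(k, upper + 1))
    return tail / comb(N, n_user)
```

    A positive LOD score suggests the pair co-occurs more often than independence would predict, and a small tail probability suggests the pair is over-represented in the user dataset relative to the background.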

    We also exploit knowledge of the underlying ontologies. We aggregate termlinks and develop an extension to confidence and support. We illustrate the benefits of exploiting ontology structure using an experimental dataset of an OMIM record related to a disease that is hyperlinked to a set of gene records from Entrez Gene, which are themselves hyperlinked to a set of publications from PubMed.

    This is ongoing research with Louiqa Raschid, Hassan Sayyadi, and Padmini Srinivasan.

    11 a.m. Tuesday February 19, 2008

    Title: Bayesian analysis of complex biological systems
    Speaker: Edo Airoldi, Ph.D. (Princeton University)
    Venue: A.V. Williams Building Room 2460

    Modern technology has transformed the concept of data in the biological, social and computational sciences. Data collections about a number of biological systems, for instance, have grown large and heterogeneous, in terms of both the units of analysis and the measurements on such units. Measurements are typically collected over time, as non-observable mechanisms of interest unfold. This increase in the complexity of the observations, however, has hardly translated into a richer understanding of mechanisms and principles that can explain them. This problem is paramount in systems biology, where large-scale and high-throughput measurements of genes, proteins or enzymes promise fundamental insights into signaling and metabolic pathways in the cell, and the development of disease.

    In this talk, I will introduce mechanistic models and inference algorithms for the analysis of complex biological systems with the goals of testing substantive hypotheses, making predictions, and driving further experimentation, with a focus on scalability and data integration issues.

    In particular, I will demonstrate two advantages of the mechanistic approach to systems biology. In a first case study, a model of how proteins interact allows us to ground the analysis in the context of accepted theories and empirical observations about the cell, and posterior (variational) inference reveals proteins' multifaceted functional roles. This model leads to predictions of cellular events that we can measure; the goal is to drive experimentation in large event spaces. In a second case study, a model of cellular growth allows us to identify growth-specific programs of gene expression and suggests the notion of "effective growth rate" of a cellular culture. This model opens a window on cellular events that we cannot measure, by quantifying effective growth at a finer temporal resolution than that accessible with current technology (minutes rather than hours). These results contribute to a system-level understanding of the connections among growth rate, metabolism, environmental stress response, and the cell division cycle.


    Edo Airoldi is a postdoctoral fellow at Princeton University, affiliated with the Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics. His research interests include statistical methodology, machine learning, mechanistic models of complex systems and random graph dynamics, with application to the biological and social sciences.

    11 a.m. Thursday February 21, 2008

    Title: Algorithms for Discovering Variation with (Short) Reads
    Speaker: Michael Brudno, Ph.D. (University of Toronto)
    Venue: Biomolecular Science Building Room 3118

    In this talk I will demonstrate two algorithms for discovering variation from whole genome shotgun sequencing data. First, I will present the SHRiMP tool for short read mapping. SHRiMP combines a novel spaced seed filtering step with a very fast implementation of the Smith-Waterman algorithm to map letter or color-space reads to a reference genome. We implement a specialized algorithm for aligning color space reads in letter space that can handle sequencing errors in a rigorous fashion. Second, I will present an algorithm for finding structural variations (large scale insertions, deletions, and inversions in the genome) using clone end sequence data. Unlike previous methods, our approach does not rely on an a priori determined mapping of all reads to the reference. Instead, we build a framework for finding the most probable assignment of sequenced clones to potential structural variants based not only on the level of sequence similarity, but also based on the other clones. We use this algorithm to compare the genome of the JCVI donor individual to the reference NCBI human genome, isolating a number of indel and inversion events, as well as a small number of inter-chromosomal events. In particular we will illustrate a deletion found in the DMBT1 gene of the JCVI donor that has previously been linked to the progression of cancers.
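
    For readers unfamiliar with the Smith-Waterman algorithm mentioned above, here is a minimal Python sketch of its core recurrence for letter-space sequences. It is not SHRiMP's implementation: the scoring values and the linear gap penalty are assumptions chosen for illustration, and the spaced seed filter and color-space handling described in the abstract are not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

    Flooring each cell at zero is what makes the alignment local: a poorly matching prefix never penalizes a well matching region later in the read.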

    11 a.m. Monday February 25, 2008

    (CBCB Candidate Talk)
    Title: Evolution of gene architecture: challenging the neutral paradigm
    Speaker: Liran Carmel, Ph.D. (Visiting Fellow, NCBI, NLM, National Institutes of Health)
    Venue: A.V. Williams Building Room 2328

    Many eukaryotic genes are fragmented along the DNA, interrupted by noncoding segments called introns. Until recently, the prevailing concept held that introns evolve by nonadaptive forces as slightly deleterious elements that lack any function. However, evidence has been accumulating showing that there are anecdotal exceptions to this concept. In my talk, I will present a thorough large-scale study that implies a functional role for a previously unappreciated large fraction of introns, highlighting their importance to the rapidly rising interest in functional noncoding elements. To this end, I will lay out a comprehensive model for the evolution of gene architecture, and will introduce an intron-exon data set that is significantly larger than previously studied ones. To obtain a definitive reconstruction of gene architecture evolution, we interpret the phylogenetic tree as a graphical model, and develop an expectation-maximization algorithm to estimate the parameters of the model. We use a realization of the junction tree algorithm to compute the sufficient statistics that are required for the expectation step.

    This work culminated in several observations, some of which revise common beliefs in the field. Taken together, these findings put forward the possibility that once introns had invaded early eukaryotic genomes in an arguably nonadaptive fashion, many were exploited in novel ways, gradually gaining diverse functions, up to the point that probably only a few of today's eukaryotes could survive without them. The results of this study were integrated with whole-genome multivariate analysis at the systems level, showing that genes with high expression level and low sequence evolutionary rate have a tendency to accumulate introns. These ideas gain further credence from the fact that the positions of many exon boundaries are known to be shared between distant eukaryotic taxa, e.g., in my data set 25% of the intron positions are shared between plants and animals. This observation can be explained by either remarkable conservation of ancient introns or by parallel, independent, intron gain at the same positions. Using my algorithm, a calculation of the relative contributions of the two factors reveals that shared ancestry is by far the dominant one (for example, more than 80% of the introns shared by plants and animals are due to shared ancestry). While a mechanistic explanation cannot be ruled out, such an impressive endurance of a substantial fraction of the introns is likely to reflect their functional importance. Consequently, I suggest using conserved intron positions as a novel tool for identifying functional noncoding elements.


    Dr. Carmel received his Ph.D. in Computer Science and Applied Mathematics from the Weizmann Institute of Science, Israel, in 2004. In his Ph.D., he studied the mathematics and algorithms of odor digitization and communication. Since 2004, he has been a Visiting Fellow in Eugene Koonin's molecular evolution group at NIH/NLM/NCBI. There, he studies many topics in evolutionary genomics. In particular, he has investigated questions regarding eukaryotic gene architecture.

    11 a.m. Tuesday March 4, 2008

    (CBCB Candidate Talk)
    Title: How do Genes Evolve? Computational Approaches for Investigating Molecular Evolutionary Heterogeneity
    Speaker: Bryan Kolaczkowski, Ph.D. (Postdoctoral Research Scientist, University of Oregon's Center for Ecology and Evolutionary Biology)
    Venue: A.V. Williams Building Room 2460

    One of the principal problems in modern biology is understanding how evolutionary changes in molecular sequence lead to functional and phenotypic differences among species. It is impossible to examine all molecular changes experimentally due to the overwhelming volume of sequence data, so computational methods are used to predict which changes are likely to be functionally important and which are not. Here I introduce a powerful new method for predicting functionally important evolutionary changes. The method is based on a heterogeneous phylogenetic model of molecular evolution and stems from the observation that evolutionary forces act differently at different positions in the molecule and regularly change over time. These changes in evolutionary pressures leave a signature at the sequence level that can be used to understand how the molecules have evolved. I show that this heterogeneous model provides an improved fit to empirical sequence data compared to existing models and improves the quality of inferred evolutionary relationships, providing a more accurate framework for interpreting comparative results. I also show how this model can predict the specific evolutionary forces acting at each position in the molecule, providing a detailed description of molecular evolution that can be used to make functional predictions. Complex evolutionary models like the one I describe pose significant computational challenges, particularly increased algorithmic complexity and the potential for model overfitting. I highlight some of these potential drawbacks and the techniques I am using to address them. Heterogeneous models like the one I have developed can potentially drive scientific discovery by sifting through copious molecular sequence data to generate strong testable hypotheses.


    In 2006, Bryan received his PhD in computer science from the University of Oregon, where he was trained in both computer science and biology through an NSF IGERT program in evolution, development and genomics. Since then, Bryan has been working as a postdoctoral scientist at the University of Oregon's Center for Ecology and Evolutionary Biology. Bryan's research focuses on developing statistical and computational methods for investigating molecular evolution and using these methods to address important biological questions.

    2 p.m. Wednesday March 5, 2008

    (Cell Biology and Molecular Genetics Special Seminar)
    Title: Integrated genomics and computational systems biology for tuberculosis
    Speaker: James Galagan, Ph.D. (Associate Director of Microbial Genome Analysis, Broad Institute of MIT and Harvard)
    Venue: Biosciences Research Building Room 1103

    The combination of computational biology and genomic technology is providing new approaches to the study of microbiology and infectious disease. For the first time we are in a position to construct a comprehensive view of the molecular networks underlying microbial physiology and pathogenesis. This in turn promises to have a direct impact on public health and clinical care. In this talk I will describe computational genomic approaches to studying the metabolic and genetic networks of Mycobacterium tuberculosis, the causative agent of TB.

    Metabolic changes are a critical component of TB pathogenesis and latency, and many first-line TB drugs target metabolism. To better study TB metabolism, we developed computational methods to model metabolic networks. In particular we have developed a novel approach to coupling expression array data with computational flux balance analysis to predict metabolic state from gene expression state. I will describe this method and our application of it to predict the metabolic impact of drugs and environmental conditions on mycolic acid biosynthesis. I will also describe a related method we developed to predict nutrient and environmental conditions from metabolic state predictions, a potential tool for investigating the phagosomal environment within which the intracellular TB pathogen lives.

    Metabolic changes are orchestrated by gene regulatory changes, and I seek to understand the regulatory programs at work during TB pathogenesis at the level of genes, operons, and the full regulatory network. I will describe our work to computationally annotate genes and operons, and in particular the development of a method for sequence annotation using Conditional Random Fields. I will also describe our strategy to combine comparative analysis, expression mining, and ChIP-Seq to map TB regulatory networks. Ultimately, my goal is to derive an integrated model of TB metabolism and gene regulation that can be used to address fundamental questions concerning TB pathogenesis, persistence, and drug resistance.


    James Galagan is Associate Director of Microbial Genome Analysis at the Broad Institute of MIT and Harvard. He oversees the analysis of genomic data generated by the Microbial Sequencing Center and the Fungal Genome Initiative. His group develops and applies computational and genomic methods to study regulation and evolution with particular focus on the biology of infectious disease. Current projects include comparative analysis of Plasmodium spp., Mycobacterium tuberculosis, and Cryptococcus spp. In addition, James has NSF support to develop and apply computational tools for comparative fungal genome analysis.

    11 a.m. Thursday March 6, 2008

    (CBCB Candidate Talk)
    Title: Integrative structural and functional analysis to study proteins and their interactions
    Speaker: Anna Panchenko, Ph.D. (Associate Investigator, NCBI, NLM, National Institutes of Health)
    Venue: A.V. Williams Building Room 2460

    While several hundred complete genomes have been sequenced so far, the biological roles of many gene products remain uncharacterized. The integrative approach combining sequence, structural and functional analyses of proteins will make it possible to tackle this problem on a large scale. One aspect which complicates such analyses is that proteins are under constant scrutiny imposed by natural selection and must be viewed within the context of evolutionary history. To this end we analyze the patterns of evolutionary conservation for different protein regions (cores and loops) and identify useful metrics for detecting remote evolutionary relationships, finding interesting structural similarities, and predicting the locations of functionally important sites.

    Recent studies have shown that genomes are extremely complex, with numerous gene products working together to perform specific cellular functions. Indeed, it is evident now that the vast majority of proteins interact with multiple partners and form intricate interaction networks. We analyze the conservation and diversity of protein-protein interactions using the example of domain-domain interactions in the structural databank, infer the biological roles of interaction interfaces, model the evolution of different domain rearrangements, and compile a set of interacting domains which can be used in homology modeling.

    11 a.m. Tuesday March 11, 2008

    (CBCB Candidate Talk)
    Title: Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority
    Speaker: Sourav Chatterji, Ph.D. (Postdoctoral Scholar, Genome Center, University of California at Davis)
    Venue: A.V. Williams Building Room 2460

    Metagenomics, the application of genome sequencing techniques to unculturable microbial communities, is revolutionizing microbiology and has shed light on the role of these communities in our environment as well as human health. Unlike traditional genomics data, a metagenomic data-set is made up of sequence reads from multiple species with varying relative abundance. Consequently, metagenomic data is mosaic and fragmentary in nature, necessitating the development of new methods for its analysis.

    My presentation will describe computational methods that we have developed for analyzing metagenomic data. First, I will introduce CompostBin, a DNA composition based algorithm that we have developed for classifying metagenomic sequences into taxa-specific bins. Then, I will discuss a pipeline for phylogenomic analysis of metagenomic data. Finally, I will discuss how these methods fit into the big picture of the profiling of microbial communities.
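
    To make the idea of "DNA composition" concrete, the sketch below computes a normalized k-mer frequency profile for a sequence, a common composition feature for comparing reads. This is only an illustration: the abstract does not specify which features or binning procedure CompostBin uses, and the function name and the default k=4 are hypothetical choices.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies of seq over the alphabet ACGT.

    The result is a list of 4**k values in lexicographic k-mer order.
    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    # Guard against empty or too-short sequences.
    return [counts[m] / total if total else 0.0 for m in kmers]
```

    Profiles from reads of the same genome tend to resemble each other more than profiles from unrelated genomes, which is what makes composition usable as a binning signal.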


    Sourav Chatterji is a Postdoctoral Scholar at the UC Davis Genome Center. He graduated from UC Berkeley with a Ph.D. in Computer Science and a Designated Emphasis (minor) in Computational and Genomic Biology. One of his main contributions there was the development of GeneMapper, a reference-based genome annotation program. GeneMapper has been used extensively for annotating genomes and studying genome evolution across a variety of eukaryotic clades. More recently, he has been working with Prof. Jonathan Eisen at UC Davis on computational methods in metagenomics.

    11 a.m. Wednesday March 12, 2008

    (CBCB Candidate Talk)
    Title: Estimating the significance of sequence motifs
    Speaker: Uri Keich, Ph.D. (Assistant Professor, Department of Computer Science, Cornell University)
    Venue: A.V. Williams Building Room 2460

    The identification of transcription factor binding sites is an important step in understanding the regulation of gene expression. To address this need, many motif-finding tools have been described that can find short sequence motifs given only an input set of sequences. Our talk is dedicated to the computational analysis of the significance of the motifs reported by those motif finders. Somewhat surprisingly, development of this significance analysis has lagged considerably behind the extensive development of the finders themselves. Nevertheless, this analysis is often the only information available to biologists when deciding whether or not to invest the resources required to verify the predictions of those finders.


    Uri Keich has been an assistant professor in the Department of Computer Science at Cornell University since 2003. He has won an NSF CAREER award as well as the Wilhelm T. Magnus Memorial Prize for Significant Contributions to the Mathematical Sciences given by the Courant Institute at NYU.

    2 p.m. Thursday March 27, 2008

    Title: Two-Sided Relative Ranking: A Robust Indirect Similarity Measure for Gene Expression Data
    Speaker: Louis Licamele
    Venue: Biomolecular Science Building Room 3118

    There is a wealth of gene expression data available in the public domain. However, because of variations in experimental conditions, it is often difficult to combine information across the different sources. In this paper we present a new method, which we refer to as indirect two-sided relative ranking, for comparing gene expression probes which is robust to variations in experimental conditions. This method extends the current best approach, which is based on comparing the correlations of the up- and down-regulated genes, with a comparison based on the correlations in rankings across the entire database. We evaluate the ability of this method to retrieve compounds with similar therapeutic effects across known experimental barriers, namely vehicle and batch effects, on two different datasets. We show that our indirect method is able to improve upon the previous state-of-the-art method on both datasets, with a substantial improvement in ranked recall of 97.03% and 49.44%, respectively.

    11 a.m. Tuesday April 1, 2008

    (CBCB Candidate Talk)
    Title: Machine learning approaches for understanding the genetic basis of complex traits
    Speaker: Su-In Lee (PhD candidate, Department of Computer Science, Stanford University)
    Venue: A.V. Williams Building Room 2460

    Humans differ in many "phenotypes" such as weight, hair color and, more importantly, disease susceptibility. These phenotypes are largely determined by each individual's specific genotype, stored in the 3.2 billion bases of his or her genome sequence. Deciphering the sequence by finding which sequence variations cause a certain phenotype would have a great impact. The recent advent of high-throughput genotyping methods has enabled retrieval of an individual's sequence information on a genome-wide scale. Classical approaches have focused on identifying which sequence variations are associated with a particular phenotype. However, the complexity of cellular mechanisms, through which sequence variations cause a particular phenotype, makes it difficult to directly infer such causal relationships. In this talk, I will present machine learning approaches that address these challenges by explicitly modeling the cellular mechanisms induced by sequence variations. Our approach takes as input genome-wide expression measurements and aims to generate a finer-grained hypothesis such as "sequence variations S induce cellular processes M, which lead to changes in the phenotype P". Furthermore, we have developed the "meta-prior algorithm", which can learn the regulatory potential of each sequence variation based on its intrinsic characteristics. This improvement helps to identify a true causal sequence variation among very many variations in the same chromosomal region. Our approaches have led to novel insights on sequence variations, and some of the hypotheses have been validated through biological experiments. Many of the machine learning techniques are generally applicable to a wide-ranging set of applications, and as an example I will present the meta-prior algorithm in the context of movie rating prediction tasks using the Netflix data set.


    Su-In Lee is a Ph.D. candidate at Stanford University, where she is a member of the Stanford Artificial Intelligence Laboratory. Her research focuses on devising computational methodologies for understanding the genetic basis of complex traits. She is also interested in developing general machine learning algorithms for broader applications. Su-In graduated Summa Cum Laude with a B.Sc. in Electrical Engineering and Computer Science from Korea Advanced Institute of Science and Technology and was a recipient of the Stanford Graduate Fellowship.

    2 p.m. Thursday April 3, 2008

    Title: Using Similarity Flooding for Extracting Similar Parts of Proteins
    Speaker: Hassan Sayyadi
    Venue: Biomolecular Science Building Room 3118

    Proteins are the main players in the game of life. A good understanding of their structures, functions, and behaviors leads to a good understanding of drugs, diseases, and thus our health. Much effort has therefore been devoted to studying and categorizing proteins. Tens of thousands of proteins are now known, and comparing them is a hard problem, so efficient methods are needed. We combine an important concept from computational geometry, "Delaunay Tetrahedralization", with a graph matching algorithm, "Similarity Flooding", and propose a new approach to extracting similar parts of proteins. Furthermore, we use protein fragmentation to reduce the time and storage complexity of the model for larger proteins.

    2 p.m. Thursday April 10, 2008

    Title: The iPlant Collaborative inaugural conference "Bringing Plant and Computing Scientists Together to Solve Plant Biology's Grand Challenges"
    Speaker: Stephen M. Mount, Ph.D.
    Venue: Biomolecular Science Building Room 3118

    The purpose of this conference is to 1) explain the nature of the project and 2) facilitate community discussion of the most compelling grand challenges, as well as the data, computational tools, and cyberinfrastructure necessary to solve them.

    2 p.m. Thursday April 17, 2008

    Title: Improving Reliability of Peptide Identification by Statistical Machine Learning
    Speaker: Xue Wu
    Venue: Biomolecular Science Building Room 3118

    Peptide identification by tandem mass spectrometry (MS/MS) is the dominant proteomics workflow for protein characterization in complex samples. In this talk, I present two approaches for improving peptide identification reliability using statistical machine learning.

    HMMatch is a hidden Markov model approach to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity.

    PepArML (Peptide Identification Arbiter by Machine Learning) is a machine learning based algorithm for unifying current peptide identification software. It provides better specificity and sensitivity by effectively utilizing multiple tandem MS search engines and additional spectral features. We demonstrate that both approaches achieve better accuracy than popular peptide identification software by extracting and using more of the information hidden in the protein tandem mass spectra.

    2 p.m. Thursday April 24, 2008

    Title: Statistical Methods for Detecting Differentially Abundant Taxa in Metagenomic Samples
    Speaker: James Robert White
    Venue: Biomolecular Science Building Room 3118

    Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies will dramatically expand our understanding of the microbial biosphere and, more importantly, will reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial communities. An important prerequisite for such discoveries is the availability of computational tools able to rapidly and accurately compare large datasets generated from complex bacterial communities.

    I will describe a statistical method for detecting differentially abundant organisms between two populations using count data (e.g. 16S rRNA surveys). In high-complexity environments, our method employs the false discovery rate to improve specificity and properly handles low-abundance taxa. To demonstrate the use of our tool, I shall present comparisons of publicly available human and mouse gut microbiome datasets, identifying differences between these bacterial populations at different levels of resolution. Furthermore, we have re-analyzed the data generated in a recent study on obesity and identified a previously uncharacterized difference between the gut flora of obese and lean human subjects.
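
    As an illustration of how false discovery rate control works when many taxa are tested at once, the following is a minimal sketch of the Benjamini-Hochberg adjustment. The abstract does not say which FDR procedure the tool uses, so Benjamini-Hochberg is an assumption here, and the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: one p-value per taxon, from some per-taxon test that is
    not specified in the abstract.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four taxa.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
# Taxa whose adjusted value is at most 0.03 would be reported.
reported = [i for i, q in enumerate(example_q) if q <= 0.03]
```

    Reporting only taxa whose adjusted value falls below a chosen threshold bounds the expected fraction of false positives among the reported taxa, which matters when hundreds of taxa are compared at once.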

    2 p.m. Thursday May 1, 2008

    Title: Protein interaction networks in viruses and bacteria: going beyond eukaryotes
    Speaker: Peter Uetz, Ph.D. (J. Craig Venter Institute)
    Venue: Biomolecular Science Building Room 3118

    Protein interaction mapping studies have previously focused on yeast and other eukaryotes. More recently, however, several groups have published more or less "comprehensive" protein interaction datasets for bacteria, namely E. coli, Campylobacter jejuni, and Treponema pallidum. In addition, there are "incomplete" datasets for several other species. While yeast studies have provided enough data to develop many bioinformatics tools for analyzing these data and for combining, correlating, and comparing them with other data, such analyses are much less developed in bacteria. I will present data from our own project on Treponema pallidum, the syphilis spirochete, and discuss what we can (and cannot) learn from it.

    We have also done various projects on systematic protein interaction mapping in human viruses and are about to start similar projects for bacteriophage. Although we are still at the beginning, such studies will provide a starting point for host-pathogen systems biology.

    2 p.m. Thursday May 8, 2008

    Title: (Re)-assembly of the cow genome
    Speaker: Guillaume Marcais
    Venue: Biomolecular Science Building Room 3118

    We present here a new assembly of the cow genome and some of the methods used to create it. This new assembly is better, both qualitatively and quantitatively, than the previous one produced by the Baylor College of Medicine.

    Title: Improving Draft Assemblies using Existing Data
    Speaker: Poorani Subramanian
    Venue: Biomolecular Science Building Room 3118

    We will introduce a simple algorithm for closing gaps and fixing misassemblies in draft genomes.

    2 p.m. Thursday May 15, 2008

    Title: Development of a phylogenomics pipeline for the analysis of genomic data from the haptophyte Emiliania huxleyi and EST data from dinoflagellates
    Speaker: John J. Miller
    Venue: Biomolecular Science Building Room 3118

    Haptophytes and dinoflagellates are prominent members of marine phytoplankton and are responsible for a significant portion of global primary productivity. Both groups have unique cytological features, including secondary or tertiary plastids. I have been working on an analytical pipeline to sift through the genome of the haptophyte Emiliania huxleyi and EST data from various dinoflagellates and produce preliminary phylogenetic analyses. I will talk about the development of my pipeline and present preliminary results. Several problems still exist, one of which is that many queries get few if any hits. Another is that the resulting trees frequently include aberrantly placed long-branch taxa. These misplaced taxa may result from sequences with reversed polarity or from sequences that cover only a limited portion of the gene.