CBCB Seminar Series

Fall 2007

2 p.m. Thursday August 30, 2007

Title: Organizational meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: To discuss the schedule for Fall 2007.

2 p.m. Wednesday September 5, 2007

(This is part of the CSCAMM seminar series)
Title: Estimating the Significance of Sequence Motifs
Speaker: Uri Keich (Cornell University)
Venue: CSIC 4122

Efficient and accurate statistical significance evaluation is an essential requirement of motif-finding tools. One widely used significance criterion is the E-value of the motif's information content or entropy score. Current computation schemes used in popular motif-finding programs can unwittingly provide poor approximations. We present an approach for fast and reliable estimation of this E-value that can be applied more generally.

Unfortunately, this improvement did not completely solve the motif significance estimation problem. In particular, we more recently found that relying on these E-values when searching for relatively weak motifs can lead to undesirable results. This motivated our design of a novel, parametric approach for analyzing the significance of sequence motifs.
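
For readers unfamiliar with the information content score mentioned above, the following is a minimal sketch of one common definition: the relative entropy, summed over alignment columns, between observed letter frequencies and a background distribution. The function name, the uniform default background, and the example sites are assumptions made for illustration; the abstract does not specify the exact score or the E-value computation used in the talk, and this sketch does not compute an E-value.

```python
import math
from collections import Counter

def information_content(sites, background=None, alphabet="ACGT"):
    """Sum over columns of the relative entropy (in bits) between the
    observed letter frequencies and the background frequencies.

    `sites` is a list of equal-length aligned motif instances.
    The uniform default background is an assumption for illustration.
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        for letter, count in counts.items():
            freq = count / len(sites)
            total += freq * math.log2(freq / background[letter])
    return total

# A perfectly conserved 4-column DNA motif scores 2 bits per column.
conserved = information_content(["ACGT", "ACGT", "ACGT"])
# A column matching the uniform background carries no information.
uninformative = information_content(["A", "C", "G", "T"])
```

Under a uniform DNA background, each perfectly conserved column contributes 2 bits and a column that matches the background contributes 0, so higher scores indicate stronger motifs.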

2 p.m. Thursday September 13, 2007

Title: Molecular evolution in the Drosophila melanogaster species subgroup: Frequent departures from equilibrium DNA evolution and effects of natural selection
Speaker: Wen-Ya Ko, Ph.D.
Venue: Biomolecular Science Building Room 3118

Identifying the relative contributions of natural selection and neutral evolutionary forces in governing base composition and protein evolution is fundamental for studying molecular evolution. In Drosophila, synonymous mutations evolve under a balance of selection, mutation, and drift, in which natural selection increases the fixation probabilities of translationally superior codons, but mutation pressure and genetic drift allow selection-unpreferred codons to persist in a genome. Under this model, detecting deviations from equilibrium evolution in different classes of mutations provides opportunities to determine the underlying evolutionary force(s). Here, we studied lineage-specific patterns and rates of nucleotide changes from 19 loci (10110 codons) in the six extant Drosophila lineages (D. melanogaster, simulans, teissieri, yakuba, erecta, and orena) and two ancestral lineages. Nonstationary molecular evolution (either toward GC-increasing or -decreasing) appears to occur frequently within these closely related siblings. Strong regional heterogeneity in base composition evolution was also identified in a cluster of genes located near the telomere of the X chromosome in D. yakuba and D. orena. Our results show that frequent fluctuations in evolutionary parameters over relatively short timescales or narrow genomic regions can complicate interpretations of molecular evolution. Heterogeneity in patterns of nucleotide changes between different classes of mutations suggests that natural selection plays a predominant role in shaping codon bias evolution. Because the effectiveness of natural selection in discriminating between preferred and unpreferred mutations differs between polymorphism and divergence stages, inclusion of polymorphic mutations in a single-allele analysis may have pronounced effects on inferring patterns of DNA divergence.
Preliminary investigation of within- and between-species nucleotide variation shows that the most recent common ancestor lies deep in the gene tree for each species of the D. yakuba species complex (i.e., D. teissieri, yakuba, and santomea). Polymorphic mutations appear to have potentially profound effects on patterns of molecular evolution in a single-allele analysis in these species.


  • AKASHI, H., 1996 Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics 144: 1297-1307.
  • AKASHI, H., 1999 Within- and between-species DNA sequence variation and the 'footprint' of natural selection. Gene 238: 39-51.
  • AKASHI, H., W. Y. KO, S. PIAO, A. JOHN, P. GOEL et al., 2006 Molecular evolution in the Drosophila melanogaster species subgroup: frequent parameter fluctuations on the timescale of molecular divergence. Genetics 172: 1711-1726.
  • KO, W. Y., S. PIAO and H. AKASHI, 2006 Strong regional heterogeneity in base composition evolution on the Drosophila X chromosome. Genetics 174: 349-362.

    2 p.m. Thursday September 20, 2007

    Title: High-throughput sequence alignment using Graphics Processing Units
    Speaker: Mike Schatz, Cole Trapnell
    Venue: Biomolecular Science Building Room 3118

    The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. We present MUMmerGPU, a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.

    2 p.m. Thursday September 27, 2007

    Title: Regulatory sequence encryption. From motif detection to function-specific signature resolution
    Speaker: Ivan Ovcharenko, Ph.D. (NIH/NLM/NCBI)
    Venue: Biomolecular Science Building Room 3118

    There are two fundamental challenges facing the post-genome sequencing era: understanding the association between genome variation and disease and ascertaining the role of noncoding DNA in evolution of complex genomes. Both of them are linked to mutations in gene regulatory elements that underlie evolutionary changes and population-specific differences in gene expression. In order to identify and characterize gene regulatory mutations, it is necessary to decipher the encryption of the gene regulatory code in the human and other complex genomes. We are combining comparative genomics, microarray expression data, libraries of transcription factor binding specificities, and pattern searches to develop a computational approach capable of "translating" the primal noncoding DNA sequence into signatures of tissue-specific regulatory elements. This allows us to start unveiling the gene regulatory landscape of the human genome.

    2 p.m. Thursday October 4, 2007

    Title: Tandem Mass Peptide Identification with Statistical Machine Learning
    Speaker: Xue Wu
    Venue: Biomolecular Science Building Room 3118

    Peptide identification by tandem mass spectrometry (MS/MS) is the dominant proteomics workflow for protein characterization in complex samples. In this talk, I present two approaches for accurate peptide identification using statistical machine learning.

    PepAr (Peptide Arbitrator) is a machine-learning-based algorithm for unifying current peptide identification software. It provides better specificity and sensitivity by effectively utilizing multiple tandem MS search engines, additional spectral features, and different statistical scoring models.

    HMMatch is a hidden Markov model approach to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. We demonstrate that both approaches achieve better accuracy than popular peptide identification software by extracting and using more of the information hidden in protein tandem mass spectra.

    2 p.m. Thursday October 11, 2007

    Title: Elucidation of protein-protein interactions in the human pathogen Trypanosoma brucei
    Speaker: Gustavo Cerqueira, Ph.D.
    Venue: Biomolecular Science Building Room 3118

    One of the challenges of the postgenomic era is to elucidate the complex networks of interacting proteins and small molecules. In order to investigate cellular machinery and host-pathogen interactions on a global scale, we are using whole-genome approaches such as the two-hybrid assay and protein co-immunoprecipitation to screen for protein-protein interactions and characterize networks systematically. We have: 1) initiated the generation of the Tryp ORFeome by recombinational cloning of nearly the entire set of 9,000 protein-encoding genes in Trypanosoma brucei, 2) performed a multiple in silico comparison of known protein-protein interaction networks of Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae with the predicted ORFeomes of three related trypanosomatid parasites, and 3) tested 4 interaction predictions for T. brucei by yeast two-hybrid analysis, confirming some of them. Our comparison integrates protein interaction and sequence information to reveal network regions that are conserved across all six species. Using this conservation, we predicted a first set of ~360 interologs in each of the three trypanosomatids.

    We will be automating our cloning and Y2H screening protocols to initiate comprehensive screens with the aim of determining the T. brucei interactome as well as protein-protein interactions with subsets of expressed proteins and protein fragments from the human ORFeome.

    2 p.m. Thursday November 1, 2007

    Title: Precomputing Edit-Distance Specificity of Short Oligonucleotides
    Speaker: Nathan Edwards, Ph.D.
    Venue: Biomolecular Science Building Room 3118

    Oligonucleotide designs for PCR, microarray, and DNA-based pathogen detection assays all rely on the hybridization specificity of short oligos to target DNA loci. Checking oligo designs for potential non-specific hybridization remains the dominant computational bottleneck in large-scale oligonucleotide design pipelines. For this reason, the verification of hybridization specificity is often left to the last stage of the design pipeline, when the number of candidate oligos is smallest. For some target loci, this can lead to many design iterations before a suitably specific oligo is found.

    We will describe some new techniques for determining, in advance, the edit-distance specificity of all short oligonucleotides from large DNA sequence databases, such as the human genome or the set of all bacterial sequences. This work uses new insights into the properties of non-specific oligo sequences and lossless spaced seed design to turn an impractical quadratic-time computation into a linear-time computation for typical large DNA sequence databases. Using this infrastructure, we have determined that there are no edit-distance 3 unique 20-mers in the human genome, and that fewer than 0.03% are edit-distance 2 unique.
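
    As a point of reference for the edit-distance checks described above, here is a minimal brute-force sketch, not the linear-time method from the talk. It uses the classic approximate string matching dynamic program (often attributed to Sellers) to find the smallest edit distance between one oligo and any substring of an off-target sequence. The function name and the example sequences are assumptions for illustration, and the abstract does not define "edit-distance k unique" precisely, so this sketch only reports the minimum distance.

```python
def min_edit_distance_to_substring(oligo, text):
    """Smallest edit distance between `oligo` and any substring of `text`.

    Approximate-matching dynamic program: row 0 is all zeros, so a match
    may start at any position in `text`; taking the minimum of the last
    row lets it end at any position. Runs in O(len(oligo) * len(text)).
    """
    prev = [0] * (len(text) + 1)  # empty oligo prefix matches anywhere at cost 0
    for i, a in enumerate(oligo, start=1):
        curr = [i]  # oligo prefix of length i against an empty substring
        for j, b in enumerate(text, start=1):
            curr.append(min(
                prev[j] + 1,              # drop character a from the oligo
                curr[j - 1] + 1,          # skip character b in the text
                prev[j - 1] + (a != b),   # match or substitute
            ))
        prev = curr
    return min(prev)

# Hypothetical off-target sequences, chosen only to exercise the function.
exact = min_edit_distance_to_substring("ACGT", "TTACGTTT")
one_mismatch = min_edit_distance_to_substring("ACGT", "TTAGGTTT")
distant = min_edit_distance_to_substring("ACGT", "AAAA")
```

    Running this check for every candidate oligo against a whole genome is the kind of quadratic-time computation the abstract calls impractical, which is why precomputing specificity in advance is attractive.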

    2 p.m. Thursday November 8, 2007

    Title: Learning Networks from Biology, Learning Biology from Networks
    Speaker: Chris Wiggins, Ph.D. (Columbia University)
    Venue: Biomolecular Science Building Room 3118

    Both the 'reverse engineering' of biological networks (for example, by integrating sequence data and expression data) and the analysis of their underlying design (by revealing the evolutionary mechanisms responsible for the resulting topologies) can be re-cast as problems in machine learning: learning an accurate prediction function from high-dimensional data. In the case of inferring biological networks, predicting up- or down-regulation of genes allows us to learn ab initio the transcription factor binding sites (or 'motifs') and to generate a predictive model of transcriptional regulation. In the case of inferring evolutionary designs, quantitative, unambiguous model validation can be performed, clarifying which of several possible theoretical models of how biological networks evolve might best (or worst) describe real-world networks. In either case, by taking a machine learning approach, we statistically validate the models both on held-out data and via randomizations of the original dataset to assess statistical significance. By allowing the data to reveal which features are the most important (based on predictive power rather than overabundance relative to an assumed null model) we learn models which are both statistically validated and biologically interpretable.


  • Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and Christina Leslie. Predicting genetic regulatory response using classification. ISMB 2004; q-bio/0411028
  • Manuel Middendorf, Anshul Kundaje, Mihir Shah, Yoav Freund, Chris H. Wiggins, and Christina Leslie. Motif discovery through predictive modeling of gene regulation. RECOMB 2005.
  • M. Middendorf, E. Ziv, and C. H. Wiggins. Inferring network mechanisms: the Drosophila melanogaster protein interaction network. PNAS 2005; q-bio/0408010.
  • Manuel Middendorf, et al. Discriminative topological features reveal biological network mechanisms. BMC Bioinformatics 2004; q-bio/0402017.

    Biography:

    Chris Wiggins is an applied mathematician with a PhD in theoretical physics (Princeton, 1998) working in applications of information theory, inference, and machine learning in biology. Since 2001 he has been affiliated with the Department of Applied Physics and Applied Mathematics and the Center for Computational Biology and Bioinformatics (C2B2) at Columbia University. Previously, he was a Courant Instructor (1998-2001) at the Courant Institute, NYU. He has held visiting appointments at Institut Curie (Paris), the Hahn-Meitner Institut (Berlin), and the Kavli Institute for Theoretical Physics (UCSB).

    2 p.m. Thursday November 15, 2007

    Title: Phylogenetic Estimation for Complex Evolutionary Processes
    Speaker: Li-San Wang, Ph.D. (Penn Center for Bioinformatics, University of Pennsylvania)
    Venue: Biomolecular Science Building Room 3118

    Phylogeny reconstruction using stochastic sequence evolution models has been highly successful in reconstructing the evolutionary history of genes and species. These standard evolutionary models have two essential features, both of which are known to fail for a wide range of real biological data: (1) the domain of mutation is a concatenation of multiple independently distributed sites, each following a simple, identical stochastic process, and (2) the evolutionary history is a branching process (tree). In contrast, complex evolutionary processes lack one or the other of these two features. In both cases, new stochastic models need to be developed, and the inference of evolutionary histories under these models is much harder: in some cases simply more computationally intensive, but in other cases posing significant new algorithmic challenges. I will cover two such processes: the process of gene order evolution, and the process of horizontal gene transfer.

    In the second half of my talk I will present two applications of using phylogenetics to model biomedical data as complex branching processes: gene expression progression in cancer, and population stratification in genome-wide association studies.


    Li-San Wang received his B.S. (1994) and M.S. (1996) in Electrical Engineering from the National Taiwan University. He received his M.S. (2000) and Ph.D. (2003) from the University of Texas at Austin, both in Computer Sciences, and was a postdoctoral fellow at the University of Pennsylvania between 2003 and 2006. Currently he is an Assistant Professor of Pathology and Laboratory Medicine, Penn Center for Bioinformatics, and a fellow of the Institute on Aging, University of Pennsylvania. Dr. Wang's research interests include phylogenetics, comparative genomics, genome-wide association studies, and microarray analysis. He served on the program and organizing committees of several international workshops and conferences including EITC, WABI, BIBE, and BIBM.

    2 p.m. Thursday November 29, 2007

    Title: Cost-effective assembly of genomes using optical maps
    Speaker: Niranjan Nagarajan, Ph.D.
    Venue: Biomolecular Science Building Room 3118

    New, high-throughput sequencing technologies have made it feasible to cheaply produce vast amounts of sequence information about a genome of interest. The information obtained, however, has features such as short read lengths and the absence of mate-pairs that complicate computational efforts to reconstruct the complete sequence of organisms. We propose methods (to be freely available as an open-source package called SOMA) to overcome the limitations of sequence data by reliably combining information from optical maps. Extensive experiments with simulated datasets demonstrate the robustness of these methods to sequencing and assembly errors. We also present the results obtained by applying our algorithms to data generated from two bacterial genomes, Yersinia aldovae and Yersinia kristensenii. The resulting assemblies provide a single scaffold covering a large fraction of the respective genomes, suggesting that the careful use of optical maps can provide a cost-effective framework for the assembly of genomes.

    This is joint work with Dr. Mihai Pop.

    2 p.m. Thursday December 6, 2007

    Title: Multiple genome alignment and synteny chaining
    Speaker: Samuel Angiuoli
    Venue: Biomolecular Science Building Room 3118

    Multiple genome alignment programs aim to identify and align homologous regions across multiple, large genomic sequences, such as whole genomes. In addition, these programs should be robust in handling rearrangements, duplications, and indels that arise at varying sizes during genome evolution. A number of multiple genome aligners are readily available but increasing numbers of sequenced genomes continue to provide challenges for multiple genome alignment. Particular challenges include scalability and usability, especially in visualization and data structures for navigating complicated relationships in multiple genome alignment output, such as multi-genome synteny.

    This talk will present preliminary and ongoing work in evaluating and extending multiple genome alignment programs to analyze genome architecture. We compare two current multiple genome aligners, TBA and Mauve, and introduce a derivative of the modular TBA program that uses NUCmer for pairwise alignments. We also show preliminary results of a graph-based method for identification and retrieval of syntenic blocks and breakpoints in alignments of bacterial genomes. In addition, we have developed a number of visualization tools that aid in multiple genome analysis and are part of an integrative web-based software suite called Sybil.