CBCB Seminar Series
Fall 2007
2 p.m. Thursday August 30, 2007
Title: Organizational meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: Discussion of the Fall 2007 schedule.
2 p.m. Wednesday September 5, 2007
(This is part of the CSCAMM seminar series)
Title: Estimating the Significance of Sequence Motifs
Speaker: Uri Keich (Cornell University)
Venue: CSIC 4122
Abstract:
Efficient and accurate statistical significance evaluation is an essential
requirement of motif-finding tools. One such widely used significance
criterion is the E-value of the motif's information content or entropy
score. Current computation schemes used in popular motif-finding programs
can unwittingly provide poor approximations. We present an approach to a
fast and reliable estimation of this E-value that can be applied more
generally.
Unfortunately, this improvement did not completely solve the motif
significance estimation problem. In particular, we more recently found
that relying on these E-values when searching for relatively weak motifs
can lead to undesirable results. This motivated our design of a novel,
parametric approach for analyzing the significance of sequence motifs.
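As a point of reference for the information content (entropy) score mentioned above, here is a minimal sketch of how that score is commonly computed for a set of aligned motif sites. It is not the speaker's E-value method. The function name, the uniform background default, and the per-site normalization are assumptions made for illustration; some motif finders scale the score by the number of sites.

```python
import math
from collections import Counter


def information_content(sites, background=None, alphabet="ACGT"):
    """Relative entropy of an ungapped motif alignment, in bits.

    For each column, sums f_b * log2(f_b / q_b) over the observed letters b,
    where f_b is the column frequency and q_b the background frequency.
    The uniform background used by default is an assumption.
    """
    if background is None:
        background = {b: 1.0 / len(alphabet) for b in alphabet}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for col in range(width):
        # Count how often each letter occurs in this column.
        counts = Counter(s[col] for s in sites)
        for base, n in counts.items():
            f = n / len(sites)
            total += f * math.log2(f / background[base])
    return total
```

A perfectly conserved DNA column contributes 2 bits under a uniform background, and a column that matches the background contributes 0 bits. Judging whether a given score is statistically significant is the separate problem the talk addresses.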
2 p.m. Thursday September 13, 2007
Title: Molecular evolution in the Drosophila melanogaster species subgroup:
Frequent departures from equilibrium DNA evolution and effects of natural
selection
Speaker: Wen-Ya Ko, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract:
Identifying the relative contributions of natural selection and neutral
evolutionary forces in governing base composition and protein evolution is
fundamental to studying molecular evolution. In Drosophila, synonymous
mutations evolve under the balance of selection-mutation-drift in which
natural selection increases fixation probabilities of translationally
superior codons, but mutation pressure and genetic drift allow
selection-unpreferred codons to persist in a genome. Under this model,
detecting deviations from the equilibrium evolution in different classes
of mutations provides opportunities to determine the underlying
evolutionary force(s). Here, we studied lineage-specific patterns and
rates of nucleotide changes from 19 loci (10110 codons) on the six extant
Drosophila lineages (D. melanogaster, simulans,
teissieri, yakuba, erecta, and orena) and two
ancestral lineages. Nonstationary molecular evolution (either toward
GC-increasing or -decreasing) appears to occur frequently within these
closely related siblings. Strong regional heterogeneity in base
composition evolution was also identified from a cluster of genes located
near the telomere of the X chromosome in D. yakuba and D. orena.
Our results show that frequent fluctuations in evolutionary parameters
over relatively short timescales or narrow genomic regions can complicate
interpretations of molecular evolution. Heterogeneity in patterns of
nucleotide changes between different classes of mutations suggests that
natural selection plays a predominant role in shaping codon bias
evolution. Because the effectiveness of natural selection in discriminating
between preferred and unpreferred mutations differs between the polymorphism
and divergence stages, inclusion of polymorphic mutations in a single-allele
analysis may have pronounced effects on inferred patterns of DNA
divergence. Preliminary investigation of within- and between-species
nucleotide variation shows that the most recent common ancestor lies
deep in the gene tree for each species of the D. yakuba complex (i.e.,
D. teissieri, yakuba, and santomea). Polymorphic
mutations appear to have potentially profound effects on patterns of
molecular evolution in a single-allele analysis in these species.
References:
AKASHI, H., 1996 Molecular evolution between Drosophila
melanogaster and D. simulans: reduced codon bias, faster rates
of amino acid substitution, and larger proteins in D. melanogaster.
Genetics 144: 1297-1307.
AKASHI, H., 1999 Within- and between-species DNA sequence variation
and the 'footprint' of natural selection. Gene 238: 39-51.
AKASHI, H., W. Y. KO, S. PIAO, A. JOHN, P. GOEL et al., 2006 Molecular
evolution in the Drosophila melanogaster species subgroup: frequent
parameter fluctuations on the timescale of molecular divergence. Genetics
172: 1711-1726.
KO, W. Y., S. PIAO and H. AKASHI, 2006 Strong regional heterogeneity
in base composition evolution on the Drosophila X chromosome. Genetics
174: 349-362.
2 p.m. Thursday September 20, 2007
Title: High-throughput sequence alignment using Graphics Processing Units
Speaker: Mike Schatz, Cole Trapnell
Venue: Biomolecular Science Building Room 3118
Abstract:
The recent availability of new, less expensive high-throughput DNA
sequencing technologies has yielded a dramatic increase in the volume
of sequence data that must be analyzed. Sequence alignment programs
such as MUMmer have proven essential for analysis of these data, but
researchers will need ever faster, high-throughput alignment tools
running on inexpensive hardware to keep up with new sequence
technologies. We present MUMmerGPU, a high-throughput parallel
sequence alignment program that runs on commodity Graphics Processing
Units (GPUs) in common workstations. MUMmerGPU uses the new Compute
Unified Device Architecture (CUDA) from nVidia to align multiple query
sequences against a single reference sequence stored as a suffix tree.
By processing the queries in parallel on the highly parallel graphics
card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU
version of the sequence alignment kernel, and outperforms MUMmer by
more than 3-fold in total application time when aligning reads from
recent sequencing projects using Solexa/Illumina, 454, and Sanger
sequencing technologies.
2 p.m. Thursday September 27, 2007
Title: Regulatory sequence encryption. From motif detection to
function-specific signature resolution
Speaker: Ivan Ovcharenko, Ph.D. (NIH/NLM/NCBI)
Venue: Biomolecular Science Building Room 3118
Abstract:
There are two fundamental challenges facing the post-genome sequencing
era: understanding the association between genome variation and disease
and ascertaining the role of noncoding DNA in evolution of complex
genomes. Both of them are linked to mutations in gene regulatory elements
that underlie evolutionary changes and population-specific differences in
gene expression. In order to identify and characterize gene regulatory
mutations, it is necessary to decipher the encryption of the gene
regulatory code in the human and other complex genomes. We are combining
comparative genomics, microarray expression data, libraries of
transcription factor binding specificities, and pattern searches to
develop a computational approach capable of "translating" the primal
noncoding DNA sequence into signatures of tissue-specific regulatory
elements. This allows us to start unveiling the gene regulatory landscape
of the human genome.
2 p.m. Thursday October 4, 2007
Title: Tandem Mass Peptide Identification with Statistical Machine Learning
Speaker: Xue Wu
Venue: Biomolecular Science Building Room 3118
Abstract:
Peptide identification by tandem mass spectrometry (MS/MS) is the dominant
proteomics workflow for protein characterization in complex samples. In
this talk, I present two approaches for accurate peptide identification
using statistical machine learning.
PepAr (Peptide Arbitrator) is a machine-learning-based algorithm for
unifying current peptide identification software. It provides better
specificity and sensitivity by effectively utilizing multiple tandem MS
search engines, additional spectra features, and different statistical
scoring models.
HMMatch is a hidden Markov model approach to spectral matching, in which
many examples of a peptide's fragmentation spectrum are summarized in a
generative probabilistic model that captures the consensus and variation
of each peak's intensity. We demonstrate that both approaches achieve
better accuracy than popular peptide identification software by
extracting and using more information hidden in the protein tandem mass
spectra.
2 p.m. Thursday October 11, 2007
Title: Elucidation of protein-protein interactions in the human pathogen
Trypanosoma brucei
Speaker: Gustavo Cerqueira, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract:
One of the challenges of the postgenomic era is to elucidate the complex
networks of interacting proteins and small molecules. In order to
investigate cellular machinery and host-pathogen interactions on a global
scale, we are using whole-genome approaches such as the two-hybrid assay
and protein co-immunoprecipitation to screen for protein-protein
interactions and characterize networks systematically. We have: 1)
initiated the generation of the Tryp ORFeome by recombinational cloning of
nearly the entire set of 9,000 protein-encoding genes in
Trypanosoma brucei, 2) performed a multiple in silico
comparison of known protein-protein interaction networks of
Caenorhabditis elegans, Drosophila melanogaster, and
Saccharomyces cerevisiae with the predicted ORFeomes of three
related trypanosomatid parasites, and 3) tested 4 interaction predictions
for T. brucei by yeast two-hybrid analysis, confirming some of
them. Our comparison integrates protein interaction and sequence
information to reveal network regions that were conserved across all six
species. Using this conservation, we predicted a first set of ~360
interologs in each of the three trypanosomatids.
We will be automating our cloning and Y2H screening protocols to
initiate comprehensive screens with the aim of determining the T. brucei
interactome as well as protein-protein interactions with subsets of expressed
proteins and protein fragments from the human ORFeome.
2 p.m. Thursday November 1, 2007
Title: Precomputing Edit-Distance Specificity of Short Oligonucleotides
Speaker: Nathan Edwards, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract:
The oligonucleotide designs for PCR, microarray, and DNA-based pathogen
detection assays all rely on the hybridization specificity of short oligos
to target DNA loci. Checking oligo designs for potential non-specific
hybridization remains the dominant computational bottleneck in large-scale
oligonucleotide design pipelines. For this reason, the verification of
hybridization specificity is often left to the last stage of the design
pipeline, when the number of candidate oligos is smallest. For some target
loci, this can lead to many design iterations before a suitable specific
oligo is found.
We will describe some new techniques for determining, in advance, the
edit-distance specificity of all short oligonucleotides from large DNA
sequence databases, such as the human genome or the set of all bacterial
sequences. This work uses new insights on the properties of non-specific
oligo sequences and lossless spaced seed design to turn an impractical
quadratic time computation into a linear time computation, for typical
large DNA sequence databases. Using this infrastructure, we have
determined that there are no edit-distance 3 unique 20-mers in the human
genome, and that less than 0.03% are edit-distance 2 unique.
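To make the underlying check concrete, here is a minimal brute-force sketch of the kind of per-oligo computation that becomes impractical at database scale. It is not the speaker's precomputation technique. The function name is hypothetical, and it uses the standard approximate-matching dynamic program to find the smallest edit distance between an oligo and any substring of a sequence.

```python
def min_edit_distance_to_substring(oligo, text):
    """Smallest edit distance between `oligo` and any substring of `text`.

    Standard approximate-matching dynamic program: row 0 is all zeros, so a
    match may start anywhere in `text` at no cost, and the answer is the
    minimum of the final row, so it may end anywhere. Runs in
    O(len(oligo) * len(text)) time.
    """
    prev = [0] * (len(text) + 1)
    for i, a in enumerate(oligo, start=1):
        curr = [i]  # aligning oligo[:i] to an empty substring costs i
        for j, b in enumerate(text, start=1):
            curr.append(min(
                prev[j - 1] + (a != b),  # match or substitution
                prev[j] + 1,             # oligo character left unmatched
                curr[j - 1] + 1,         # extra character in text
            ))
        prev = curr
    return min(prev)
```

Under the assumption that an oligo counts as specific when every non-target sequence is farther than some threshold k, a design would be rejected whenever this function returns a value of k or less for a non-target sequence. Repeating that scan for every candidate oligo is the brute-force cost that motivates determining specificity in advance.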
2 p.m. Thursday November 8, 2007
Title: Learning Networks from Biology, Learning Biology from Networks
Speaker: Chris Wiggins, Ph.D. (Columbia University)
Venue: Biomolecular Science Building Room 3118
Abstract:
Both the 'reverse engineering' of biological networks (for example, by
integrating sequence data and expression data) and the analysis of their
underlying design (by revealing the evolutionary mechanisms responsible
for the resulting topologies) can be re-cast as problems in machine
learning: learning an accurate prediction function from high-dimensional
data. In the case of inferring biological networks, predicting up- or
down-regulation of genes allows us to learn ab initio the transcription
factor binding sites (or 'motifs') and to generate a predictive model of
transcriptional regulation. In the case of inferring evolutionary
designs, quantitative, unambiguous model validation can be performed,
clarifying which of several possible theoretical models of how biological
networks evolve might best (or worst) describe real-world networks. In
either case, by taking a machine learning approach, we statistically
validate the models both on held-out data and via randomizations of the
original dataset to assess statistical significance. By allowing the data
to reveal which features are the most important (based on predictive power
rather than overabundance relative to an assumed null model) we learn
models which are both statistically validated and biologically interpretable.
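The idea of assessing significance through randomizations of the original dataset can be illustrated with a generic label-permutation test. This is a minimal sketch, not the speaker's procedure. The function name, the `score` callable, and the default parameters are hypothetical.

```python
import random


def permutation_p_value(score, features, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones.

    `score(features, labels)` is any user-supplied function where larger
    values mean a better fit. The +1 terms count the observed labeling as
    one of the permutations, so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = score(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real feature-label association
        if score(features, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

A small p-value suggests that the score on the real data would be unlikely if features and labels were unrelated. Validation on held-out data is a separate check and is not shown here.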
References:
Manuel Middendorf, Anshul Kundaje, Chris Wiggins, Yoav Freund, and
Christina Leslie. Predicting genetic regulatory response using
classification. ISMB 2004; q-bio/0411028
Manuel Middendorf, Anshul Kundaje, Mihir Shah, Yoav Freund, Chris H.
Wiggins, and Christina Leslie. Motif discovery through predictive modeling
of gene regulation. RECOMB 2005.
M. Middendorf, E. Ziv, and C. H. Wiggins. Inferring network
mechanisms: the Drosophila melanogaster protein interaction network. PNAS
2005; q-bio/0408010.
Manuel Middendorf, et al. Discriminative topological features reveal
biological network mechanisms. BMC Bioinformatics 2004;
q-bio/0402017.
Biography:
Chris Wiggins is an applied mathematician with a PhD in theoretical
physics (Princeton, 1998) working in applications of information theory,
inference, and machine learning in biology. Since 2001 he has been
affiliated with the Department of Applied Physics and Applied Mathematics
and the Center for Computational Biology and Bioinformatics (C2B2) at
Columbia University. Previously, he was a Courant Instructor (1998-2001)
at the Courant Institute, NYU. He has held visiting appointments at
Institut Curie (Paris), the Hahn-Meitner Institut (Berlin), and the Kavli
Institute for Theoretical Physics (UCSB).
2 p.m. Thursday November 15, 2007
Title: Phylogenetic Estimation for Complex Evolutionary Processes
Speaker: Li-San Wang, Ph.D.
(Penn Center for Bioinformatics, University of Pennsylvania)
Venue: Biomolecular Science Building Room 3118
Abstract:
Phylogeny reconstruction using stochastic sequence evolution models has
been highly successful in reconstructing the evolutionary history of genes
and species. These standard evolutionary models have two essential
features, both of which are known to fail for a wide range of real
biological data: (1) the domain of mutation is a concatenation of multiple
independently distributed sites, each following a simple, identical
stochastic process, and (2) the evolutionary history is a branching
process (tree). In contrast, complex evolutionary processes lack either
of the two features of standard models: in both cases, new stochastic
models need to be developed, and the inference of evolutionary histories
under these models is much harder - in some cases simply computationally
more intense, but in other cases posing significant and new algorithmic
challenges. I will cover two such processes: the process of gene order
evolution, and the process of horizontal gene transfer.
In the second half of my talk I will present two applications of using
phylogenetics to model biomedical data as complex branching processes:
gene expression progression in cancer, and population stratification in
genome-wide association studies.
Biography:
Li-San Wang received his B.S. (1994) and M.S. (1996) in Electrical
Engineering from the National Taiwan University. He received his M.S.
(2000) and Ph.D. (2003) from the University of Texas at Austin, both in
Computer Sciences, and was a postdoctoral fellow at the University of
Pennsylvania between 2003 and 2006. Currently he is an Assistant Professor
of Pathology and Laboratory Medicine, Penn Center for Bioinformatics, and
a fellow of the Institute on Aging, University of Pennsylvania. Dr. Wang's
research interests include phylogenetics, comparative genomics,
genome-wide association studies, and microarray analysis. He served on the
program and organizing committees of several international workshops and
conferences including EITC, WABI, BIBE, and BIBM.
2 p.m. Thursday November 29, 2007
Title: Cost-effective assembly of genomes using optical maps
Speaker: Niranjan Nagarajan, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract:
New, high-throughput sequencing technologies have made it feasible to
cheaply produce vast amounts of sequence information regarding a genome of
interest. The information obtained, however, has features such as short
read-lengths and absence of mate-pairs that complicate computational
efforts to reconstruct the complete sequence of organisms. We propose
methods (to be freely available as an open-source package called SOMA) to
overcome the limitations of sequence data by reliably combining
information from optical maps. Extensive experiments with simulated
datasets demonstrate the robustness of these methods to sequencing and
assembly errors. We also present the results obtained by applying our
algorithms to data generated from two bacterial genomes,
Yersinia aldovae and Yersinia kristensenii. The resulting assemblies
provide a single scaffold covering a large fraction of the respective
genomes, suggesting that the careful use of optical maps can provide a
cost-effective framework for the assembly of genomes.
This is joint work with Dr. Mihai Pop.
2 p.m. Thursday December 6, 2007
Title: Multiple genome alignment and synteny chaining
Speaker: Samuel Angiuli
Venue: Biomolecular Science Building Room 3118
Abstract:
Multiple genome alignment programs aim to identify and align homologous
regions across multiple, large genomic sequences, such as whole genomes.
In addition, these programs should be robust in handling rearrangements,
duplications, and indels that arise at varying sizes during genome
evolution. A number of multiple genome aligners are readily available but
increasing numbers of sequenced genomes continue to provide challenges for
multiple genome alignment. Particular challenges include scalability and
usability, especially in visualization and data structures for navigating
complicated relationships in multiple genome alignment output, such as
multi-genome synteny. This talk will present preliminary and
ongoing work in evaluating and extending multiple genome alignment
programs to analyze genome architecture. We compare two current multiple
genome aligners, TBA and Mauve, and introduce a derivative of the modular
TBA program that uses NUCmer for pairwise alignments. We also show
preliminary results of a graph-based method for identification and
retrieval of syntenic blocks and breakpoints in alignments of bacterial
genomes. In addition, we've developed a number of visualization tools
that aid in multiple genome analysis and are part of an integrative
web-based software suite called Sybil.