CBCB Seminar Series
Spring 2008
2 p.m. Friday January 18, 2008
Title: Algorithms for Gene
Finding, Network Alignment, and Ancestral Population Inference
Speaker: Serafim Batzoglou, Ph.D.
(Stanford University)
Venue: Biomolecular Science
Building Room 3118
Abstract:
Genomics is rich with computational problems where algorithms and
statistical methods can have a big impact on data analysis and biological
discovery. Here, I will present three such problems.
1. Gene Finding. Given a sequenced genome, the first task is to find the
genes. This core bioinformatics problem is still largely open. The set of
human genes, for example, has not been finalized. Here, I will present
CONTRAST, a gene finder based on a CRF/SVM approach, which is the first
tool to show significant improvement in human gene finding by using
multiple sequence alignments as informants.
2. Network Alignment. Protein association networks summarize our knowledge
of which proteins work together in modules and networks to accomplish
complex biological processes. Many global protein interaction networks
have been predicted for organisms ranging from bacteria to human. Here, I
will present Graemlin, a system for comparing networks across organisms
and finding conserved modules - subgraphs of conserved proteins and their
associations.
3. Ancestral Population Inference. Projects like HapMap provide
whole-genome genotypes for diverse populations. Given a genotyped
individual, using such datasets we may attempt to predict the
allele-specific population source of the individual's chromosomes. I will
present HAPAA, a tool for accomplishing this task. Then, I will show that
ancestry inference can accurately extract the source populations of
admixtures that happened as far as 20 generations ago, covering much of
the modern history of population movements.
2 p.m. Thursday January 31, 2008
Title: Organizational meeting
By: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract: To discuss the schedule
for Spring 2008.
1 p.m. Wednesday February 6, 2008
Title: A New Approach to Protein
Structure Prediction
Speaker: Ming Li, Ph.D. (University of
Waterloo)
Venue: A.V. Williams Building Room
3258
Abstract:
Protein structure prediction has been a heuristic science. From homology
modeling, threading, to Monte Carlo fragment assembly, decoy clustering,
selection, refinement, and consensus, there is no unified model or theory
governing the complete process.
We believe the protein structure prediction problem will only be solved by
a simple computational model. We wish to find a single and simple
mathematical model that encompasses all of the above paradigms and that takes
a sequence and converges to a near-native protein structure. We propose
such a theory, integrating ideas from fragment assembly, hidden Markov
model sampling, and Ramachandran basins. Our initial implementation,
FALCON, of this theory converges to near-native structures on short
benchmark protein sequences, and produces significantly better protein
structures than the best programs in this field.
Joint work with: Shuaicheng Li, Dongbo Bu, and Jinbo Xu.
2 p.m. Thursday February 7, 2008
Title: Systems Analysis of Pollen
Function: How can bioinformatic and computational approaches reveal
insights?
Speaker: Heven Sze,
Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
Male fertility depends on the proper development of the male gametophyte,
successful pollen germination, tube growth and delivery of the sperm cells
to the ovule. The dynamic interactions between the growing pollen tube and
the pistil provide an exemplary network that is amenable to
high-throughput approaches for determining gene function at the systems
level. Approximately 10% of the Arabidopsis genome is preferentially or
specifically expressed in pollen; however, a significant number of these
genes have yet to be assigned any function. Furthermore, comprehensive
understanding of genes important for successful fertilization could
enhance seed production and enable improved control over plant
reproduction.
Several plant biologists are interested in using a multi-pronged and
integrated approach to functionally analyze Arabidopsis genes that are
specifically or preferentially expressed in pollen. The genes include
those of unknown function (1/4) and those representing functional classes
likely to mediate pollen tube growth and guidance (signaling,
transporters, cytoskeleton, cell wall modification). Proposed approaches
will: 1) Identify and analyze genes specifically expressed in pollen; 2)
Identify mutants and employ assays to determine the function of each gene
in pollen tube growth and guidance; 3) Determine the sub-cellular
localization of each protein; 4) Identify protein interaction partners in
pollen; and 5) Elucidate networks of co-functional genes using
computational approaches that integrate phenotypic, localization,
transcriptome and protein:protein interaction data.
These studies will constitute important steps toward a comprehensive
understanding of pollen tube growth and guidance and are essential to
reach the goal of understanding the function of all Arabidopsis genes.
References:
Honys D, Twell D (2004) Transcriptome analysis of haploid male
gametophyte development in Arabidopsis. Genome Biol 5: R85
Bock KW, Honys D, Ward JM, Padmanaban S, Nawrocki EP, Hirschi KD,
Twell D, Sze H (2006) Integrating membrane transport with male
gametophyte development and function through transcriptomics. Plant
Physiol 140: 1151-1168
2 p.m. Thursday February 14, 2008
Title: A Framework for Discovering
Associations from the Annotated Biological Web
Speaker: Woei-Jyh (Adam) Lee
Venue: Biomolecular Science
Building Room 3118
Abstract:
During the last decade, biomedical researchers gained access to the entire
human genome, reliable high-throughput biotechnologies, and affordable
computational resources and network access. In combination, these new
tools created a new model for biomedical research that no longer uses
computational tools merely to monitor research, but instead exploits these
tools to acquire knowledge and make discoveries. We have developed a tool
to discover meaningful patterns across resources and ontologies. These
patterns, corresponding to associations of pairs of CV (controlled
vocabulary) terms, may yield actionable nuggets of previously unknown
knowledge. Moreover, the bridge of associations across CV terms will
reflect the practice of how scientists annotate data.
We execute a protocol to follow hyperlinks, extract annotations, and
generate background LSLink (Life Sciences Link) datasets of
termlinks. We then mine the termlinks to find potentially meaningful
associations. We use two classes of metrics to identify significant
associations of pairs of CV terms. The first class is based on the LOD
(logarithm of the odds) ratio and is a measure of the extent to which a
specific association of CV terms deviates from one resulting from chance
alone (a random association). The second class of metric is based on the
hypergeometric distribution; it gives a quantification of the level of
one's surprise at finding over-representation for a particular pair of
CV terms in a user dataset, in comparison to the background dataset.
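The two classes of metric described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the exact formulas, so the LOD score here uses one common definition (log10 of observed over expected co-occurrence frequency), and the function and parameter names are hypothetical.

```python
from math import comb, log10


def lod_score(n_pair, n_a, n_b, n_total):
    """Assumed LOD form: log10(observed / expected) co-occurrence frequency.

    n_pair  -- termlinks containing both CV terms
    n_a     -- termlinks containing the first term
    n_b     -- termlinks containing the second term
    n_total -- all termlinks in the dataset
    """
    observed = n_pair / n_total
    expected = (n_a / n_total) * (n_b / n_total)  # frequency under independence
    return log10(observed / expected)


def hypergeom_tail(k, draws, successes, population):
    """P(X >= k) for a hypergeometric variable X.

    Models drawing `draws` termlinks (the user dataset) without replacement
    from a background of `population` termlinks, of which `successes`
    carry the term pair of interest.
    """
    total = comb(population, draws)
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total
```

A positive LOD score suggests the pair co-occurs more often than chance alone would predict, and a small tail probability suggests the pair is over-represented in the user dataset relative to the background.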
We also exploit knowledge of the underlying ontologies. We aggregate
termlinks and develop an extension to confidence and support. We
illustrate the benefits of exploiting ontology structure using an
experimental dataset of an OMIM record related to a disease that is
hyperlinked to a set of gene records from Entrez Gene that are themselves
hyperlinked to a set of publications from PubMed.
This is ongoing research with Louiqa Raschid, Hassan Sayyadi and
Padmini Srinivasan.
11 a.m. Tuesday February 19, 2008
Title: Bayesian analysis of
complex biological systems
Speaker: Edo Airoldi, Ph.D.
(Princeton University)
Venue: A.V. Williams Building Room 2460
Abstract:
Modern technology has transformed the concept of data in the biological,
social and computational sciences. Data collections about a number of
biological systems, for instance, have grown large and heterogeneous, in
terms of both the units of analysis and the measurements on such units.
Measurements are typically collected over time, as non-observable
mechanisms of interest unfold. This increase in the complexity of the
observations, however, has hardly translated into a richer understanding
of mechanisms and principles that can explain them. This problem is
paramount in systems biology, where large-scale and high-throughput
measurements of genes, proteins or enzymes promise fundamental insights
into signaling and metabolic pathways in the cell, and the development of
disease.
In this talk, I will introduce mechanistic models and inference algorithms
for the analysis of complex biological systems with the goals of testing
substantive hypotheses, making predictions, and driving further
experimentation, with a focus on scalability and data integration issues.
In particular, I will demonstrate two advantages of the mechanistic
approach to systems biology. In a first case study, a model of how
proteins interact allows us to ground the analysis in the context of accepted
theories and empirical observations about the cell, and posterior
(variational) inference reveals proteins' multifaceted functional role.
This model leads to predictions of cellular events that we can measure;
the goal is to drive experimentation in large event spaces. In a second
case study, a model of cellular growth allows us to identify
growth-specific programs of gene expression and suggests the notion of
"effective growth rate" of a cellular culture. This model opens a window
on cellular events that we cannot measure, by quantifying effective growth
at a finer temporal resolution than that accessible with technology,
minutes rather than hours. These results contribute to a system-level
understanding of the connections among growth rate, metabolism,
environmental stress response, and the cell division cycle.
Biography:
Edo Airoldi is a postdoctoral fellow at Princeton University, affiliated
with the Department of Computer Science and the Lewis-Sigler Institute
for Integrative Genomics. His research interests include statistical
methodology, machine learning, mechanistic models of complex systems and
random graph dynamics, with application to the biological and social
sciences.
11 a.m. Thursday February 21, 2008
Title: Algorithms for Discovering
Variation with (Short) Reads
Speaker: Michael Brudno, Ph.D.
(University of Toronto)
Venue: Biomolecular Science
Building Room 3118
Abstract:
In this talk I will demonstrate two algorithms for discovering variation
from whole genome shotgun sequencing data. First, I will present the
SHRiMP tool for short read mapping. SHRiMP combines a novel spaced seed
filtering step with a very fast implementation of the Smith-Waterman
algorithm to map letter or color-space reads to a reference genome. We
implement a specialized algorithm for aligning color space reads in letter
space that can handle sequencing errors in a rigorous fashion. Second, I
will present an algorithm for finding structural variations (large scale
insertions, deletions, and inversions in the genome) using clone end
sequence data. Unlike previous methods, our approach does not rely on an a
priori determined mapping of all reads to the reference. Instead, we build
a framework for finding the most probable assignment of sequenced clones
to potential structural variants based not only on the level of sequence
similarity, but also based on the other clones. We use this algorithm to
compare the genome of the JCVI donor individual to the reference NCBI
human genome, isolating a number of indel and inversion events, as well as
a small number of inter-chromosomal events. In particular we will
illustrate a deletion found in the DMBT1 gene of the JCVI donor that has
previously been linked to the progression of cancers.
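For readers unfamiliar with the Smith-Waterman algorithm mentioned in the abstract, a plain textbook version of its scoring recurrence is sketched below. It is not SHRiMP's implementation, whose seed filtering and color-space handling are not detailed here; the scoring parameters are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Keeps only the previous row of the dynamic programming table, so
    memory use is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)  # 0 restarts the alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

With these assumed parameters, two reads sharing the exact substring ACGT and nothing else score 8, that is, four matches at 2 each.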
11 a.m. Monday February 25, 2008
(CBCB Candidate Talk)
Title: Evolution of gene
architecture: challenging the neutral paradigm
Speaker: Liran
Carmel, Ph.D. (Visiting Fellow, NCBI, NLM, National Institutes of
Health)
Venue: A.V. Williams Building Room
2328
Abstract:
Many eukaryotic genes are fragmented along the DNA, interrupted by
noncoding segments called introns. Until recently, the prevailing concept
held that introns evolve by nonadaptive forces as slightly deleterious
elements that lack any function. However, evidence has been accumulating
showing that there are anecdotal exceptions to this concept. In my talk, I
will present a thorough large-scale study that implies a functional role
for a previously unappreciated large fraction of introns, highlighting
their importance to the rapidly rising interest in functional noncoding
elements. To this end, I will lay out a comprehensive model for the
evolution of gene architecture, and will introduce an intron-exon data set
that is significantly larger than previously studied ones. To obtain a
definitive reconstruction of gene architecture evolution, we interpret the
phylogenetic tree as a graphical model, and develop an
expectation-maximization algorithm to estimate the parameters of the
model. We use a realization of the junction tree algorithm to compute the
sufficient statistics that are required for the expectation step.
This work culminated in several observations, some of which revise common
beliefs in the field. Taken together, these findings put forward the
possibility that once introns had invaded early eukaryotic genomes in an
arguably nonadaptive fashion, many were exploited in novel ways, gradually
gaining diverse functions, up to the point that probably only a few of
today's eukaryotes could survive without them. The results of this study
were integrated with whole-genome multivariate analysis at the systems
level, showing that genes with high expression level and low sequence
evolutionary rate have a tendency to accumulate introns. These ideas gain
further credence from the fact that the positions of many exon boundaries
are known to be shared between distant eukaryotic taxa, e.g., in my data
set 25% of the intron positions are shared between plants and animals.
This observation can be explained by either remarkable conservation of
ancient introns or by parallel, independent, intron gain at the same
positions. Using my algorithm, a calculation of the relative contributions
of the two factors reveals that shared ancestry is by far the dominant one
(for example, more than 80% of the introns shared by plants and animals
are due to shared ancestry). While a mechanistic explanation cannot be
ruled out, such an impressive endurance of a substantial fraction of the
introns is likely to reflect their functional importance. Consequently, I
suggest using conserved intron positions as a novel tool for identifying
functional noncoding elements.
Biography:
Dr. Carmel received his Ph.D. in Computer Science and Applied Mathematics
from the Weizmann Institute of Science, Israel, in 2004. In his Ph.D., he
studied the mathematics and algorithms of odor digitization and
communication. Since 2004, he has been a Visiting Fellow in Eugene Koonin's
molecular evolution group at NIH/NLM/NCBI. There, he studies many topics
of evolutionary genomics. In particular, he has investigated questions
regarding eukaryotic gene architecture.
11 a.m. Tuesday March 4, 2008
(CBCB Candidate Talk)
Title: How do Genes Evolve?
Computational Approaches for Investigating Molecular Evolutionary
Heterogeneity
Speaker: Bryan
Kolaczkowski, Ph.D. (Postdoctoral Research Scientist, University of
Oregon's Center for Ecology and Evolutionary Biology)
Venue: A.V. Williams Building Room
2460
Abstract:
One of the principal problems in modern biology is understanding how
evolutionary changes in molecular sequence lead to functional and
phenotypic differences among species. It is impossible to examine all
molecular changes experimentally due to the overwhelming volume of
sequence data, so computational methods are used to predict which changes
are likely to be functionally important and which are not. Here I
introduce a powerful new method for predicting functionally important
evolutionary changes. The method is based on a heterogeneous phylogenetic
model of molecular evolution and stems from the observation that
evolutionary forces act differently at different positions in the molecule
and regularly change over time. These changes in evolutionary pressures
leave a signature at the sequence level that can be used to understand how
the molecules have evolved. I show that this heterogeneous model provides
an improved fit to empirical sequence data compared to existing models and
improves the quality of inferred evolutionary relationships, providing a
more accurate framework for interpreting comparative results. I also show
how this model can predict the specific evolutionary forces acting at each
position in the molecule, providing a detailed description of molecular
evolution that can be used to make functional predictions. Complex
evolutionary models like the one I describe pose significant computational
challenges, particularly increased algorithmic complexity and the
potential for model overfitting. I highlight some of these potential
drawbacks and the techniques I am using to address them. Heterogeneous
models like the one I have developed can potentially drive scientific
discovery by sifting through copious molecular sequence data to generate
strong testable hypotheses.
Biography:
In 2006, Bryan received his PhD in computer science from the University of
Oregon, where he was trained in both computer science and biology through
an NSF IGERT program in evolution, development and genomics. Since then,
Bryan has been working as a postdoctoral scientist at the University of
Oregon's Center for Ecology and Evolutionary Biology. Bryan's research
focuses on developing statistical and computational methods for
investigating molecular evolution and using these methods to address
important biological questions.
2 p.m. Wednesday March 5, 2008
(Cell Biology and Molecular Genetics
Special Seminar)
Title: Integrated genomics and
computational systems biology for tuberculosis
Speaker: James Galagan,
Ph.D. (Associate Director of Microbial Genome Analysis, Broad
Institute of MIT and Harvard)
Venue: Biosciences Research
Building Room 1103
Abstract:
The combination of computational biology and genomic technology is
providing new approaches to the study of microbiology and infectious
disease. For the first time we are in a position to construct a
comprehensive view of the molecular networks underlying microbial
physiology and pathogenesis. This in turn promises to have a direct
impact on public health and clinical care. In this talk I will describe
computational genomic approaches to studying the metabolic and genetic
networks of Mycobacterium tuberculosis, the causative agent of TB.
Metabolic changes are a critical component of TB pathogenesis and latency,
and many first-line TB drugs target metabolism. To better study TB
metabolism, we developed computational methods to model metabolic
networks. In particular we have developed a novel approach to coupling
expression array data with computational flux balance analysis to predict
metabolic state from gene expression state. I will describe this method
and our application of it to predict the metabolic impact of drugs and
environmental conditions on mycolic acid biosynthesis. I will also
describe a related method we developed to predict nutrient and
environmental conditions from metabolic state predictions, a potential
tool for investigating the phagosomal environment within which the
intracellular pathogen TB lives.
Metabolic changes are orchestrated by gene regulatory changes, and I seek
to understand the regulatory programs at work during TB pathogenesis at
the level of genes, operons, and the full regulatory network. I will
describe our work to computationally annotate genes and operons, and in
particular the development of a method for sequence annotation using
Conditional Random Fields. I will also describe our strategy to combine
comparative analysis, expression mining, and ChIP-Seq to map TB regulatory
networks. Ultimately, my goal is to derive an integrated model of TB
metabolism and gene regulation that can be used to address fundamental
questions concerning TB pathogenesis, persistence, and drug resistance.
Biography:
James Galagan is Associate Director of Microbial Genome Analysis at the
Broad Institute of MIT and Harvard. He oversees the analysis of genomic
data generated by the Microbial Sequencing Center and the Fungal Genome
Initiative. His group develops and applies computational and genomic
methods to study regulation and evolution with particular focus on the
biology of infectious disease. Current projects include comparative
analysis of Plasmodium spp., Mycobacterium tuberculosis, and Cryptococcus
spp. In addition, James has NSF support to develop and apply computational
tools for comparative fungal genome analysis.
11 a.m. Thursday March 6, 2008
(CBCB Candidate Talk)
Title: Integrative structural and
functional analysis to study proteins and their interactions
Speaker: Anna Panchenko,
Ph.D. (Associate Investigator, NCBI, NLM, National Institutes of
Health)
Venue: A.V. Williams Building Room
2460
Abstract:
While several hundreds of complete genomes have been sequenced so far, the
biological roles of many gene products remain uncharacterized. The
integrative approach combining sequence, structural and functional
analyses of proteins will make it possible to tackle this problem on a
large scale. One aspect which complicates such analyses is that proteins
are under constant scrutiny imposed by natural selection and must be
viewed within the context of evolutionary history. To this end we analyze
the patterns of evolutionary conservation for different protein regions
(cores and loops) and identify useful metrics for detecting remote
evolutionary relationships, finding interesting structural similarities,
and predicting the locations of functionally important sites.
Recent studies have shown that genomes are extremely complex, with the
numerous gene products working together to perform specific cellular
functions. Indeed, it is evident now that the vast majority of proteins
interact with multiple partners and form intricate interaction networks.
We analyze the conservation and diversity of protein-protein interactions
using the example of domain-domain interactions in the structural databank,
infer the biological roles of interaction interfaces, model the evolution
of different domain rearrangements and compile a set of interacting
domains which can be used in homology modeling.
11 a.m. Tuesday March 11, 2008
(CBCB Candidate Talk)
Title: Computational Metagenomics:
Algorithms for Understanding the "Unculturable" Microbial Majority
Speaker: Sourav Chatterji,
Ph.D. (Postdoctoral Scholar, Genome Center, University of California
at Davis)
Venue: A.V. Williams Building Room
2460
Abstract:
Metagenomics, the application of genome sequencing techniques to
unculturable microbial communities, is revolutionizing microbiology and has
shed light on the role of these communities in our environment as well as
human health. Unlike traditional genomics data, a metagenomic data-set is
made up of sequence reads from multiple species with varying relative
abundance. Consequently, metagenomic data is mosaic and fragmentary in
nature, necessitating the development of new methods for its analysis.
My presentation will describe computational methods that we have developed
for analyzing metagenomic data. First, I will introduce CompostBin, a DNA
composition based algorithm that we have developed for classifying
metagenomic sequences into taxa-specific bins. Then, I will discuss a
pipeline for phylogenomic analysis of metagenomic data. Finally, I will
discuss how these methods fit into the big picture of the profiling of
microbial communities.
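The idea of grouping sequences by DNA composition can be illustrated with a small sketch. This is not the CompostBin algorithm itself, whose details are not given in the abstract; it only shows, under that assumption, how reads can be turned into k-mer frequency vectors and compared. The function names and the choice of k are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return the relative frequency of every DNA k-mer in seq.

    The vector is ordered alphabetically over the ACGT alphabet;
    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
```

Reads from genomes with similar composition should yield nearby vectors, so a clustering step applied to these vectors could then assign reads to bins.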
Biography:
Sourav Chatterji is a Postdoctoral Scholar at the UC Davis Genome Center. He
graduated from UC Berkeley with a Ph.D. in Computer Science and a
Designated Emphasis (minor) in Computational and Genomic Biology. One of
his main contributions there was the development of GeneMapper, a
reference based genome annotation program. GeneMapper has been used
extensively for annotating genomes and studying genome evolution across a
variety of eukaryotic clades. More recently, he has been working with
Prof. Jonathan Eisen at UC Davis on computational methods in metagenomics.
11 a.m. Wednesday March 12, 2008
(CBCB Candidate Talk)
Title: Estimating the significance
of sequence motifs
Speaker: Uri Keich, Ph.D. (Assistant
Professor, Department of Computer Science, Cornell University)
Venue: A.V. Williams Building Room
2460
Abstract:
The identification of transcription factor binding sites is an important
step in understanding the regulation of gene expression. To address this
need, many motif-finding tools have been described that can find short
sequence motifs given only an input set of sequences. Our talk is
dedicated to the computational analysis of the significance of the motifs
reported by those motif finders. Somewhat surprisingly, development of
this significance analysis has lagged considerably behind the extensive
development of the finders themselves. Nevertheless, this analysis is
often the only information available to biologists when deciding whether
or not to invest the resources required to verify the predictions of those
finders.
Biography:
Uri Keich has been an assistant professor in the Department of Computer
Science at Cornell University since 2003. He has won an NSF CAREER award
as well as the Wilhelm T. Magnus Memorial Prize for Significant
Contributions to the Mathematical Sciences given by the Courant Institute
at NYU.
2 p.m. Thursday March 27, 2008
Title: Two-Sided Relative Ranking:
A Robust Indirect Similarity Measure for Gene Expression Data
Speaker: Louis Licamele
Venue: Biomolecular Science
Building Room 3118
Abstract:
There is a wealth of gene expression data available in the public domain.
However, because of variations in experimental conditions, it is often
difficult to combine information across the different sources. In this
paper we present a new method, which we refer to as indirect two-sided
relative ranking, for comparing gene expression probes which is robust to
variations in experimental conditions. This method extends the current
best approach that is based on comparing the correlations of the up and
down regulated genes with a comparison based on the correlations in
rankings across the entire database. We evaluate the ability of this
method to retrieve compounds with similar therapeutic effects across known
experimental barriers, namely vehicle and batch effects, on two different
datasets. We show that our indirect method is able to improve upon the
previous state of the art method on both datasets with a substantial
improvement in ranked recall of 97.03% and 49.44% respectively.
11 a.m. Tuesday April 1, 2008
(CBCB Candidate Talk)
Title: Machine learning approaches
for understanding the genetic basis of complex traits
Speaker: Su-In Lee (PhD candidate,
Department of Computer Science, Stanford University)
Venue: A.V. Williams Building Room
2460
Abstract:
Humans differ in many "phenotypes" such as weight, hair color and more
importantly disease susceptibility. These phenotypes are largely
determined by each individual's specific genotype, stored in the 3.2
billion bases of his or her genome sequence. Deciphering the sequence by
finding which sequence variations cause a certain phenotype would have a
great impact. The recent advent of high-throughput genotyping methods has
enabled retrieval of an individual's sequence information on a genome-wide
scale. Classical approaches have focused on identifying which sequence
variations are associated with a particular phenotype. However, the
complexity of cellular mechanisms, through which sequence variations cause
a particular phenotype, makes it difficult to directly infer such causal
relationships. In this talk, I will present machine learning approaches
that address these challenges by explicitly modeling the cellular
mechanisms induced by sequence variations. Our approach takes as input
genome-wide expression measurements and aims to generate a finer-grained
hypothesis such as "sequence variations S induce cellular processes M,
which lead to changes in the phenotype P". Furthermore, we have developed
the "meta-prior algorithm" which can learn the regulatory potential of
each sequence variation based on its intrinsic characteristics. This
improvement helps to identify a true causal sequence variation among very
many variations in the same chromosomal region. Our approaches have led to
novel insights on sequence variations, and some of the hypotheses have
been validated through biological experiments. Many of the machine
learning techniques are generally applicable to a wide-ranging set of
applications, and as an example I will present the meta-prior algorithm in
the context of movie rating prediction tasks using the Netflix data set.
Biography:
Su-In Lee is a Ph.D. candidate at Stanford University, where she is a
member of the Stanford Artificial Intelligence Laboratory. Her research
focuses on devising computational methodologies for understanding the
genetic basis of complex traits. She is also interested in developing
general machine learning algorithms for broader applications. Su-In
graduated Summa Cum Laude with a B.Sc. in Electrical Engineering and
Computer Science from Korea Advanced Institute of Science and Technology
and was a recipient of the Stanford Graduate Fellowship.
2 p.m. Thursday April 3, 2008
Title: Using Similarity Flooding
for Extracting Similar Parts of Proteins
Speaker: Hassan Sayyadi
Venue: Biomolecular Science
Building Room 3118
Abstract:
Proteins are the main players in the game of life. Good understanding of
their structures, functions, and behaviors leads to good understanding of
drugs, diseases, and thus our health. Much effort has therefore been devoted
to studying and categorizing proteins. Nowadays, tens of thousands of proteins
have been found. Moreover, the problem of comparing the proteins is hard.
Therefore, efficient methods are needed to deal with this problem. We use
an important computational geometry concept and a graph matching algorithm,
namely "Delaunay Tetrahedralization" and "Similarity Flooding", and
propose a new idea to extract similar parts of proteins. Furthermore, we
used protein fragmentation to reduce the time and storage complexity of
the model for larger proteins.
2 p.m. Thursday April 10, 2008
Title: The iPlant Collaborative
inaugural conference "Bringing Plant and Computing Scientists Together
to Solve Plant Biology's Grand Challenges"
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
The purpose of this conference is to 1) explain the nature of the project
and 2) facilitate community discussion of what are the most compelling
grand challenges, as well as the data, computational tools, and
cyberinfrastructure necessary to solve those grand challenges.
2 p.m. Thursday April 17, 2008
Title: Improving Reliability of
Peptide Identification by Statistical Machine Learning
Speaker: Xue Wu
Venue: Biomolecular Science
Building Room 3118
Abstract:
Peptide identification by tandem mass spectrometry (MS/MS) is the dominant
proteomics workflow for protein characterization in complex samples. In
this talk, I present two approaches for improving peptide identification
reliability using statistical machine learning.
HMMatch is a hidden Markov model approach to spectral matching, in which
many examples of a peptide's fragmentation spectrum are summarized in a
generative probabilistic model that captures the consensus and variation
of each peak's intensity.
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine
learning based algorithm for unifying current peptide identification
software. It provides better specificity and sensitivity by effectively
utilizing multiple tandem MS search engines and additional spectral
features. We demonstrate that both approaches achieve better accuracy
than popular peptide identification software by extracting and
using more of the information hidden in protein tandem mass spectra.
2 p.m. Thursday April 24, 2008
Title: Statistical Methods for
Detecting Differentially Abundant Taxa in Metagenomic Samples
Speaker: James Robert White
Venue: Biomolecular Science
Building Room 3118
Abstract:
Numerous studies are currently underway to characterize the microbial
communities inhabiting our world. These studies will dramatically expand
our understanding of the microbial biosphere and, more importantly, will
reveal the secrets of the complex symbiotic relationship between us and
our commensal bacterial communities. An important prerequisite for such
discoveries is the availability of computational tools able to rapidly
and accurately compare large datasets generated from complex bacterial
communities. I
will describe a statistical method for detecting differentially abundant
organisms between two populations using count data (e.g. 16S rRNA
surveys). In high-complexity environments, our method employs the false
discovery rate to improve specificity, and it properly handles
low-abundance taxa. To demonstrate the use of our tool, I shall present
comparisons of publicly available human and mouse gut microbiome datasets,
identifying differences between these bacterial populations at different
levels of resolution. Furthermore, we have re-analyzed the data generated
in a recent study on obesity and identified a previously uncharacterized
difference between the gut flora of obese and lean human subjects.
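As an illustration of false discovery rate control over per-taxon tests, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the speaker's method, which the abstract does not specify; the function name, the taxon labels, and the p-values are hypothetical and chosen only for the example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if that hypothesis is
    rejected by the Benjamini-Hochberg procedure at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) such that
    # p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per taxon, from some two-population test.
taxa = ["taxon_A", "taxon_B", "taxon_C", "taxon_D", "taxon_E"]
taxon_p_values = [0.001, 0.008, 0.039, 0.041, 0.60]

flags = benjamini_hochberg(taxon_p_values, alpha=0.05)
significant_taxa = [name for name, flag in zip(taxa, flags) if flag]
print(significant_taxa)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.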
2 p.m. Thursday May 1, 2008
Title: Protein interaction
networks in viruses and bacteria: going beyond eukaryotes
Speaker: Peter Uetz, Ph.D. (J. Craig Venter
Institute)
Venue: Biomolecular Science
Building Room 3118
Abstract:
Protein interaction mapping studies have previously focused on yeast and
other eukaryotes. However, more recently, several groups have published
more or less "comprehensive" protein interaction datasets for bacteria,
namely E. coli, Campylobacter jejuni, and Treponema pallidum. In addition,
there are "incomplete" datasets for several other species. The yeast
datasets have been large enough to support the development of many
bioinformatics tools that analyze them and combine, correlate, and compare
them with other data; such analyses are much less developed for bacteria.
I will present
data from our own project on Treponema pallidum, the syphilis spirochete,
and discuss what we can (and cannot) learn from it.
We have also carried out several projects on systematic protein interaction
mapping in human viruses and are about to start similar projects for
bacteriophages. Although we are still at the beginning, such studies will
provide a starting point for host-pathogen systems biology.
2 p.m. Thursday May 8, 2008
Title: (Re)-assembly of the cow
genome
Speaker: Guillaume Marcais
Venue: Biomolecular Science
Building Room 3118
Abstract:
We present here a new assembly of the cow genome and some of the methods
used to create it. This new assembly is better, both qualitatively and
quantitatively, than the previous one produced by the Baylor College of
Medicine.
Title: Improving Draft Assemblies
using Existing Data
Speaker: Poorani Subramanian
Venue: Biomolecular Science
Building Room 3118
Abstract:
We will introduce a simple algorithm for closing gaps and fixing
misassemblies in draft genomes.
2 p.m. Thursday May 15, 2008
Title: Development of a
phylogenomics pipeline for the analysis of genomic data from the
haptophyte Emiliania huxleyi and EST data from dinoflagellates
Speaker: John J. Miller
Venue: Biomolecular Science
Building Room 3118
Abstract:
Haptophytes and dinoflagellates are prominent members of marine
phytoplankton and are responsible for a significant portion of global
primary productivity. Both groups have unique cytological features
including secondary or tertiary plastids. I have been working on an
analytical pipeline to sift through the genome of the haptophyte
Emiliania huxleyi and EST data from various dinoflagellates and
produce preliminary phylogenetic analyses. I will talk about the
development of my pipeline and present preliminary results. Several
problems remain. One is that many queries get few, if any, hits.
Another is that the resulting trees frequently include aberrantly
placed long-branch taxa. These misplaced taxa may result from reversed
sequence polarity or from sequences that cover only a limited portion
of the gene.