CBCB Seminar Series
Fall 2006
2 p.m. Thursday August 31, 2006
Title: Organizational meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract: To discuss the
schedule in Fall 2006.
2 p.m. Thursday October 5, 2006
Title: Trees, graphs, and other
visualizations for evolution and ecology
Speaker: Cynthia Sims Parr, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
Interactive visualizations help biologists to conduct research at
increasingly large scales. At the Human-Computer Interaction Laboratory we
have been designing and testing new techniques for interacting with real
datasets in a variety of domains. I focus here on datasets and tasks
useful for evolutionary and computational ecology. TaxonTree uses
animation and zooming to support incremental exploration and searching of
very large (>100,000 node) phylogenetic and taxonomic trees. This
application is currently used as a browsing interface to an online
encyclopedia and is being extended to support browsing of multiple
ontologies describing Lepidoptera relationships. DoubleTree couples
navigation of two trees with different topologies to aid comparison of
local branching differences. The metaphor "Plant a seed and watch it grow"
guided our development of TreePlus, a graph visualization using an
incremental tree-layout approach to support label-based exploration tasks
in networks. TreePlus has been used with food web and gene ontology
datasets. To manage and explore hundreds of mid-sized food web datasets,
we developed EcoLens, later generalized as NetLens. These database
visualizers provide multiple views onto highly complex datasets. Our
"taskonomy" for graph visualization provides a framework for future tool
development and comparison.
3:30 p.m. Tuesday October 10, 2006
Title: Improving motif finders
with faster and more accurate E-value estimates
Speaker: Niranjan Nagarajan, Ph.D.
(Cornell)
Venue: Biomolecular Science
Building Room 3118
Abstract:
Motif finding programs have been widely studied in computational biology
due to their application in a wide variety of sequence analysis tasks on
genomic and proteomic data. Typically, motif finding programs such as
CONSENSUS and MEME rely on optimizing an entropy score to find interesting
motifs. In practice, the motifs are then evaluated by a measure of
statistical significance such as an E-value to filter out false positives.
We show that the approximations used for computing E-values in motif
finders such as CONSENSUS and MEME can be quite far from the true values.
We instead propose a new algorithm using Fourier Transform based
techniques that can accurately and efficiently compute E-values. We then
apply these techniques to several motif finders to show that optimizing
E-values rather than the entropy score can significantly improve their
performance. Extending this idea to other motif models and scoring
functions is an interesting avenue for future research.
This talk is based on joint work with Uri Keich, Neil Jones and Patrick
Ng.
(This is a postdoctoral candidate talk.)
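As an illustration of the entropy score mentioned in the abstract above, here is a minimal sketch of one common form, the relative entropy of an ungapped alignment against a background distribution. The function name `entropy_score`, the uniform DNA background, and the example sites are assumptions made for illustration; this is not the scoring code of CONSENSUS or MEME, and it does not attempt the E-value computation that is the subject of the talk.

```python
import math
from collections import Counter


def entropy_score(sites, background=None):
    """Relative entropy (in bits) of equal-length aligned sites.

    Sums, over every column, freq * log2(freq / background) for each
    letter observed in that column.
    """
    if background is None:
        # Assumption: uniform background over the DNA alphabet.
        background = {base: 0.25 for base in "ACGT"}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    n_sites = len(sites)
    score = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base, count in counts.items():
            freq = count / n_sites
            score += freq * math.log2(freq / background[base])
    return score


# A perfectly conserved column contributes 2 bits under a uniform background;
# a column that matches the background contributes 0 bits.
conserved = entropy_score(["ACGT", "ACGT", "ACGT", "ACGT"])
uninformative = entropy_score(["A", "C", "G", "T"])
```

A higher score indicates a more conserved alignment, but the score alone does not say how surprising that conservation is for a given number and length of input sequences, which is why a significance measure such as an E-value is applied afterwards.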
2 p.m. Thursday October 12, 2006
Title: Genomic analysis of the
biomass conversion systems of the marine bacterium Saccharophagus
degradans 2-40
Speaker: Steven W. Hutcheson, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
Saccharophagus degradans 2-40 (Sde2-40) is an aerobic, gamma
subgroup proteobacterium (order, Alteromonadales) that can rapidly
decompose diverse whole plant materials as well as cellulosic biomass in
monoculture. It expresses multiple enzyme systems to degrade at least 11
different complex polysaccharides (CPs), including agar (agarose),
alginate, cellulose, chitin, fucoidan, laminarin, mixed β-glucans, pectin,
pullulan, starch and xylan. It also synthesizes several proteases and
lipases. To identify the genes for the functional carbohydrases acting on
these complex carbohydrates, the complete Sde2-40 genome sequence
was determined by DOE-JGI. The 5.05 Mb genome encoded 4009 gene models,
a comparatively low gene density for this group of bacteria. At least 111
gene models were identified that contained a homolog of a known
glycoside hydrolase (GH) domain, a carbohydrate-binding module (CBM)
typical of carbohydrases, or both. Collectively, 31 different classes of GH
domains were identified in the predicted carbohydrases. Through genetic,
proteomic and biochemical analyses, functional elements of agarolytic,
chitinolytic and cellulolytic systems have been characterized. Each of
these environmentally regulated systems utilizes freely secreted and
surface-associated enzymes to degrade the substrate and vector mono-, di-,
and oligo-saccharide products to the cell through strategic placement of
enzymes. Freely secreted enzymes of each system tend to be endo-acting
enzymes with multiple CBMs. At least one enzyme in each degradative
system appears to be an epicellular lipoprotein that has been demonstrated
or is predicted to be an exo-acting enzyme. A phosphorylic pathway for
cellulose degradation is proposed. Many of these enzymes appear to have
been acquired by the naturally competent Sde2-40 through
horizontal gene transfer or by domain shuffling. Several superintegrons
and associated satellites were also identified in the genome that involve
200 kb or more of the genome and appear to consist of recently acquired
DNA fragments.
6 p.m. Tuesday October 17, 2006
Title: NCBI's RefSeq and Entrez
Gene: a case study
Speaker: Donna Maglott, Ph.D.
(NIH/NLM/NCBI)
Venue: Computer Science
Instructional Center Room 2118
Abstract:
NCBI's Reference Sequence (RefSeq) collection is designed to provide a set
of standard, non-redundant sequences of genomes, RNAs and proteins of
major research organisms. These sequences are annotated as appropriate
with major features of interest, including genes, mRNAs, and coding
regions. An early consequence of the RefSeq project, therefore, was the
development of methods to identify and track genes and their attributes
with each sequence update. First made public as LocusLink, gene-specific
data are now reported through Entrez Gene. This talk will provide a brief
history of RefSeq, LocusLink, and Entrez Gene. Current data flows will be
discussed, including (1) gene definition from expressed sequences vs. from
genomic annotation, (2) integration of gene-specific attributes from
public databases, and (3) curation vs. computation.
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
(This is a guest talk at CMSC828U.)
Biography:
Dr. Maglott earned her Ph.D. in 1971 at the University of Michigan. Her
dissertation was on the structure and function of the 50S subunit of the
E. coli ribosome. After an extensive post-doc in early child care, she
had an academic position at Howard University where she worked on the
proteomics of that time, namely looking for changes in protein synthesis
in early sea urchin development via 2D gel electrophoresis. In 1986, she
accepted a position at the American Type Culture Collection in Rockville,
MD, where she got in on the ground floor of database development
supporting genomic research. Her major function there was to develop and
maintain relational databases to describe molecular reagents (probes,
vectors, recombinant hosts, clones) and their targets (genes, loci, and
polymorphisms). In 1998 she joined the staff of NCBI, where she developed
the databases to track the processing of sequences for the RefSeq project,
and to capture gene-specific attributes. Although her primary
responsibilities are currently Entrez Gene and RefSeq support, she also
contributes to NCBI's genome annotation pipelines, Map Viewer, and the
hosting of OMIM.
2 p.m. Thursday October 19, 2006
Title: Programmed ribosomal
frameshifting: it's not just for viruses any more
Speaker: Jonathan Dinman, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
In viruses, programmed -1 ribosomal frameshifting (-1 PRF) signals direct
the translation of alternative proteins from a single mRNA. Given that
many basic regulatory mechanisms were first discovered in viral systems,
the current study endeavored to: 1) identify -1 PRF signals in genomic
databases, 2) apply the protocol to the yeast genome, and 3) test selected
candidates at the bench. Computational analyses revealed the presence of
10,340 consensus -1 PRF signals in the yeast genome. Of the 6,353 yeast
ORFs, 1,275 contain at least one strong and statistically significant -1
PRF signal. Eight out of nine selected sequences promoted efficient levels
of PRF in vivo. These findings provide a robust platform for high
throughput computational and laboratory studies and demonstrate that
functional -1 PRF signals are widespread in the genome of S. cerevisiae.
The data generated by this study have been deposited into a publicly
available database called the PRFdb. The presence of stable mRNA
pseudoknot structures in these -1 PRF signals, and the observation that
the predicted outcomes of nearly all of these genomic frameshift signals
would direct ribosomes to premature termination codons, suggest two
possible mRNA destabilization pathways through which -1 PRF signals could
post-transcriptionally regulate mRNA abundance.
2 p.m. Thursday October 26, 2006
Title: Designing Tools for
cDNA-to-Genome Alignment
Speaker: Liliana Florea, Ph.D. (George
Washington University)
Venue: Biomolecular Science
Building Room 3118
Abstract:
Accurately and efficiently aligning cDNA sequences to a whole genome,
either from the same species or from a close relative, is a critical
component of any gene annotation project. We start by presenting our work
in designing cDNA-to-genome alignment programs to address these needs. One
important choice in designing alignment programs is the selection of
seeds, with spaced seeds recently emerging as the primary vehicle for
increasing alignment sensitivity. We describe our preliminary efforts in
selecting mathematically sensitive and specific spaced seeds, starting
from codon and mutation-sensitive models of alignments and sequences, and
suggest how they can be used to increase the accuracy of cDNA-to-genome
alignment programs.
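To make the notion of a spaced seed in the abstract above concrete, here is a minimal sketch. A seed is written as a string of 1s and 0s: positions marked 1 must match between the two sequences, and positions marked 0 are ignored. The function names `masked_key` and `seed_hits`, the seed pattern, and the toy sequences are assumptions made for illustration; this is not the speaker's alignment program, and it does not show how sensitive seeds are selected.

```python
from collections import defaultdict


def masked_key(seq, start, seed):
    """Return the letters of seq under the '1' positions of the seed."""
    return "".join(seq[start + i] for i, flag in enumerate(seed) if flag == "1")


def seed_hits(query, target, seed):
    """List (query_pos, target_pos) pairs where the spaced seed matches."""
    span = len(seed)
    # Index every window of the target by its masked key.
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[masked_key(target, t, seed)].append(t)
    # Look up every window of the query in that index.
    hits = []
    for q in range(len(query) - span + 1):
        for t in index.get(masked_key(query, q, seed), []):
            hits.append((q, t))
    return hits


# The two sequences differ at their third position. The spaced seed ignores
# that position and still reports a hit; a contiguous seed of the same span
# does not.
spaced = seed_hits("ACGT", "ACTT", "1101")
contiguous = seed_hits("ACGT", "ACTT", "1111")
```

In a real aligner each hit would be the starting point for an extension step; the sketch stops at hit detection.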
Biography:
Dr. Liliana Florea is an Assistant Professor in the Computer Science
Department at the George Washington University, with a specialty in
Computational Biology and Bioinformatics, and a member of the Biochemistry
Department faculty at the GWU Medical School. Prior to joining the George
Washington University in 2005 she was a Senior Scientist at Celera
Genomics and Applied Biosystems. Her research interests revolve around
applying sequence analysis techniques to genome comparison, automatic gene
annotation, comparative genomics, analysis of alternative splicing and its
regulation, and computational vaccine design. She holds a Ph.D. degree in
Computer Science and Engineering from Penn State University (2000).
2 p.m. Thursday November 2, 2006
Title: Randomized Motion Planning:
From Intelligent CAD to Computer Animation to Protein Folding
Speaker: Nancy Amato, Ph.D. (Texas A&M)
Venue: A.V. Williams Building
Room 3258
Abstract:
Motion planning arises in many application domains such as computer
animation (digital actors), mixed reality systems and intelligent CAD
(virtual prototyping and training), and even computational biology and
chemistry (protein folding and drug design). Surprisingly, a single class
of planners, called probabilistic roadmap methods (PRMs), has proven
effective on problems from all these domains. Strengths of PRMs, in
addition to versatility, are simplicity and efficiency, even in
high-dimensional configuration spaces.
In this talk, we describe the PRM framework and give an overview of
several PRM variants developed in our group. We describe in more detail
our work related to virtual prototyping, computer animation, and protein
folding. For virtual prototyping, we show that in some cases a hybrid
system incorporating both an automatic planner and haptic user input leads
to superior results. For computer animation, we describe new PRM-based
techniques for planning sophisticated group behaviors such as flocking and
herding. Finally, we describe our application of PRMs to simulate
molecular motions, such as protein and RNA folding. More information
regarding our work, including movies, can be found at
http://parasol.tamu.edu/~amato/.
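For readers unfamiliar with PRMs, the basic idea can be sketched in a few lines: sample random configurations, keep the collision-free ones, connect nearby samples whose straight-line path is also collision-free, and search the resulting graph. The sketch below is a generic textbook-style version for a point moving in the unit square among circular obstacles. The names `prm_plan`, `point_free`, and `segment_free`, and all parameter values, are assumptions made for illustration; it is not one of the PRM variants developed in the speaker's group.

```python
import math
import random
from collections import deque


def point_free(p, obstacles):
    """True if point p lies outside every (center, radius) obstacle."""
    return all(math.dist(p, center) > radius for center, radius in obstacles)


def segment_free(a, b, obstacles, step=0.01):
    """Check the straight segment a-b by testing evenly spaced points."""
    n = max(1, int(math.dist(a, b) / step))
    for i in range(n + 1):
        t = i / n
        p = (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)
        if not point_free(p, obstacles):
            return False
    return True


def prm_plan(start, goal, obstacles, n_samples=200, radius=0.2, seed=0):
    """Return a list of waypoints from start to goal, or None if none found."""
    if not (point_free(start, obstacles) and point_free(goal, obstacles)):
        return None
    rng = random.Random(seed)
    # Node 0 is the start, node 1 is the goal, the rest are random samples.
    nodes = [start, goal]
    for _ in range(n_samples):
        p = (rng.random(), rng.random())
        if point_free(p, obstacles):
            nodes.append(p)
    # Build the roadmap: connect pairs that are close and mutually visible.
    edges = {i: [] for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if (math.dist(nodes[i], nodes[j]) <= radius
                    and segment_free(nodes[i], nodes[j], obstacles)):
                edges[i].append(j)
                edges[j].append(i)
    # Breadth-first search from the start node to the goal node.
    parent = {0: None}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        if u == 1:
            path = []
            while u is not None:
                path.append(nodes[u])
                u = parent[u]
            return path[::-1]
        for v in edges[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


# Hypothetical scene: one circular obstacle in the middle of the unit square.
route = prm_plan((0.1, 0.1), (0.9, 0.9), [((0.5, 0.5), 0.2)])
```

Because the roadmap is built from random samples, a return value of None means only that this particular roadmap did not connect start and goal, not that no path exists; increasing `n_samples` or `radius` makes a connection more likely at a higher computational cost.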
Biography:
Nancy M. Amato is a professor of Computer Science at Texas A&M University.
She received B.S. and A.B. degrees in Mathematical Sciences and Economics,
respectively, from Stanford University, and M.S. and Ph.D. degrees in
Computer Science from UC Berkeley and the University of Illinois at
Urbana-Champaign, respectively. She was an AT&T Bell Laboratories PhD
Scholar, she is a recipient of a CAREER Award from the National Science
Foundation, and she is a Distinguished Lecturer for the IEEE Robotics and
Automation Society. She served as an Associate Editor of the IEEE
Transactions on Robotics and Automation and of the IEEE Transactions on
Parallel and Distributed Systems, she serves on review panels for NIH and
NSF, and she regularly serves on conference organizing and program
committees. She is a member of the Computing Research Association's
Committee on the Status of Women in Computing Research (CRA-W) and she
co-directs the CRA-W's Distributed Mentor Program
(http://www.cra.org/Activities/craw/dmp/).
Her main areas of research focus are motion planning, computational
biology and geometry, and high-performance computing. Current projects
include the development of a new technique for approximating protein
folding pathways and energy landscapes, and STAPL, a parallel C++ library
enabling the development of efficient, portable parallel programs.
2 p.m. Thursday November 9, 2006
Title: Finding Motifs in Sequence
Data: Application to Splice Site Prediction
Speaker: Rezarta Islamaj
Venue: Biomolecular Science
Building Room 3118
Abstract:
Sequence data in most domains contains useful 'signals' or features that
enable the correct construction of classification algorithms. Extracting
and interpreting these features is a difficult problem. In the first part
of the talk I will review our approach to feature generation in sequence
data. This is an integrated process, which allows us to systematically
search a large space of potential features. We show that predictive models
built using our feature generation algorithm for splice site prediction
achieve a significant improvement in accuracy over existing
state-of-the-art approaches.
Achieving good performance is one criterion for evaluation; a more
important aspect is understanding the signals which play the important
roles. In the second part of the talk I will present our ongoing work on
a feature browsing/visualization tool. With this tool the user can view and
explore different subsets of features that are generated by our method.
Each of the identified feature sets can be easily searched, ranked and
displayed. For each group, the user can browse the discovered clusters. We
show examples of the observed clusters and describe our preliminary
efforts to detect biological signals that may be important for the
splicing process.
2 p.m. Thursday November 30, 2006
Title: Investigations of
multipartite Rhizobiaceae genomes
Speaker: João Carlos Setubal,
Ph.D. (Virginia Bioinformatics Institute)
Venue: Biomolecular Science
Building Room 3118
Abstract:
The Rhizobiaceae are a subgroup of alphaproteobacteria that includes
the genera Agrobacterium and Rhizobium. Recently two new Rhizobium
genomes have been published (R. etli and R. leguminosarum). Two new
Agrobacterium genomes (A. vitis and A. radiobacter) will soon be
published. An interesting feature of the Rhizobiaceae is that it
includes both plant pathogens and symbionts. In addition, and perhaps
not coincidentally, these genomes have an unusual architecture, with
one chromosome and several large secondary replicons or
plasmids. Sequence comparison of these replicons shows that the
chromosomes share a clear backbone, but the evolutionary history of
the secondary large replicons is much less clear, and therefore
presents an interesting challenge for ancestral sequence
reconstruction. In this talk I will describe preliminary results on
these comparisons and discuss my attempts at inferring the
evolutionary events leading to the present genomic configuration.
Biography:
João Setubal is associate professor and deputy director at the
Virginia Bioinformatics Institute and associate professor in Virginia
Tech's Department of Computer Science. He received his Ph.D. in computer
science in 1992 from the University of Washington. Before joining VBI,
Setubal served as an assistant and associate professor at the University
of Campinas' Institute of Computing in Brazil from 1992 to 2004 and was a
visiting research scholar in the Department of Genome Sciences at the
University of Washington from 2000 to 2001.
Setubal's research interests are in the area of computational tools for
genome annotation and analysis. Since 1997, he has worked primarily in the
areas of bioinformatics support and analysis of bacterial genome projects,
including Xylella, Xanthomonas, Leptospira, and Leifsonia. He has led the
development of tools of various kinds, such as a genome contig scaffolder,
a bacterial genome annotation system, and database models for genomic
data. Some of his active projects include genome annotation for the
Agrobacterium and Azotobacter sequencing projects, and gene ontology term
development for plant-associated microbes. In addition, he leads efforts
for VBI's PATRIC (PathoSystems Resource Integration Center) project, a
large genomics database initiative funded by the National Institute of
Allergy and Infectious Diseases.
2 p.m. Thursday December 7, 2006
Title: Systems approach for
understanding cellular responses to gamma radiation in Halobacterium NRC-1
using Cytoscape
Speaker: Bo Liu
Venue: Biomolecular Science
Building Room 3118
Abstract:
Organisms of the phylogenetic domain Archaea are environmentally
ubiquitous and typically represent ~10% of the microbiota. However, in
extreme environments, such as high temperature or salinity, archaea
dominate the microbial population. Halobacterium sp. NRC-1 is highly
resistant to gamma radiation and is able to repair extensive double strand
DNA breaks (DSBs) in its genomic DNA produced by gamma radiation. But from
its genomic sequence and previous research, no novel proteins, factors or
pathways have been reported that may account for this unique property.
Systems approaches enable the elucidation of global physiological
responses to gamma radiation. We have attempted to address this issue
through a systems-level study of the Halobacterium NRC-1 response to gamma
radiation, using whole-genome mRNA microarray analysis and Cytoscape.
5:30 p.m. Tuesday December 12, 2006
Title: Provenance in Scientific
Workflows: ZOOM with user views
Speaker: Sarah Cohen-Boulakia
(University of Pennsylvania)
Venue: Computer Science
Instructional Center Room 3118
Abstract:
Scientific experiments are becoming increasingly large and complex,
with a commensurate increase in the amount and complexity of data
that is generated. Data, both intermediate and final results, is
derived by chaining and nesting together multiple database searches
and analytical tools. In many cases, the means by which the data are
produced is not known, making the data difficult to interpret and
the experiment impossible to reproduce. Provenance in scientific
workflows is thus of paramount importance. ZOOM*UserViews presents
a formal model of provenance for scientific workflows that is simple,
generic, and yet sufficiently expressive to answer questions of data
and step provenance that have been encountered in a large variety of
scientific case studies. In addition, ZOOM builds on the concept of
composite step-classes -- or sub-workflows -- which is present in
many scientific workflow systems to develop a notion of user views.
This talk discusses the design and implementation of ZOOM in the
context of queries encountered in a number of case studies and
posed by the first provenance challenge. We will show how user
views affect the level of granularity at which provenance information
can be seen and reasoned about.
Biography:
Sarah Cohen-Boulakia is a post-doctoral researcher at the University of
Pennsylvania where she works with Prof. Susan Davidson. She defended her
PhD in Computer Science in 2005, under the supervision of Prof. Ch.
Froidevaux at the Laboratoire de Recherche en Informatique, University of
Paris-Sud 11, France. Dr. Cohen-Boulakia's research interests are in the
design and application of integration systems dedicated to the biological and
biomedical domains. She is best known for her work on BioGuide and
techniques for helping biologists navigate the maze of biological
resources available over the web. In this work, she collaborates closely
with biologists, physicians, and computer scientists. More information at
http://www.seas.upenn.edu/~sarahcb.