CBCB Seminar Series

Fall 2006

2 p.m. Thursday August 31, 2006

Title: Organizational meeting
Speaker: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: To discuss the schedule for Fall 2006.

2 p.m. Thursday October 5, 2006

Title: Trees, graphs, and other visualizations for evolution and ecology
Speaker: Cynthia Sims Parr, Ph.D.
Venue: Biomolecular Science Building Room 3118

Interactive visualizations help biologists to conduct research at increasingly large scales. At the Human-Computer Interaction Laboratory we have been designing and testing new techniques for interacting with real datasets in a variety of domains. I focus here on datasets and tasks useful for evolutionary and computational ecology. TaxonTree uses animation and zooming to support incremental exploration and searching of very large (>100,000 node) phylogenetic and taxonomic trees. This application is currently used as a browsing interface to an online encyclopedia and is being extended to support browsing of multiple ontologies describing Lepidoptera relationships. DoubleTree couples navigation of two trees with different topologies to aid comparison of local branching differences. The metaphor "Plant a seed and watch it grow" guided our development of TreePlus, a graph visualization using an incremental tree-layout approach to support label-based exploration tasks in networks. TreePlus has been used with food web and gene ontology datasets. To manage and explore hundreds of mid-sized food web datasets, we developed EcoLens, later generalized as NetLens. These database visualizers provide multiple views onto highly complex datasets. Our "taskonomy" for graph visualization provides a framework for future tool development and comparison.

3:30 p.m. Tuesday October 10, 2006

Title: Improving motif finders with faster and more accurate E-value estimates
Speaker: Niranjan Nagarajan, Ph.D. (Cornell)
Venue: Biomolecular Science Building Room 3118

Motif finding programs have been widely studied in computational biology due to their application in a wide variety of sequence analysis tasks on genomic and proteomic data. Typically, motif finding programs such as CONSENSUS and MEME rely on optimizing an entropy score to find interesting motifs. In practice, the motifs are then evaluated by a measure of statistical significance such as an E-value to filter out false positives. We show that the approximations used for computing E-values in motif finders such as CONSENSUS and MEME can be quite far from the true values. We instead propose a new algorithm using Fourier transform-based techniques that can accurately and efficiently compute E-values. We then apply these techniques to several motif finders to show that optimizing E-values rather than the entropy score can significantly improve their performance. Extending this idea to other motif models and scoring functions is an interesting avenue for future research.

This talk is based on joint work with Uri Keich, Neil Jones and Patrick Ng.

(This is a postdoctoral candidate talk.)

2 p.m. Thursday October 12, 2006

Title: Genomic analysis of the biomass conversion systems of the marine bacterium Saccharophagus degradans 2-40
Speaker: Steven W. Hutcheson, Ph.D.
Venue: Biomolecular Science Building Room 3118

Saccharophagus degradans 2-40 (Sde2-40) is an aerobic, gamma subgroup proteobacterium (order, Alteromonadales) that can rapidly decompose diverse whole plant materials as well as cellulosic biomass in monoculture. It expresses multiple enzyme systems to degrade at least 11 different complex polysaccharides (CPs), including agar (agarose), alginate, cellulose, chitin, fucoidan, laminarin, mixed β-glucans, pectin, pullulan, starch and xylan. It also synthesizes several proteases and lipases. To identify the genes for the functional carbohydrases acting on these complex carbohydrates, the complete Sde2-40 genome sequence was determined by DOE-JGI. The 5.05 Mb genome encoded 4009 gene models, a comparatively low gene density for this group of bacteria. At least 111 gene models were identified that contained a homolog of a known glycoside hydrolase (GH) domain and/or a carbohydrate-binding module (CBM) typical of carbohydrases. Collectively, 31 different classes of GH domains were identified in the predicted carbohydrases. Through genetic, proteomic and biochemical analyses, functional elements of agarolytic, chitinolytic and cellulolytic systems have been characterized. Each of these environmentally regulated systems utilizes freely secreted and surface-associated enzymes to degrade the substrate and vector mono-, di-, and oligo-saccharide products to the cell through strategic placement of enzymes. Freely secreted enzymes of each system tend to be endo-acting enzymes with multiple CBMs. At least one enzyme in each degradative system appears to be an epicellular lipoprotein that has been demonstrated or is predicted to be an exo-acting enzyme. A phosphorylic pathway for cellulose degradation is proposed. Many of these enzymes appear to have been acquired by the naturally competent Sde2-40 through horizontal gene transfer or by domain shuffling. Several superintegrons and associated satellites were also identified in the genome that involve 200 kb or more of the genome and appear to consist of recently acquired DNA fragments.

6 p.m. Tuesday October 17, 2006

Title: NCBI's RefSeq and Entrez Gene: a case study
Speaker: Donna Maglott, Ph.D. (NIH/NLM/NCBI)
Venue: Computer Science Instructional Center Room 2118

NCBI's Reference Sequence (RefSeq) collection is designed to provide a set of standard, non-redundant sequences of genomes, RNAs and proteins of major research organisms. These sequences are annotated as appropriate with major features of interest, including genes, mRNAs, and coding regions. An early consequence of the RefSeq project, therefore, was the development of methods to identify and track genes and their attributes with each sequence update. First made public as LocusLink, gene-specific data are now reported through Entrez Gene. This talk will provide a brief history of RefSeq, LocusLink, and Entrez Gene. Current data flows will be discussed, including (1) gene definition from expressed sequences vs. from genomic annotation, (2) integration of gene-specific attributes from public databases, and (3) curation vs. computation.

http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

(This is a guest talk at CMSC828U.)


Dr. Maglott earned her Ph.D. in 1971 at the University of Michigan. Her dissertation was on the structure and function of the 50S subunit of the E. coli ribosome. After an extensive post-doc in early child care, she had an academic position at Howard University where she worked on the proteomics of that time, namely looking for changes in protein synthesis in early sea urchin development via 2D gel electrophoresis. In 1986, she accepted a position at the American Type Culture Collection in Rockville, MD, where she got in on the ground floor of database development supporting genomic research. Her major function there was to develop and maintain relational databases to describe molecular reagents (probes, vectors, recombinant hosts, clones) and their targets (genes, loci, and polymorphisms). In 1998 she joined the staff of NCBI, where she developed the databases to track the processing of sequences for the RefSeq project, and to capture gene-specific attributes. Although her primary responsibilities are currently Entrez Gene and RefSeq support, she also contributes to NCBI's genome annotation pipelines, Map Viewer, and the hosting of OMIM.

2 p.m. Thursday October 19, 2006

Title: Programmed ribosomal frameshifting: it's not just for viruses any more
Speaker: Jonathan Dinman, Ph.D.
Venue: Biomolecular Science Building Room 3118

In viruses, programmed -1 ribosomal frameshifting (-1 PRF) signals direct the translation of alternative proteins from a single mRNA. Given that many basic regulatory mechanisms were first discovered in viral systems, the current study endeavored to: 1) identify -1 PRF signals in genomic databases, 2) apply the protocol to the yeast genome, and 3) test selected candidates at the bench. Computational analyses revealed the presence of 10,340 consensus -1 PRF signals in the yeast genome. Of the 6,353 yeast ORFs, 1,275 contain at least one strong and statistically significant -1 PRF signal. Eight out of nine selected sequences promoted efficient levels of PRF in vivo. These findings provide a robust platform for high throughput computational and laboratory studies and demonstrate that functional -1 PRF signals are widespread in the genome of S. cerevisiae. The data generated by this study have been deposited into a publicly available database called the PRFdb. The presence of stable mRNA pseudoknot structures in these -1 PRF signals, and the observation that the predicted outcomes of nearly all of these genomic frameshift signals would direct ribosomes to premature termination codons, suggest two possible mRNA destabilization pathways through which -1 PRF signals could post-transcriptionally regulate mRNA abundance.

2 p.m. Thursday October 26, 2006

Title: Designing Tools for cDNA-to-Genome Alignment
Speaker: Liliana Florea, Ph.D. (George Washington University)
Venue: Biomolecular Science Building Room 3118

Accurately and efficiently aligning cDNA sequences to a whole genome, either from the same species or from a close relative, is a critical component of any gene annotation project. We start by presenting our work in designing cDNA-to-genome alignment programs to address these needs. One important choice in designing alignment programs is the selection of seeds, with spaced seeds recently emerging as the primary vehicle for increasing alignment sensitivity. We describe our preliminary efforts in selecting mathematically sensitive and specific spaced seeds, starting from codon and mutation-sensitive models of alignments and sequences, and suggest how they can be used to increase the accuracy of cDNA-to-genome alignment programs.


Dr. Liliana Florea is an Assistant Professor in the Computer Science Department at the George Washington University, with a specialty in Computational Biology and Bioinformatics, and a member of the Biochemistry Department faculty at the GWU Medical School. Prior to joining the George Washington University in 2005 she was a Senior Scientist at Celera Genomics and Applied Biosystems. Her research and interests revolve around applying sequence analysis techniques to genome comparison, automatic gene annotation, comparative genomics, analysis of alternative splicing and its regulation, and computational vaccine design. She holds a Ph.D. degree in Computer Science and Engineering from Penn State University (2000).

2 p.m. Thursday November 2, 2006

Title: Randomized Motion Planning: From Intelligent CAD to Computer Animation to Protein Folding
Speaker: Nancy Amato, Ph.D. (Texas A&M)
Venue: A.V. Williams Building Room 3258

Motion planning arises in many application domains such as computer animation (digital actors), mixed reality systems and intelligent CAD (virtual prototyping and training), and even computational biology and chemistry (protein folding and drug design). Surprisingly, a single class of planners, called probabilistic roadmap methods (PRMs), has proven effective on problems from all these domains. Strengths of PRMs, in addition to versatility, are simplicity and efficiency, even in high-dimensional configuration spaces.

In this talk, we describe the PRM framework and give an overview of several PRM variants developed in our group. We describe in more detail our work related to virtual prototyping, computer animation, and protein folding. For virtual prototyping, we show that in some cases a hybrid system incorporating both an automatic planner and haptic user input leads to superior results. For computer animation, we describe new PRM-based techniques for planning sophisticated group behaviors such as flocking and herding. Finally, we describe our application of PRMs to simulate molecular motions, such as protein and RNA folding. More information regarding our work, including movies, can be found at http://parasol.tamu.edu/~amato/.


Nancy M. Amato is a professor of Computer Science at Texas A&M University. She received B.S. and A.B. degrees in Mathematical Sciences and Economics, respectively, from Stanford University, and M.S. and Ph.D. degrees in Computer Science from UC Berkeley and the University of Illinois at Urbana-Champaign, respectively. She was an AT&T Bell Laboratories PhD Scholar, is a recipient of a CAREER Award from the National Science Foundation, and is a Distinguished Lecturer for the IEEE Robotics and Automation Society. She served as an Associate Editor of the IEEE Transactions on Robotics and Automation and of the IEEE Transactions on Parallel and Distributed Systems, serves on review panels for NIH and NSF, and regularly serves on conference organizing and program committees. She is a member of the Computing Research Association's Committee on the Status of Women in Computing Research (CRA-W) and she co-directs the CRA-W's Distributed Mentor Program (http://www.cra.org/Activities/craw/dmp/).

Her main areas of research focus are motion planning, computational biology and geometry, and high-performance computing. Current projects include the development of a new technique for approximating protein folding pathways and energy landscapes, and STAPL, a parallel C++ library enabling the development of efficient, portable parallel programs.

2 p.m. Thursday November 9, 2006

Title: Finding Motifs in Sequence Data: Application to Splice Site Prediction
Speaker: Rezarta Islamaj
Venue: Biomolecular Science Building Room 3118

Sequence data in most domains contains useful 'signals' or features that enable the correct construction of classification algorithms. Extracting and interpreting these features is a difficult problem. In the first part of the talk I will review our approach to feature generation in sequence data. This is an integrated process, which allows us to systematically search a large space of potential features. We show that predictive models built using our feature generation algorithm for splice site prediction achieve a significant improvement in accuracy over existing state-of-the-art approaches.

Achieving good performance is one criterion for evaluation; a more important aspect is understanding the signals that play the important roles. In the second part of the talk I will present our ongoing work on a feature browsing/visualization tool. With this tool the user can view and explore different subsets of features that are generated by our method. Each of the identified feature sets can be easily searched, ranked and displayed. For each group, the user can browse the discovered clusters. We show examples of the observed clusters and describe our preliminary efforts to detect biological signals that may be important for the splicing process.

2 p.m. Thursday November 30, 2006

Title: Investigations of multipartite Rhizobiaceae genomes
Speaker: João Carlos Setubal, Ph.D. (Virginia Bioinformatics Institute)
Venue: Biomolecular Science Building Room 3118

The Rhizobiaceae are a subgroup of alphaproteobacteria that includes the genera Agrobacterium and Rhizobium. Recently two new Rhizobium genomes have been published (R. etli and R. leguminosarum). Two new Agrobacterium genomes (A. vitis and A. radiobacter) will soon be published. An interesting feature of the Rhizobiaceae is that they include both plant pathogens and symbionts. In addition, and perhaps not coincidentally, these genomes have an unusual architecture, with one chromosome and several large secondary replicons or plasmids. Sequence comparison of these replicons shows that the chromosomes share a clear backbone, but the evolutionary history of the secondary large replicons is much less clear, and therefore presents an interesting challenge for ancestral sequence reconstruction. In this talk I will describe preliminary results on these comparisons and discuss my attempts at inferring the evolutionary events leading to the present genomic configuration.


João Setubal is associate professor and deputy director at the Virginia Bioinformatics Institute and associate professor in Virginia Tech's Department of Computer Science. He received his Ph.D. in computer science in 1992 from the University of Washington. Before joining VBI, Setubal served as an assistant and associate professor at the University of Campinas' Institute of Computing in Brazil from 1992 to 2004 and was a visiting research scholar in the Department of Genome Sciences at the University of Washington from 2000 to 2001.

Setubal's research interests are in the area of computational tools for genome annotation and analysis. Since 1997, he has worked primarily in the areas of bioinformatics support and analysis of bacterial genome projects, including Xylella, Xanthomonas, Leptospira, and Leifsonia. He has led the development of tools of various kinds, such as a genome contig scaffolder, a bacterial genome annotation system, and database models for genomic data. Some of his active projects include genome annotation for the Agrobacterium and Azotobacter sequencing projects, and gene ontology term development for plant-associated microbes. In addition, he leads efforts for VBI's PATRIC (PathoSystems Resource Integration Center) project, a large genomics database initiative funded by the National Institute of Allergy and Infectious Diseases.

2 p.m. Thursday December 7, 2006

Title: Systems approach for understanding cellular responses to gamma radiation in Halobacterium NRC-1 using Cytoscape
Speaker: Bo Liu
Venue: Biomolecular Science Building Room 3118

Organisms of the phylogenetic domain Archaea are environmentally ubiquitous and typically represent ~10% of the microbiota. However, in extreme environments, such as those with high temperature or salinity, archaea dominate the microbial population. Halobacterium sp. NRC-1 is highly resistant to gamma radiation and is able to repair extensive double-strand DNA breaks (DSBs) in its genomic DNA produced by gamma radiation. But from its genomic sequence and previous research, no novel proteins, factors or pathways have been reported that may account for this unique property. Systems approaches enable the elucidation of global physiological responses to gamma radiation. We have attempted to address this issue through a systems-level study of the Halobacterium NRC-1 response to gamma radiation, using whole-genome mRNA microarray analysis and Cytoscape.

5:30 p.m. Tuesday December 12, 2006

Title: Provenance in Scientific Workflows: ZOOM with user views
Speaker: Sarah Cohen-Boulakia (University of Pennsylvania)
Venue: Computer Science Instructional Center Room 3118

Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data that is generated. Data, both intermediate and final results, is derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance. ZOOM*UserViews presents a formal model of provenance for scientific workflows that is simple, generic, and yet sufficiently expressive to answer questions of data and step provenance that have been encountered in a large variety of scientific case studies. In addition, ZOOM builds on the concept of composite step-classes -- or sub-workflows -- which is present in many scientific workflow systems to develop a notion of user views. This talk discusses the design and implementation of ZOOM in the context of queries encountered in a number of case studies and posed by the first provenance challenge. We will show how user views affect the level of granularity at which provenance information can be seen and reasoned about.


Sarah Cohen-Boulakia is a post-doctoral researcher at the University of Pennsylvania where she works with Prof. Susan Davidson. She defended her PhD in Computer Science in 2005, under the supervision of Prof. Ch. Froidevaux at the Laboratoire de Recherche en Informatique, University of Paris-Sud 11, France. Dr. Cohen-Boulakia's research interests are in the design and application of integration systems dedicated to the biological and biomedical domains. She is best known for her work on BioGuide and techniques for helping biologists navigate the maze of biological resources available over the web. In this work, she collaborates closely with biologists, physicians, and computer scientists. More information at http://www.seas.upenn.edu/~sarahcb.