CBCB Seminar Series

Fall 2008

2 p.m. Thursday September 4, 2008

Title: Organizational meeting
By: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science Building Room 3118
Abstract: To discuss the schedule in Fall 2008.

2 p.m. Thursday September 18, 2008

Title: Assembly and mapping of Bos taurus genome
By: Aleksey Zimin
Venue: Biomolecular Science Building Room 3118
The genome of the cow (Bos taurus) was recently sequenced and assembled by the Baylor College of Medicine (BCM) Human Genome Sequencing Center (HGSC). BCM's latest draft is called Btau4.0. We produced an alternative version of the assembly from the public data using UMD Overlapper and Celera Assembler 4. Our latest draft is called UMD_Freeze_1.7. UMD_Freeze_1.7 places almost 7% more sequence onto the chromosomes and fixes a number of large inversions/omissions that are present in Btau4.0 and were independently verified by our collaborators. In this talk I will describe the procedures used to create the assembly and to map the assembled scaffolds onto the chromosomes, and I will present a comparison of the two draft assemblies.

2 p.m. Thursday October 16, 2008

Title: An Overview of Bioinformatics Research at Humboldt
By: Silke Trissl (Humboldt-Universität zu Berlin)
Venue: Biomolecular Science Building Room 3118

The Knowledge Management Group in the Computer Science Department at Humboldt-University, Berlin is headed by Ulf Leser. The group has several research interests:
  • Data integration - Columba and Aladin
  • Text mining - Alibaba
  • Querying biological networks - GRIPP
  • Protein function prediction using biological networks
  • Phylogeny of languages
In the talk I will give a brief overview of the projects running at the Knowledge Management Group. I will go into depth on two main research interests - data integration and protein function prediction.

Data integration

Columba is a database that integrates data about protein structures. When a user poses a query, she might expect the search results to be ranked. In Columba we use a ranking that is based on the following observation: different sources contain information about the same types of biological facts. For example, the data sources KEGG, aMAZE, and Reactome all contain information about metabolic pathways. These three data sources overlap to a certain degree, but also contain diverse data. Querying all three will return results supported by one, two, or all three data sources. We call a set of data sources with similar content a dimension. In a setting with many dimensions the ranking of search results is therefore important. We developed two scores, the confidence score and the surprisingness score, to rank search results (work presented at DILS 2007).

Protein function prediction

The second point I will focus on is protein function prediction using biological networks and text mining methods. Protein-protein interaction data are known for human, mouse, yeast, fruit fly, and Arabidopsis. Finding common subgraphs in those networks allows the functions of a protein in one network to be transferred to the corresponding protein in the other network. Samira Jaeger, a PhD student in our group, showed that prediction accuracy is better when the network information is used in addition to sequence information than when sequence information is used alone. To enhance the precision of the results, the group furthermore implemented a procedure that validates all predictions against findings reported in the literature (work presented at DILS 2008).


Silke Trissl is a PhD student at Humboldt-University Berlin, Germany. Her supervisor is Ulf Leser, who holds the chair for Knowledge Management in Bioinformatics in the Computer Science Department. From 1996 to 2001 Silke studied Biotechnology at the University of Applied Sciences in Weihenstephan, Germany, graduating with a diploma. In 2002 she received an MSc in Bioinformatics from the University of Ulster, UK. She joined the group of Professor Leser in 2003 and has worked on two projects, Columba and graph querying.

2 p.m. Thursday October 23, 2008

Title: Uncovering reassortments among I$
By: Niranjan Nagarajan, Ph.D.
Venue: Biomolecular Science Building Room 3118

The Influenza genome is divided into 8 segments, and co-infection of a host by multiple strains can lead to a novel reassorted strain with segments borrowed from different parts of the evolutionary tree. This jump in evolution has been implicated in the creation of several pandemic strains, such as the Hong Kong flu of 1968. Routine sequencing of newly collected Influenza strains has led to a wealth of genomic information; the detection of reassortant strains, however, still relies on heuristic analysis of segment phylogenies. Here we present an automated approach to reliably discover reassortment events, even in the presence of uncertain phylogeny. Our approach is based on translating the problem into one of finding large bicliques in a bipartite graph, and in this context we present the first quadratic delay algorithm for enumerating all maximal bicliques in a bipartite graph. Experiments on real and simulated datasets have shown the utility of this approach in detecting known as well as subtle novel reassortment candidates.

This is joint work with Carl Kingsford.
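
For readers unfamiliar with the term, a maximal biclique is a pair of node sets, one from each side of a bipartite graph, in which every cross pair is connected and to which no further node can be added. The sketch below enumerates maximal bicliques by brute force on a toy graph. It is only an illustration of the definition: it is exponential in the number of left-side nodes and is not the quadratic delay algorithm presented in the talk. The function name and the node names are hypothetical.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    adj maps each left-side node to the set of right-side nodes it is
    connected to. Runtime is exponential in the number of left nodes,
    so this is suitable for toy inputs only.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right nodes adjacent to every left node in the subset.
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Close the left side: all left nodes adjacent to all of `common`.
            closed = frozenset(u for u in left if common <= adj[u])
            found.add((closed, frozenset(common)))
    return found


if __name__ == "__main__":
    # Hypothetical toy graph: left nodes s1..s3, right nodes t1..t3.
    toy = {"s1": {"t1", "t2"}, "s2": {"t1", "t2"}, "s3": {"t2", "t3"}}
    for left_side, right_side in maximal_bicliques(toy):
        print(sorted(left_side), sorted(right_side))
```

Taking the common neighbourhood of a left subset and then closing the left side guarantees that neither side can be extended, which is exactly the maximality condition.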

2 p.m. Thursday October 30, 2008

Title: "Bowtie: A Highly Scalable Tool for Post-Genomics Datasets"
By: Ben Langmead
Venue: Biomolecular Science Building Room 3118

Improvements in DNA sequencing have broadened its applications and increased the size of sequencing datasets. As costs decrease and adoption increases, demand for more sequencing data is likely to grow multiplicatively. Cases in point are the 1000 Genomes and Human Microbiome Projects. Multiplicative dataset growth creates a grave need for scalable algorithms to extract biological evidence from sequencing data. I will discuss Bowtie, a tool developed at CBCB that applies a novel, scalable algorithm to rapidly align short reads to mammalian genomes for resequencing. We recently used Bowtie to align 14.3x coverage worth of human Illumina reads from the 1000 Genomes project in a single overnight run (14 hours) on a PC with 4 processor cores. I will also present an idea for how Bowtie's technique can be applied to indexing and querying large collections of metagenomic data, as needed by the Human Microbiome Project.

2 p.m. Thursday November 6, 2008

Title: Evaluating Relevance Ranking and Query Expansion for MEDLINE Retrieval
By: Zhiyong Lu, Ph.D. (NIH/NLM/NCBI)
Venue: Biomolecular Science Building Room 3118

A number of techniques have been researched over the last 40 years in efforts to improve retrieval effectiveness in the field of information retrieval (IR). We present our investigation of two specific influential techniques: relevance ranking and query expansion. Systems that employ relevance ranking techniques attempt to sort retrieved documents based on such measures as term weighting and term proximity, so that more relevant documents are displayed earlier. Query expansion refers to the process of reformulating a query to improve retrieval performance, typically by including additional synonymous terms. Although both techniques are traditionally known to be effective in the general IR domain, no prior work exists with regard to their value in the context of MEDLINE retrieval. We performed two separate evaluation studies, one for each technique. Both studies were conducted using the 2006 and 2007 TREC Genomics Tracks data, comprising real biological questions and independently judged relevant documents. Based on our experimental results, we conclude that both techniques can yield better results in terms of selected IR measures such as mean rank precisions. However, this type of improvement may not prove highly useful for those users looking only at top ranked returned documents.
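
The two techniques named in the abstract can be illustrated with a minimal sketch. It assumes a simple TF-IDF term weighting for relevance ranking and a hand-written synonym table for query expansion; neither is claimed to be the scheme evaluated in the talk, and all function names, documents, and synonyms here are hypothetical.

```python
import math
from collections import Counter


def expand_query(terms, synonyms):
    """Return the query terms followed by any synonyms not already present."""
    expanded = list(terms)
    for term in terms:
        for synonym in synonyms.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def rank(docs, query_terms):
    """Rank document ids by a simple TF-IDF score; unmatched docs are dropped."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = sum(
            term_freq[t] * math.log(1 + n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    docs = {
        "d1": "tumor suppressor gene mutation",
        "d2": "neoplasm growth and gene expression",
        "d3": "protein folding kinetics",
    }
    synonyms = {"tumor": ["neoplasm"]}  # hypothetical synonym table
    print(rank(docs, ["tumor"]))
    print(rank(docs, expand_query(["tumor"], synonyms)))
```

In this toy collection the expanded query retrieves a document that uses a synonym of the original term, which the unexpanded query misses.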


Zhiyong Lu is a Staff Scientist at the National Center for Biotechnology Information, which he joined right after earning a PhD in Bioinformatics at the University of Colorado School of Medicine. His current research focuses on helping researchers find the specific publications that are relevant to their work and, once those documents are found, on making that (sometimes very large) body of text manageable for them. He has published on these subjects as well as on matters related to automatically predicting protein subcellular localization.

2 p.m. Thursday November 13, 2008

Title: Mining Complexity: An analysis o$scales.
By: Tara Gianoulis
Venue: Biomolecular Science Building Room 3118

2 p.m. Thursday November 20, 2008

Title: Autism & Vaccines: How Bad Science Confuses the Press & Harms the Public
By: Steven Salzberg, Ph.D.
Venue: Biomolecular Science Building Room 3118

Ten years ago, an article appeared in the medical journal The Lancet that suggested a link between autism and the vaccine for measles, mumps, and rubella. The article was widely cited in the popular press in England, and vaccination rates began to fall. Further investigations revealed that the data in the study had been manipulated, and that the principal scientist had a major conflict of interest, with the result that 10 of his 12 co-authors repudiated the study's findings.

Numerous scientific studies since 1998, all done in response to the original Lancet article, have failed to find any link between autism and vaccines. Despite this, a few scientists and doctors continue to push the connection, often accompanying their claims with promises of "alternative" treatments for autism. The press keeps the issue alive by reporting "the controversy," often accompanying their reports with emotional testimonials from parents, including several celebrities. As a consequence of this publicity, vaccination rates are now falling in the United States, leading to alarming new outbreaks of diseases.

Scientists and skeptics need to act to quell the rumors and educate the public, so that vaccines, one of the greatest medical successes in history, remain an effective tool in our fight against disease.

This talk was also presented to the National Capital Area Skeptics on Nov. 8.

2 p.m. Thursday December 11, 2008

Title: Metagenomics research at the CBCB
By: Mihai Pop, Ph.D.
Venue: Biomolecular Science Building Room 3118

Metagenomics is a new scientific field aimed at uncovering the diversity of microbial life on Earth and the contributions of microbes to our health and to the health of our environment. In my talk I will present several of our recent results on the development of new analysis tools for metagenomic data. I will also provide a survey of other ongoing research on metagenomics at the CBCB and highlight future plans.

2 p.m. Friday December 12, 2008

The Dissertation Defense for the Degree of Ph.D.
Title: A framework for discovering meaningful associations in the annotated life sciences web
By: Woei-jyh (Adam) Lee
Venue: Biomolecular Science Building Room 3118

During the last decade, life sciences researchers have gained access to the entire human genome, reliable high-throughput biotechnologies, affordable computational resources, and public network access. This has produced vast amounts of data and knowledge captured in the life sciences Web, and has created the need for new tools to analyze this knowledge and make discoveries. Consider a simplified Web of three publicly accessible data resources: Entrez Gene, PubMed, and OMIM. Data records in each resource are annotated with terms from multiple controlled vocabularies (CVs). The links between data records in two resources form a relationship between the two resources. Thus, a record in Entrez Gene, annotated with GO terms, can have links to multiple records in PubMed that are annotated with MeSH terms. Similarly, OMIM records annotated with terms from SNOMED CT may have links to records in Entrez Gene and PubMed. This forms a rich web of annotated data records.

The objective of this research is to develop the Life Science Link (LSLink) methodology and tools to discover meaningful patterns across resources and CVs. In a first step, we execute a protocol to follow links, extract annotations, and generate datasets of termlinks, which consist of data records and CV terms. We then mine the termlinks of the datasets to find potentially meaningful associations between pairs of terms from two CVs. Biologically meaningful associations of pairs of CV terms may yield innovative nuggets of previously unknown knowledge. Moreover, the bridge of associations across CV terms will reflect the practice of how scientists annotate data across linked data repositories. Contributions include a methodology to create background datasets, metrics for mining patterns, applying semantic knowledge for generalization, tools for discovery, and validation with biological use cases.

Inspired by research in association rule mining and linkage analysis, we develop two metrics to determine support and confidence scores in the associations of pairs of CV terms. Associations that have a statistically significant high score and are biologically meaningful may lead to new knowledge. To further validate the support and confidence metrics, we develop a secondary test for significance based on the hypergeometric distribution. We also exploit the semantics of the CVs. We aggregate termlinks over siblings of a common parent CV term and use them as additional evidence to boost the support and confidence scores in the associations of the parent CV term. We provide a simple discovery interface where biologists can review associations and their scores. Finally, a cancer informatics use case validates the discovery of associations between human genes and diseases.
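
As a rough illustration of the scoring described above, the sketch below computes support and confidence for a pair of CV terms and a hypergeometric tail probability as a secondary significance check. It assumes the textbook association-rule definitions (support = fraction of termlinks carrying both terms, confidence = fraction of termlinks with the first term that also carry the second); the dissertation's actual metrics may differ. The termlink representation, function names, and term labels are hypothetical.

```python
from math import comb


def pair_counts(termlinks, term_a, term_b):
    """Count termlinks overall, with term_a, with term_b, and with both.

    Each termlink is a pair (terms_from_cv1, terms_from_cv2) of sets.
    """
    n = len(termlinks)
    with_a = sum(1 for cv1, _ in termlinks if term_a in cv1)
    with_b = sum(1 for _, cv2 in termlinks if term_b in cv2)
    both = sum(1 for cv1, cv2 in termlinks if term_a in cv1 and term_b in cv2)
    return n, with_a, with_b, both


def support_confidence(n, with_a, with_b, both):
    """Textbook association-rule support and confidence for term_a -> term_b."""
    support = both / n if n else 0.0
    confidence = both / with_a if with_a else 0.0
    return support, confidence


def hypergeom_tail(n, with_a, with_b, both):
    """P(X >= both) when with_a records are drawn without replacement
    from n records of which with_b carry term_b."""
    upper = min(with_a, with_b)
    favourable = sum(
        comb(with_b, k) * comb(n - with_b, with_a - k)
        for k in range(both, upper + 1)
    )
    return favourable / comb(n, with_a)


if __name__ == "__main__":
    # Hypothetical termlinks: (terms from one CV, terms from another CV).
    termlinks = [
        ({"go_x"}, {"mesh_y"}),
        ({"go_x"}, {"mesh_y"}),
        ({"go_x"}, {"mesh_z"}),
        ({"go_w"}, {"mesh_y"}),
        ({"go_w"}, {"mesh_z"}),
        ({"go_w"}, {"mesh_z"}),
    ]
    counts = pair_counts(termlinks, "go_x", "mesh_y")
    print(support_confidence(*counts), hypergeom_tail(*counts))
```

A small tail probability indicates that the two terms co-occur more often than would be expected if annotations were assigned independently.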

Examining Committee:

Committee Chair: Dr. Louiqa Raschid
Dean's Representative: Dr. Stephen M. Mount
Committee Members: Dr. Mihai Pop, Dr. Carl Kingsford, Dr. Jimmy Lin