CBCB Seminar Series
Fall 2008
2 p.m. Thursday September 4, 2008
Title: Organizational meeting
By: Stephen M. Mount, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract: To discuss the schedule
for Fall 2008.
2 p.m. Thursday September 18, 2008
Title: Assembly and mapping of
Bos taurus genome
By: Aleksey Zimin
Venue: Biomolecular Science
Building Room 3118
Abstract:
The genome of the cow (Bos taurus) was recently sequenced and assembled
by the Baylor College of Medicine (BCM) Human Genome Sequencing Center (HGSC).
BCM's latest draft is called Btau4.0. We produced an alternative
version of the assembly from the public data using UMD Overlapper and
Celera Assembler 4. Our latest draft is called UMD_Freeze_1.7.
UMD_Freeze_1.7 places almost 7% more sequence onto the chromosomes and
fixes a number of large inversions/omissions that are present in Btau4.0
and are independently verified by our collaborators. In this talk the
procedures used to create the assembly and map the assembled scaffolds
onto the chromosomes will be described and a comparison of the two draft
assemblies will be presented.
2 p.m. Thursday October 16, 2008
Title: An Overview of
Bioinformatics Research at Humboldt
By: Silke Trissl
(Humboldt-Universität zu Berlin)
Venue: Biomolecular Science
Building Room 3118
Abstract:
The Knowledge Management Group in the Computer Science Department at
Humboldt-University, Berlin is headed by Ulf Leser. The group has several
research interests:
- Data integration - Columba and Aladin
- Text mining - Alibaba
- Querying biological networks - GRIPP
- Protein function prediction using biological networks
- Phylogeny of languages
In the talk I will give a brief overview of the projects running at the
Knowledge Management Group. I will go into depth on two main research
interests: data integration and protein function prediction.
Data integration
Columba is a database that integrates data about protein structures. When
a user poses a query, she might expect the search results to be ranked. In
Columba we use a ranking that is based on the following observation:
different sources contain information about the same types of biological
facts. For example the data sources KEGG, aMAZE, and Reactome contain
information about metabolic pathways. These three data sources overlap to
a certain degree, but also contain diverse data. Querying all three data
sources returns results supported by one, two, or all three data
sources. We call a group of data sources with similar content a dimension.
In a setting with many dimensions the ranking of search results is therefore
important. We developed two scores, the confidence score and the
surprisingness score, to rank search results (work presented at DILS 2007).
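As a rough illustration of the observation above, the following minimal sketch ranks results by how many data sources within one dimension report them. It is not the confidence or surprisingness score, neither of which this abstract defines; the function name `rank_by_source_support` and the pathway identifiers are hypothetical.

```python
from collections import defaultdict


def rank_by_source_support(results_by_source):
    """Rank results by the number of sources that report them.

    results_by_source maps a source name to the set of result
    identifiers it returned. This is an illustrative stand-in,
    not the scoring used in Columba.
    """
    support = defaultdict(set)
    for source, results in results_by_source.items():
        for result in results:
            support[result].add(source)
    # Most widely supported results first; ties broken by identifier.
    return sorted(
        ((result, sorted(sources)) for result, sources in support.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Hypothetical pathway identifiers returned by three overlapping sources.
example = {
    "KEGG": {"glycolysis", "tca_cycle", "urea_cycle"},
    "aMAZE": {"glycolysis", "tca_cycle"},
    "Reactome": {"glycolysis", "apoptosis"},
}
ranked = rank_by_source_support(example)
for result, sources in ranked:
    print(result, len(sources), sources)
```

A result reported by all three sources sorts ahead of one reported by a single source; a real ranking would also have to weigh how unexpected a result is.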
Protein function prediction
The second point I will focus on is protein function prediction using
biological networks and text mining methods. Protein-protein interaction
data are available for human, mouse, yeast, fruit fly, and Arabidopsis.
Finding common subgraphs in those networks allows the functions of a
protein in one network to be transferred to the corresponding protein in
the other network. By also using the network information, Samira Jaeger,
a PhD student in our group, showed that prediction accuracy is better than
with sequence information alone. To enhance the precision of the results,
they furthermore implemented a procedure that validates all predictions
based on findings reported in the literature (work presented at DILS 2008).
Biography:
Silke Trissl is a PhD student at Humboldt-University Berlin, Germany. Her
supervisor is Ulf Leser, who holds the chair for Knowledge Management in
Bioinformatics in the Computer Science Department. From 1996 to 2001 Silke
studied Biotechnology at the University of Applied Sciences in
Weihenstephan, Germany, which she finished with a diploma. In 2002 she
received an MSc in Bioinformatics at the University of Ulster, UK. She
joined the group of Professor Leser in 2003 and worked on two projects,
Columba and graph querying.
2 p.m. Thursday October 23, 2008
Title: Uncovering reassortments
among I$
By: Niranjan
Nagarajan, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
The Influenza genome is divided into 8 segments and co-infection of a host
by multiple strains can lead to a novel reassorted strain with segments
borrowed from different parts of the evolutionary tree. This jump in
evolution has been implicated in the creation of several pandemic strains
such as the Hong Kong flu of 1968. Routine sequencing of newly collected
Influenza strains has led to a wealth of genomic information; however, the
detection of reassortant strains has still relied on heuristic
analysis of segment phylogenies. Here we present an automated approach to
reliably discover reassortment events, even in the presence of uncertain
phylogeny. Our approach is based on translating the problem into one of
finding large bicliques in a bipartite graph and in this context we
present the first quadratic delay algorithm for enumerating all maximal
bicliques in a bipartite graph. Experiments on real and simulated datasets
have shown the utility of this approach in detecting known as well as
subtle novel reassortment candidates.
This is joint work with Carl Kingsford.
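To make the notion of a maximal biclique concrete, here is a minimal brute-force sketch for small bipartite graphs. It is exponential in the number of left vertices and is not the quadratic-delay algorithm from the talk; the function name, the graph representation, and the example graph are assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate all maximal bicliques of a small bipartite graph.

    adj maps each left vertex to the set of right vertices adjacent
    to it. Brute force over subsets of left vertices: illustrative
    only, not an efficient enumeration algorithm.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices adjacent to every chosen left vertex.
            common_right = set.intersection(*(adj[u] for u in subset))
            if not common_right:
                continue
            # Extend the left side to every vertex adjacent to all of
            # common_right; the resulting pair cannot be enlarged.
            closed_left = frozenset(u for u in left if common_right <= adj[u])
            found.add((closed_left, frozenset(common_right)))
    return found


# Hypothetical bipartite graph: left vertices a, b, c; right vertices 1, 2, 3.
graph = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3}}
bicliques = maximal_bicliques(graph)
for left_side, right_side in sorted(bicliques, key=lambda p: sorted(p[0])):
    print(sorted(left_side), sorted(right_side))
```

Each returned pair is closed in both directions: the right side is exactly the common neighbourhood of the left side and vice versa, which is what makes the biclique maximal.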
2 p.m. Thursday October 30, 2008
Title: "Bowtie: A Highly
Scalable Tool for Post-Genomics Datasets"
By: Ben Langmead
Venue: Biomolecular Science
Building Room 3118
Abstract:
Improvements in DNA sequencing have broadened its applications and
increased the size of sequencing datasets. As costs decrease and adoption
increases, demand for more sequencing data is likely to grow
multiplicatively. Cases in point are the 1000-Genomes and Human
Microbiome Projects. Multiplicative dataset growth creates a grave need
for scalable algorithms to extract biological evidence from sequencing
data. I will discuss Bowtie, a tool developed at CBCB that applies a
novel, scalable algorithm to rapidly align short reads to mammalian
genomes for resequencing. We recently used Bowtie to align 14.3x coverage
worth of human Illumina reads from the 1000 Genomes project in a single
overnight run (14 hours) on a PC with 4 processor cores. I will also present
an idea for how Bowtie's technique can be applied to indexing and querying
large collections of metagenomic data, as needed by the Human Microbiome
Project.
2 p.m. Thursday November 6, 2008
Title: Evaluating Relevance
Ranking and Query Expansion for MEDLINE Retrieval
By: Zhiyong Lu, Ph.D.
(NIH/NLM/NCBI)
Venue: Biomolecular Science
Building Room 3118
Abstract:
A number of techniques have been researched over the last 40 years in
efforts to improve retrieval effectiveness in the field of information
retrieval (IR). We present our investigation of two specific influential
techniques: relevance ranking and query expansion. Systems that employ
relevance ranking techniques attempt to sort retrieved documents based on
such measures as term weighting and term proximity to display more
relevant documents earlier. Query expansion refers to the process of
reformulating a query to improve retrieval performance, typically by
including additional synonymous terms. Although both techniques are
traditionally known to be effective in the general IR domain, no prior
work exists with regard to their value in the context of MEDLINE
retrieval. We performed two separate evaluation studies, one for each
technique. Both studies were conducted using the 2006 and 2007 TREC
Genomics Tracks data comprising real biological questions and
independently judged relevant documents. Based on our experimental
results, we conclude that both techniques can yield better results in
terms of selected IR measures such as mean rank precisions. However, this
type of improvement may not prove highly useful for those users looking
only at top ranked returned documents.
Biography:
Zhiyong Lu is a Staff Scientist at the National Center for Biotechnology
Information, which he joined right after earning a PhD in Bioinformatics
at the University of Colorado School of Medicine. His current research has
focused on the problems of helping researchers find the specific
publications that are relevant to their work, and having found those
documents, then making that (sometimes very large) body of text manageable
for them. He has published on these subjects as well as on matters related
to automatically predicting protein subcellular localization.
2 p.m. Thursday November 13, 2008
Title: Mining Complexity: An
analysis o$scales.
By: Tara Gianoulis
Venue: Biomolecular Science
Building Room 3118
2 p.m. Thursday November 20, 2008
Title: Autism & Vaccines: How Bad Science Confuses the Press &
Harms the Public
By: Steven Salzberg, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
Ten years ago, an article appeared in the medical journal The Lancet that
suggested a link between autism and the vaccine for measles, mumps, and
rubella. The article was widely cited in the popular press in England, and
vaccination rates began to fall. Further investigations revealed that the
data in the study had been manipulated, and that the principal scientist
had a major conflict of interest, with the result that 10 of his 12
co-authors repudiated the study's findings.
Numerous scientific studies since 1998, all done in response to the
original Lancet article, have failed to find any link between autism and
vaccines. Despite this, a few scientists and doctors continue to push the
connection, often accompanying their claims with promises of "alternative"
treatments for autism. The press keeps the issue alive by reporting "the
controversy," often accompanying their reports with emotional testimonials
from parents, including several celebrities. As a consequence of this
publicity, vaccination rates are now falling in the United States, leading
to alarming new outbreaks of diseases.
Scientists and skeptics need to act to quell the rumors and educate the
public, so that vaccines, one of the greatest medical successes in
history, remain an effective tool in our fight against disease.
This talk was also presented to the National Capital Area Skeptics on Nov. 8.
2 p.m. Thursday December 11, 2008
Title: Metagenomics research at
the CBCB
By: Mihai Pop, Ph.D.
Venue: Biomolecular Science
Building Room 3118
Abstract:
Metagenomics is a new scientific field aimed at uncovering the diversity
of microbial life on Earth and the contributions of microbes to our
health and to the health of our environment. In my talk I will present
several of our recent results on the development of new analysis tools
for metagenomic data. I will also provide a survey of other ongoing
research on metagenomics at the CBCB and highlight future plans.
2 p.m. Friday December 12, 2008
The Dissertation Defense for the Degree
of Ph.D.
Title: A framework for discovering
meaningful associations in the annotated life sciences web
By: Woei-jyh (Adam) Lee
Venue: Biomolecular Science
Building Room 3118
Abstract:
During the last decade, life sciences researchers have gained access to
the entire human genome, reliable high-throughput biotechnologies,
affordable computational resources, and public network access. This has
produced vast amounts of data and knowledge captured in the life sciences
Web, and has created the need for new tools to analyze this knowledge and
make discoveries. Consider a simplified Web of three publicly accessible
data resources: Entrez Gene, PubMed, and OMIM. Data records in each resource
are annotated with terms from multiple controlled vocabularies (CVs). The
links between data records in two resources form a relationship between
the two resources. Thus, a record in Entrez Gene, annotated with GO terms,
can have links to multiple records in PubMed that are annotated with MeSH
terms. Similarly, OMIM records annotated with terms from SNOMED CT may
have links to records in Entrez Gene and PubMed. This forms a rich web of
annotated data records.
The objective of this research is to develop the Life Science Link
(LSLink) methodology and tools to discover meaningful patterns
across resources and CVs. In a first step, we execute a protocol to follow
links, extract annotations, and generate datasets of termlinks, which
consist of data records and CV terms. We then mine the termlinks of the
datasets to find potentially meaningful associations between pairs of
terms from two CVs. Biologically meaningful associations of pairs of CV
terms may yield innovative nuggets of previously unknown knowledge.
Moreover, the bridge of associations across CV terms will reflect the
practice of how scientists annotate data across linked data repositories.
Contributions include a methodology to create background datasets, metrics
for mining patterns, the application of semantic knowledge for
generalization, tools for discovery, and validation with biological use cases.
Inspired by research in association rule mining and linkage analysis, we
develop two metrics to determine support and confidence scores in the
associations of pairs of CV terms. Associations that have a statistically
significant high score and are biologically meaningful may lead to new
knowledge. To further validate the support and confidence metrics, we
develop a secondary test for significance based on the hypergeometric
distribution. We also exploit the semantics of the CVs. We aggregate
termlinks over siblings of a common parent CV term and use them as
additional evidence to boost the support and confidence scores in the
associations of the parent CV term. We provide a simple discovery
interface where biologists can review associations and their scores.
Finally, a cancer informatics use case validates the discovery of
associations between human genes and diseases.
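The scoring idea in the abstract can be sketched with the textbook association-rule definitions of support and confidence plus a hypergeometric tail probability. This is a minimal sketch under stated assumptions: each termlink is reduced to a pair of terms (one from each CV), the formulas are the standard ones rather than the exact LSLink metrics, and the function name and term identifiers are hypothetical.

```python
from math import comb


def association_scores(termlinks, term_a, term_b):
    """Return (support, confidence, p_value) for the pair (term_a, term_b).

    termlinks is a list of (term from CV 1, term from CV 2) pairs.
    Standard association-rule definitions are assumed; they are not
    necessarily the metrics defined in the dissertation.
    """
    n_total = len(termlinks)
    if n_total == 0:
        raise ValueError("termlinks must not be empty")
    n_a = sum(1 for a, _ in termlinks if a == term_a)
    n_b = sum(1 for _, b in termlinks if b == term_b)
    n_ab = sum(1 for a, b in termlinks if a == term_a and b == term_b)

    support = n_ab / n_total
    confidence = n_ab / n_a if n_a else 0.0

    # Hypergeometric upper tail: probability of seeing at least n_ab
    # co-occurrences when n_a termlinks are drawn from n_total, of
    # which n_b carry term_b.
    p_value = sum(
        comb(n_b, k) * comb(n_total - n_b, n_a - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    ) / comb(n_total, n_a)
    return support, confidence, p_value


# Hypothetical termlinks between a GO-like and a MeSH-like vocabulary.
links = (
    [("GO:term1", "MeSH:termA")] * 3
    + [("GO:term1", "MeSH:termB")]
    + [("GO:term2", "MeSH:termA")]
    + [("GO:term2", "MeSH:termB")] * 3
)
support, confidence, p_value = association_scores(links, "GO:term1", "MeSH:termA")
print(support, confidence, p_value)
```

A small p-value suggests the two terms co-occur more often than chance would predict; whether such an association is biologically meaningful still requires expert review.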
Examining Committee:
Committee Chair: Dr. Louiqa
Raschid
Dean's Representative: Dr.
Stephen M. Mount
Committee Members: Dr. Mihai
Pop, Dr. Carl Kingsford,
Dr. Jimmy Lin