CBCB Seminar Series
Spring 2009
2 p.m. Thursday February 5, 2009
Title:
Improved assembly of the Bos taurus genome.
By: Aleksey Zimin, Institute for
Physical Science and Technology, University of Maryland
Venue: Biomolecular Science
Building Room 3118
Abstract:
The genome of the cow (Bos taurus) was recently sequenced and assembled
by the Baylor College of Medicine (BCM) Human Genome Sequencing Center
(HGSC). BCM's latest draft is called Btau4.2. We produced an
independent assembly from the public Trace Archive data
using a variety of methods, including the Celera Assembler,
the UMD Overlapper, and additional assembly debugging, mapping, and
improvement
tools. We used publicly available map data to map the scaffolds onto the
chromosomes. Our latest draft,
called Bos_taurus_UMD_2.0, was released in November 2008.
Bos_taurus_UMD_2.0 places almost 6% more sequence onto the chromosomes and
fixes a number of large inversions and omissions that are present in
Btau4.2; these fixes have been independently verified by our collaborators.
In this talk the two assemblies will be compared on a variety of
criteria, including quantitative measures, agreement with the published
maps, and the amount of coding sequence present. The procedures used to
create the assembly and map the assembled scaffolds onto the chromosomes
will be described briefly.
Our assembly is publicly available and is posted on our FTP site:
ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/
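The "quantitative measures" mentioned above typically include statistics
such as N50. As a minimal illustration (not the UMD pipeline itself), the
sketch below computes N50 from a list of contig lengths; the lengths are
invented.

    # Minimal sketch: N50, a standard quantitative measure used when
    # comparing assemblies. Contig lengths below are hypothetical.

    def n50(lengths):
        """Length L such that contigs of length >= L cover half the assembly."""
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    if __name__ == "__main__":
        contig_lengths = [120_000, 95_000, 40_000, 15_000, 5_000]  # invented
        print("total bases:", sum(contig_lengths))
        print("N50:", n50(contig_lengths))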
Also: a brief meeting to discuss the schedule for Spring 2009.
2 p.m. Friday February 6, 2009
(UMIACS Special Seminar)
Title: Achieving Anonymity in
Clinical Genomics Databases: Is it Possible?
By: Bradley A. Malin, Ph.D.,
Vanderbilt University
Venue: A.V. Williams
Building Room 3258
Abstract:
For years, medical researchers have been directed to de-identify patients'
health records and biological data before such information is shared
beyond the collecting institution. This policy is reinforced by
Institutional Review Boards, as well as regulations at the state and
federal level, such as the Privacy Rule of the Health Insurance
Portability and Accountability Act. De-identified data appears to be
protected; however, the decreasing costs and increasing adoption of
information and networking technologies have created a complex landscape
that has eroded the protections afforded by such policies.
Consequently, our research has shown that de-identification provides
little in the way of protection guarantees. In this talk, I will review
various automated approaches we have developed to link patients'
identities to seemingly anonymous biomedical data, often using nothing
more than publicly available information. Yet, I will also explore why
all hope is not lost and how we can integrate policy with statistical and
computational formalisms to provably measure the risks associated with
sharing data according to various policies, as well as how to provably
protect patients' records from privacy-invading attacks without impeding
the workflow of worthwhile biomedical research endeavors. This talk will
draw upon real emerging biomedical research infrastructures, such as
de-identified repositories of electronic medical and genomic records at
the National Institutes of Health.
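The abstract does not name the formalisms involved, but a standard
starting point in this literature is k-anonymity: a record is protected
only if its combination of quasi-identifiers is shared by at least k
records. The sketch below, with invented records, illustrates the
measurement idea only; it is not Dr. Malin's method.

    # Illustrative sketch only (not the speaker's method): measuring
    # k-anonymity, one standard formalism for re-identification risk.
    # A record is at risk if its combination of quasi-identifiers
    # (e.g., ZIP code, birth year, sex) is shared by fewer than k records.

    from collections import Counter

    def k_anonymity(records, quasi_identifiers):
        """Return the smallest group size over quasi-identifier combinations."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(groups.values())

    if __name__ == "__main__":
        records = [  # invented de-identified records
            {"zip": "20742", "birth_year": 1970, "sex": "F", "diagnosis": "A"},
            {"zip": "20742", "birth_year": 1970, "sex": "F", "diagnosis": "B"},
            {"zip": "20740", "birth_year": 1985, "sex": "M", "diagnosis": "C"},
        ]
        k = k_anonymity(records, ["zip", "birth_year", "sex"])
        print("k =", k)  # k = 1: the last record is uniquely re-identifiable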
Biography:
Brad Malin is an Assistant Professor of Biomedical Informatics in the
School of Medicine and an Assistant Professor of Computer Science in the
School of Engineering at Vanderbilt University. He is the founder and
director of Vanderbilt's Health Information Privacy Laboratory (HIPLAB),
which integrates computer science, policy, and biomedical knowledge to
construct privacy enhancing technologies for emerging health information
systems. His research on data privacy in electronic medical and genomic
repositories has received several awards from the American and
International Medical Informatics Associations and has been cited in
various congressional briefings. Among other sponsored research projects,
he currently directs a program in data privacy risk evaluation and
protection for the National Human Genome Research Institute at the
National Institutes of Health. He received a doctorate and master's in
computer science, a master's in public policy and management, and a
bachelor's in biological sciences, all from Carnegie Mellon University.
2 p.m. Thursday February 12, 2009
Title:
LOCST: a Low Complexity Sequence Search Tool
By: Stephen M. Mount, University of Maryland
Venue: Biomolecular Science
Building Room 3118
Abstract: Alignment-based tools such as BLAST are in
widespread use for identifying similar proteins. Low-complexity regions are typically
not included in such alignments even though they are often important for function. Examples
include arginine-serine-rich proteins involved in splicing and proline-rich, glutamine-rich, and
acidic transcription activation domains. An approach for identifying and evaluating similar
low-complexity regions within proteins based on shared repeated dipeptides will be presented,
as will its implementation in the program LOCST (Low Complexity Sequence Search Tool).
This work was performed with Nicolas Tilmans and Stephen Fiorelli.
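As a rough illustration of the stated idea (shared repeated dipeptides),
and not the actual LOCST implementation, one can score a pair of proteins
by the dipeptides that are repeated in both sequences; the sequences and
threshold below are invented.

    # Illustrative sketch (not the LOCST implementation): represent each
    # protein by its dipeptide counts and score pairs by shared repeated
    # dipeptides, so low-complexity regions such as arginine-serine (RS)
    # repeats score highly even when alignments fail.

    from collections import Counter

    def dipeptide_counts(seq):
        return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

    def shared_repeat_score(a, b, min_count=2):
        """Sum of shared counts over dipeptides repeated in both sequences."""
        ca, cb = dipeptide_counts(a), dipeptide_counts(b)
        return sum(min(ca[d], cb[d])
                   for d in ca.keys() & cb.keys()
                   if ca[d] >= min_count and cb[d] >= min_count)

    if __name__ == "__main__":
        rs_domain_1 = "RSRSRSRSRSRSPRS"   # invented RS-rich regions
        rs_domain_2 = "SRSRSRSRSDRSRSR"
        globular    = "MKTAYIAKQRQISFVK"  # invented globular sequence
        print(shared_repeat_score(rs_domain_1, rs_domain_2))  # high
        print(shared_repeat_score(rs_domain_1, globular))     # low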
2 p.m. Thursday February 26, 2009
Title:
Protein Annotation Prediction By Clustering Within Interaction Networks
By: Carl Kingsford, University of Maryland
Venue: Biomolecular Science
Building Room 3118
Abstract:
Determining protein function is a fundamental biological challenge, and
protein-protein interaction networks are an increasingly useful data source
from which to computationally predict protein annotations. One approach to
automated detection of protein complexes and prediction of biological processes
is to divide an interaction network into biologically meaningful modules or
clusters. I will present several graph clustering techniques and illustrate
their usefulness for predicting protein annotations. I will describe a novel
method for decomposing a hierarchical clustering tree into a collection of
clusters that optimally match a set of known annotations. We find that our
approach generally outperforms commonly used heuristics for identifying protein
complexes from hierarchical clusterings. The technique is general and may be
of use in other applications where hierarchical clustering is used. I will
also show how a graph compression technique called graph summarization leads to
more biologically meaningful modules than other graph clustering algorithms.
Time permitting, I will also describe how protein interaction networks can be
used to transfer functional annotations between species.
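One plausible reading of the tree-decomposition idea, sketched here under
assumptions (the tree and scoring function are invented, and this is not
necessarily the talk's exact algorithm), is a dynamic program over the
hierarchical clustering tree: each node either becomes one cluster or
defers to the best cuts of its children.

    # Hedged sketch of the core idea: choose subtrees of a hierarchical
    # clustering tree that maximize agreement with known annotations.

    def best_cut(node, score):
        """node = (members, children); returns (best_score, clusters)."""
        members, children = node
        keep_score = score(members)  # score of taking this subtree whole
        if not children:
            return keep_score, [members]
        child_results = [best_cut(c, score) for c in children]
        split_score = sum(s for s, _ in child_results)
        if keep_score >= split_score:
            return keep_score, [members]
        clusters = [c for _, cs in child_results for c in cs]
        return split_score, clusters

    if __name__ == "__main__":
        # Invented tree: proteins a,b share an annotation; c,d share another.
        tree = ({"a", "b", "c", "d"},
                [({"a", "b"}, []), ({"c", "d"}, [])])
        annotated = [{"a", "b"}, {"c", "d"}]
        def score(cluster):  # toy score: +1 per exactly matched annotation
            return sum(1 for ann in annotated if ann == cluster)
        print(best_cut(tree, score))  # best score 2: clusters {a,b} and {c,d}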
4:00 p.m. Monday March 23, 2009
Title:
Characterization of Human Epigenomes
By:
Keji Zhao
Senior Investigator
Laboratory of Molecular Immunology, National Heart, Lung, and Blood Institute, National Institutes of Health
Venue:
Room 0467 Animal Sciences Building
2:00 p.m. Thursday March 26, 2009
Title:
Protein recognition and gating in the ribosome exit tunnel
By:
Paula Petrone, Stanford University Department of Biophysics, Group of Prof. Dr. V. Pande
Venue:
Biomolecular Science Building Room 3118
Abstract:
The ribosome is a large, complex catalyst responsible for the synthesis of new proteins, an essential function for life. New
proteins emerge from the ribosome through an exit tunnel as nascent polypeptide chains. Recent findings indicate that
tunnel interactions with the nascent polypeptide chain might be relevant for the regulation of translation. However, the
specific ribosomal structural features that mediate this process are unknown. In my talk, I will describe the computational
methods I have developed for studying the physicochemical environment of the tunnel. Our simulations of the interactions
between components of the ribosome exit tunnel and different chemical probes indicate that transport out of the tunnel
may differ among amino acid species. By relating our simulation data to earlier biochemical
studies, our analysis provides a context for interpreting sequence-dependent nascent chain phenomenology in the ribosome
tunnel.
11:00 a.m., Tuesday March 31, 2009
Title:
pplacer: Bayesian phylogenetic placement of metagenomic short reads
By:
Erick Matsen, U.C. Berkeley
Venue:
Biomolecular Science Building Room 3118
Abstract:
An abundance of metagenomic short reads raises a very difficult question for
bioinformaticians: how do these short reads fit into previously characterized
diversity? Equally important, how do we get confidence intervals on these
placements? In this talk I will present "pplacer", which places short reads in
a user-supplied reference gene tree. Pplacer takes a statistically rigorous
Bayesian approach, where positions of the fragment sequence are evaluated
according to normalized posterior probability; because we are fixing a
reference tree, we can perform direct numerical integration over the likelihood
function to obtain confidence estimates rather than resorting to MCMC. Pplacer
is the first such stand-alone program that allows the user to supply a
reference alignment and tree; it can be used via a simple command line
interface or as part of a pipeline. We have also implemented a large-scale
fragment simulation pipeline which allows the user to empirically determine an
appropriate "cutoff" for accurate short read placement. Such simulations have
also given us a new perspective on what phylogenetic placement scores mean,
namely that posterior probability being spread over a number of locations can
indicate global rather than local uncertainty.
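The statistical core described above can be illustrated in miniature:
integrate each candidate edge's likelihood over the pendant branch length,
then normalize across edges. The likelihood curves below are invented
stand-ins for real phylogenetic likelihoods; this is not the pplacer code.

    # Minimal sketch of the statistical idea (not pplacer itself).

    import math

    def integrate(f, lo, hi, steps=1000):
        """Simple trapezoidal numerical integration."""
        h = (hi - lo) / steps
        total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
        return total * h

    def placement_posteriors(edge_likelihoods, max_branch_length=2.0):
        """edge_likelihoods: {edge: likelihood as a function of branch length}."""
        marginals = {e: integrate(f, 0.0, max_branch_length)
                     for e, f in edge_likelihoods.items()}
        z = sum(marginals.values())
        return {e: m / z for e, m in marginals.items()}

    if __name__ == "__main__":
        toy = {  # invented per-edge likelihood curves
            "edge_1": lambda t: math.exp(-5 * (t - 0.1) ** 2),
            "edge_2": lambda t: 0.3 * math.exp(-5 * (t - 0.4) ** 2),
        }
        for edge, p in placement_posteriors(toy).items():
            print(edge, round(p, 3))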
1:00 p.m. Thursday, April 2, 2009
Title:
Novel approaches to metagenomic analysis
By:
James Robert White
Venue:
Biomolecular Science Building Room 3118
(CBCB seminar and presentation for AMSC candidacy exam; note earlier time)
Abstract:
The human body plays host to thousands of bacterial species in a variety of ecosystems. Until recently, microbial
communities have been impossible to investigate thoroughly, as the vast majority of bacteria cannot be cultured through
laboratory techniques. New technologies (e.g. high-throughput sequencing, 16S rRNA surveys) allow us to deeply sample
the genetic content of a microbial environment in order to estimate its overall composition and functional capacity.
Recent studies in this context have revealed that human obesity has a microbial component: obese gut microbiomes are
distinct from those of the lean population. This result suggests potential therapeutic approaches to treating obesity by
manipulating gut microflora. However, our limited knowledge of the microbial interactions in the gut hinders our
ability to design future experiments or effective treatments. Using 16S rRNA time-series sequence data from obese
individuals on a one-year diet, I have employed a mathematical model to study microbial population dynamics in the
human gut. In this talk I will discuss the model formulation and predicted competitive and commensal interactions among
dominant phyla in the distal gut. I will further discuss the application of this model to estimate the potential impact
of prebiotic and probiotic therapies for treating human obesity. Through this problem, I hope to illustrate the insight
mathematical modeling can bring to the field of metagenomics.
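The abstract does not specify the model, but generalized Lotka-Volterra
dynamics are a common choice for capturing competitive (negative) and
commensal (positive) interactions among taxa. A minimal sketch with
invented parameters:

    # Hedged sketch: generalized Lotka-Volterra dynamics, a common (assumed)
    # model form; all parameters below are invented.

    def simulate_glv(x0, growth, interactions, dt=0.01, steps=5000):
        """Euler integration of dx_i/dt = x_i * (growth_i + sum_j A_ij * x_j)."""
        x = list(x0)
        n = len(x)
        for _ in range(steps):
            dx = [x[i] * (growth[i] + sum(interactions[i][j] * x[j]
                                          for j in range(n)))
                  for i in range(n)]
            x = [max(0.0, x[i] + dt * dx[i]) for i in range(n)]
        return x

    if __name__ == "__main__":
        # Two hypothetical phyla: each limits itself, and they compete weakly.
        growth = [0.8, 0.6]
        interactions = [[-1.0, -0.3],
                        [-0.2, -1.0]]
        print(simulate_glv([0.1, 0.1], growth, interactions))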
2:00 p.m. Thursday, April 16, 2009
Title:
Towards a de novo Short Read Assembler for Large Genomes using Cloud Computing
By:
Michael Schatz
Venue:
Biomolecular Science Building Room 3118
(CBCB seminar and presentation for preliminary oral examination in Computer Science)
Abstract:
The massive volume of data and short read lengths from next-generation
DNA sequencing machines have spurred development of a new class of
short read genome assemblers. Several of the new assemblers, such as Velvet
and Euler-USR, model the assembly problem as constructing, simplifying, and
traversing the de Bruijn graph of the read sequences, where nodes in the
graph represent k-mers in the reads, with edges between nodes for
consecutive k-mers. This approach has many advantages for these data, such
as efficient computation of overlapping reads and robust handling of
sequencing errors, and has demonstrated success for assembling small to
moderately sized genomes. However, this approach is computationally
challenging to scale to mammalian-sized genomes because it requires
constructing and manipulating a graph far larger than can fit into memory.
Drawing on the success of
CloudBurst, a MapReduce-based short read
mapping algorithm capable of mapping millions of reads to the human genome
with high sensitivity, we have developed a MapReduce-based short read
assembler that shows tremendous potential for enabling de novo assembly of
mammalian-sized genomes. The de Bruijn graph is constructed with MapReduce by
emitting and then grouping key-value pairs (k_i, k_j) between successive k-mers
in the read sequences. After construction, MapReduce is used again to
execute parallel graph transformations to remove spurious nodes and edges
from the graph caused by sequencing error in the reads, and to compress
simple chains of nodes into long sequence nodes representing the unambiguous
regions of the genome between repeat boundaries. The resulting graph is a
small fraction of the size of the original deBrujin graph, and is output in
a format compatible with other short read assemblers for additional
analysis.
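The graph-construction step described above can be mimicked locally (a toy
simulation, not the actual MapReduce implementation): the map phase emits a
pair (k_i, k_j) for each pair of successive k-mers in a read, and the
reduce phase groups pairs by key to form each node's outgoing edges.

    # Toy local simulation of the MapReduce de Bruijn construction step.

    from collections import defaultdict

    K = 4  # k-mer size (illustrative; real assemblers use larger k)

    def map_phase(read):
        """Emit (k-mer, next k-mer) pairs for successive k-mers in a read."""
        for i in range(len(read) - K):
            yield read[i:i + K], read[i + 1:i + 1 + K]

    def reduce_phase(pairs):
        """Group values by key: node -> set of successor nodes."""
        graph = defaultdict(set)
        for key, value in pairs:
            graph[key].add(value)
        return graph

    if __name__ == "__main__":
        reads = ["ACGTACGG", "CGTACGGA"]  # invented reads
        pairs = (pair for read in reads for pair in map_phase(read))
        for node, successors in sorted(reduce_phase(pairs).items()):
            print(node, "->", sorted(successors))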
2:00 p.m. Tuesday, April 28, 2009
Title:
Whole-Genome Sequence Analysis for Pathogen Detection and Diagnostics
By: Adam Phillippy
Venue:
Biomolecular Science Building Room 3118
(CBCB seminar and presentation for preliminary oral examination in Computer Science)
Abstract:
Pathogenic microbes, both natural and weaponized, pose
significant dangers to human health and safety. To effectively prevent
infection and fight disease, it is essential to rapidly detect and
characterize pathogens in any environmental or clinical medium with high
accuracy. Now that the genome sequences of hundreds of bacteria and viruses
are known, it is possible to design biomolecular tests to rapidly detect and
characterize pathogens based solely on their DNA. These tests can detect a
pathogen in a complex mixture of organic material by recognizing short,
distinguishing sequences (called DNA signatures) that occur in the pathogen
and not in any other species.
I will present a novel computational method, called Insignia, for
identifying DNA signatures, and show that these signatures can be used as
the basis for biomolecular assays to detect and genotype pathogens in
real-time and with high accuracy. Insignia utilizes highly efficient string
algorithms and distributed computing to compare over 100 billion nucleotides
of genomic DNA from bacteria, viruses, plants, animals, and humans. The results
of this computation are stored in a unique data structure that compresses
the data and permits rapid retrieval of genomic signatures for any set of
target genomes. Signature retrieval is made available through a web
application, making it accessible to users who may lack high-throughput
computing resources. Hundreds of signatures identified by Insignia have
undergone rigorous laboratory validation, showing that they are both
sensitive and specific for detection of pathogens at the species level.
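Conceptually (setting aside the efficient string algorithms and the
100-billion-nucleotide scale of the real pipeline), a signature is a k-mer
present in all target genomes and absent from every background genome. A
toy sketch with invented sequences:

    # Illustrative sketch of the signature concept (not the Insignia pipeline).

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def signatures(targets, background, k):
        """k-mers shared by all targets and absent from all background genomes."""
        shared = set.intersection(*(kmers(t, k) for t in targets))
        for b in background:
            shared -= kmers(b, k)
        return shared

    if __name__ == "__main__":
        targets = ["ACGTTGCAACGT", "TTACGTTGCAA"]     # invented pathogen strains
        background = ["ACGTACGTACGT", "GGGTTGCAGGG"]  # invented near neighbors
        print(sorted(signatures(targets, background, k=5)))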
2:00 p.m. Thursday, May 14, 2009
Title:
Whole Genome Profiling - a novel method for physical mapping and
sequence assembly
By: Jan van Oeveren, KeyGene
Venue: Biomolecular Science
Building Room 3118
Abstract:
We developed a novel technology, whole genome profiling (WGP), which
uses the power of next-generation sequencing technologies, such as
Illumina GA, to identify unique sequence tags and construct
high-quality physical maps. These maps are constructed by sequence-based
fingerprinting of a BAC library. Pooled BAC clones are identified by
amplified restriction fragments and the ends of these fragments are
sequenced to obtain WGP tags, most of which are unique in the genome.
Subsequently, fingerprint contig building software is used to align the
tagged BACs into a whole-genome physical map. The resulting map with
unique WGP tags will provide anchor points to link sequence read
assemblies and thus integrate WGP with whole-genome shotgun (WGS)
sequencing to obtain high-quality, high-coverage genome assemblies.
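A hedged sketch of the contig-building idea (not KeyGene's fingerprinting
software): treat each BAC as its set of WGP tags and merge BACs into
map contigs whenever they share enough tags; the tags and threshold below
are invented.

    # Illustrative sketch: group tagged BACs into physical-map contigs by
    # shared WGP tags, via a simple union-find / connected-components pass.

    def build_contigs(bac_tags, min_shared=2):
        """bac_tags: {bac_name: set of tags}; returns contigs as BAC sets."""
        names = list(bac_tags)
        parent = {n: n for n in names}

        def find(n):
            while parent[n] != n:
                parent[n] = parent[parent[n]]  # path halving
                n = parent[n]
            return n

        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if len(bac_tags[a] & bac_tags[b]) >= min_shared:
                    parent[find(a)] = find(b)  # union overlapping BACs

        contigs = {}
        for n in names:
            contigs.setdefault(find(n), set()).add(n)
        return list(contigs.values())

    if __name__ == "__main__":
        bacs = {  # invented BACs with invented tag IDs
            "BAC1": {"t1", "t2", "t3"},
            "BAC2": {"t2", "t3", "t4"},
            "BAC3": {"t8", "t9"},
        }
        print(build_contigs(bacs))  # BAC1 and BAC2 merge; BAC3 stands alone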