CBCB Seminar Series
Fall 2005
2 p.m. Wednesday August 31, 2005
Title: organizational meeting
By: Stephen M. Mount, Ph.D.
Venue: A.V. Williams Building Room
3258
Abstract: To discuss the
schedule for Fall 2005.
2 p.m. Wednesday September 14, 2005
Title: A modular assembly package
and applications
By: Mihai Pop, Ph.D.
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Recent developments in assembly algorithms have been a key element in our
ability to sequence the genomes of organisms. Most notably, advances in
genome assembly algorithms have led to an accelerated completion of the
human genome project. Existing assembly programs, such as Celera
Assembler and Arachne, are, however, difficult to use and often require
user intervention to obtain the best assembly. Furthermore, they are
ill-suited to specialized assembly tasks such as the assembly of highly
polymorphic genomes (e.g. the sea squirt Ciona savignyi) or the assembly
of uncultured organisms directly from the environment (e.g. metagenomic
analysis of the bacterial populations in the Sargasso Sea). To address
these issues we set out to implement a flexible framework for the
development of assembly algorithms. This project, AMOS, provides
scientists with well documented interfaces, a uniform representation of
assembly data, as well as numerous utilities for manipulating and
analyzing genome assemblies.
During my talk I will provide you with an overview of the AMOS project. I
will also describe examples of the applications of AMOS to comparative
genome assembly, metagenomic analyses and heterozygous SNP detection.
2 p.m. Wednesday September 21, 2005
Title: An Introduction to
Probabilistic Relational Models for Biological and Clinical Applications
By: Lise Getoor, Ph.D.
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
A large portion of real-world biological data is stored in commercial
relational database systems. In contrast, most statistical learning
methods work only with "flat" data representations. Thus, to apply these
methods, we are forced to convert the data into a flat form, thereby
losing much of the relational structure present in the data and
potentially introducing statistical skew. These drawbacks severely limit
the ability of current methods to mine relational databases.
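As a hedged illustration of the statistical skew mentioned above, the
sketch below flattens a toy two-table database with a join. The table and
column names (patients, isolates, age) and all values are hypothetical and
were invented for this example; they do not come from the talk.

```python
# Hypothetical toy data: names and values are assumptions for illustration only.
patients = {
    "p1": {"age": 30},
    "p2": {"age": 60},
}

# One-to-many relation: each isolate row refers to one patient.
isolates = [
    {"isolate": "i1", "patient": "p1"},
    {"isolate": "i2", "patient": "p2"},
    {"isolate": "i3", "patient": "p2"},
    {"isolate": "i4", "patient": "p2"},
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def flatten(patients, isolates):
    """Join the two tables into one flat row per isolate."""
    return [
        {
            "isolate": row["isolate"],
            "patient": row["patient"],
            "age": patients[row["patient"]]["age"],
        }
        for row in isolates
    ]


# Mean age computed per patient, on the relational form.
true_mean_age = mean(p["age"] for p in patients.values())

# Mean age computed on the flat form: patient p2 is counted three times.
flat_rows = flatten(patients, isolates)
flat_mean_age = mean(row["age"] for row in flat_rows)

print(true_mean_age, flat_mean_age)
```

In the flat table the patient with three isolates is repeated three times,
so the mean age shifts from 45.0 to 52.5 even though no patient changed.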
In this talk I will review recent work on probabilistic models, including
Bayesian networks (BNs) and Markov Networks (MNs) and their relational
counterparts, Probabilistic Relational Models (PRMs) and Relational
Markov Networks (RMNs). I'll briefly describe the development of
techniques for automatically inducing PRMs directly from structured data
stored in a relational or object-oriented database. These algorithms
provide the necessary tools to discover patterns in structured data, and
provide new techniques for mining relational data. As we go along, I'll
present experimental results in several domains, including a biological
domain describing tuberculosis epidemiology, a database of scientific
paper author and citation information, and Web data.
2 p.m. Wednesday September 28, 2005
Title: Improving Genome Assemblies
without Sequencing
By: Michael Schatz
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Genome assembly is the problem of reconstructing the genome sequence
of an organism from a collection of short sequenced reads. An assembly
takes the form of contiguous stretches of DNA sequence (contigs)
linked together in scaffolds by mate-pair and other information.
Genome assembly is scientifically one of the most important areas of
bioinformatics research as an accurate genome sequence is needed for
addressing several fundamental biological questions. Unfortunately, it
is also one of the most computationally complex: it has been proved
NP-hard under various formalisms, and a typical problem involves
thousands or millions of inputs.
During my talk, I will discuss some of the algorithmic challenges and
trade-offs in genome assembly. I will also discuss some computational
methods for improving an assembly, which can be applied generally but
without requiring additional laboratory results. One method was
implemented in AutoEditor, which acts as a second generation
base-caller to find and correct base-calling errors in reads using the
original chromatogram trace and the multiple alignment of reads. A
second was implemented in AutoJoiner, which attempts to automatically
close gaps between linked contigs, and generally enhance contig
quality, by extending the usable portion of reads within an assembly.
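The idea of correcting base calls from the multiple alignment can be
sketched as follows. This is a deliberately simplified illustration and
not AutoEditor's actual procedure, which also re-examines the chromatogram
trace; here a per-base quality value stands in for the trace evidence. The
function name, its parameters and the thresholds are assumptions made for
this example.

```python
from collections import Counter


def correct_column_errors(reads, quals, min_depth=3, max_qual=20):
    """Return corrected copies of aligned reads.

    reads: equal-length strings, one row per read in the multiple alignment.
    quals: per-base quality values with the same shape as reads.

    A base is changed only when it is the single disagreeing base in its
    column, every other read agrees, the column has at least min_depth
    reads, and the disagreeing base has quality below max_qual.
    """
    corrected = [list(r) for r in reads]
    for col in range(len(reads[0])):
        column = [r[col] for r in reads]
        counts = Counter(column)
        if len(column) < min_depth or len(counts) != 2:
            continue
        (major, n_major), (minor, n_minor) = counts.most_common(2)
        if n_minor != 1 or n_major < 2:
            continue
        row = column.index(minor)
        if quals[row][col] < max_qual:
            corrected[row][col] = major
    return ["".join(r) for r in corrected]


aligned = ["ACGT", "ACGT", "ATGT"]
low_quality = [[40] * 4, [40] * 4, [40, 10, 40, 40]]
print(correct_column_errors(aligned, low_quality))
```

A high-quality disagreement is left untouched on purpose, since it may
reflect a real difference between the reads rather than a base-calling error.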
2 p.m. Wednesday October 5, 2005
Title: Aggressive Enumeration of
Peptide Sequences for Peptide Identification by Tandem Mass Spectrometry
By: Nathan Edwards, Ph.D.
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Peptide identification from tandem mass spectra is a critical part of
comprehensive proteome analyses. The search engines that analyze these
spectra, such as Mascot, SEQUEST, X!Tandem, or OMSSA, use amino-acid
sequence databases, such as SwissProt, to provide putative peptides to
compare against each spectrum. This approach fails to identify peptides
missing from the sequence database. We argue that amino-acid sequence
databases used for peptide identification should be aggressively inclusive
of potential peptide sequences, rather than conservative, and show that
this need not increase search engine running times significantly.
We begin with a whirlwind tour of peptide identification from tandem mass
spectra and methods used to analyze them. We'll quickly cover the
traditional methods used to populate protein sequence databases to
understand why peptide sequences might be left out. We show how different
types of peptide sequence evidence might be aggressively integrated into
an inclusive peptide sequence database. Further, we demonstrate that,
efficiently represented, an inclusive amino-acid sequence database of
peptides can, in some cases, be smaller than sequence databases in common
usage.
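One way to see why an inclusive peptide database need not be large is to
store each distinct peptide once. The sketch below is an assumption-laden
illustration, not the representation used in this work: it digests two
made-up protein entries with a simple trypsin rule (cleave after K or R
unless the next residue is P, no missed cleavages) and keeps the set of
distinct peptides. The sequences and the min_len cutoff are hypothetical.

```python
def tryptic_peptides(protein, min_len=5):
    """In-silico tryptic digest: cleave after K or R, but not before P."""
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


# Two hypothetical entries that differ by a single residue near the end.
entries = [
    "MKTAYIAKQRQISFVKSHFSR",
    "MKTAYIAKQRQISFVKSHFTR",
]

# Inclusive database: every distinct peptide is stored exactly once.
inclusive_db = set()
for entry in entries:
    inclusive_db.update(tryptic_peptides(entry))

entry_residues = sum(len(e) for e in entries)
peptide_residues = sum(len(p) for p in inclusive_db)
print(sorted(inclusive_db), entry_residues, peptide_residues)
```

Here the two entries hold 42 residues, while the four distinct peptides
hold 22, because the shared peptides are not repeated.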
Ongoing research with Chau-Wen Tseng and Xue Wu.
2 p.m. Wednesday October 12, 2005
Title: A Methodology to Enhance
the Semantics of Links Between PubMed Publications and Markers in the
Human Genome
By: Woei-Jyh (Adam) Lee
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Links in life science sources capture important biological knowledge.
However, current simple physical link implementations do not explicitly
represent this knowledge so that it can be easily shared among scientists.
We develop a methodology for link extraction and generation, and link
labeling to produce an enhanced e-link or an enhanced set of
e-links. The e-link associates each existing link with a
link label that captures semantics of the link. We develop a machine
assisted tool for curators to produce e-links and we develop a
search interface for biologists to discover interesting e-links.
Ongoing research with Prof. Louiqa Raschid and Dr. Alex Lash.
2 p.m. Wednesday October 19, 2005
Title: Peptide Identification by
Tandem Mass Spectrometry
By: Xue Wu
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Tandem mass spectra and associated database search algorithms are
essential proteomic tools for identifying peptides. Even though the
algorithms are effective and extensively used for identification, there
are still spectra left unidentified by these searches. Because the database
search approach compares theoretical spectra constructed from database
peptides against experimental spectra and assigns a score to each match, the
score function is the key factor that determines the sensitivity and
specificity of the search algorithm.
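The comparison of a theoretical spectrum with an experimental one can be
sketched with a shared-peak count. This is a deliberately simple stand-in
and not the score function of Mascot, X!Tandem or OMSSA. It assumes singly
charged b and y ions, monoisotopic residue masses, and a fixed mass
tolerance; the function names and the tolerance value are choices made for
this example.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def theoretical_spectrum(peptide):
    """Return the sorted m/z values of singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return sorted(peaks)


def shared_peak_count(theoretical, experimental, tol=0.5):
    """Count theoretical peaks that have an experimental peak within tol."""
    return sum(
        1 for t in theoretical
        if any(abs(t - e) <= tol for e in experimental)
    )


def best_candidate(candidates, experimental, tol=0.5):
    """Pick the database peptide whose theoretical spectrum scores highest."""
    return max(
        candidates,
        key=lambda pep: shared_peak_count(theoretical_spectrum(pep), experimental, tol),
    )
```

Sensitivity and specificity then depend on how well such a score separates
correct from incorrect candidates, which is why the choice of score
function matters.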
In this talk, I will present my recent research on the sensitivity and
specificity of database search score functions. I will briefly describe
the most popular database search algorithms (Mascot, X!Tandem and OMSSA),
then discuss their differences using experimental results and theoretical
analysis.
2 p.m. Wednesday October 26, 2005
Title: New Perspectives on Protein
Sequences and Structures
By: William R. Pearson, Ph.D. (University of Virginia)
Venue: Biomolecular Sciences
Building #296 Room 2118
Abstract:
High-throughput DNA sequencing, fast computers, and effective sequence
analysis algorithms produced a new approach to biological problems -
genome biology. Comprehensive sequence databases and effective
sequence programs like BLAST and FASTA are now routinely used to
annotate bacterial and eukaryotic genomes. While complete genome
sequencing is a cost-effective strategy for characterizing an
organism, the sequences also allow us to address fundamental questions
about protein folding, and the nature of protein space.
In this talk, I will discuss why protein sequence comparison "works"
-- why it is an effective strategy for finding homologous proteins
with similar structures. I will then extend that observation to ask
what the success of protein sequence comparison may tell us about the
constraints that allow them to fold. Specifically, I will consider
several widely held beliefs about protein sequences and structures:
(1) a small fraction of the protein's sequence determines the
structure; (2) structure comparison programs are much more sensitive
for finding distant homologs than sequence comparison; (3) there are
strong constraints on protein sequences producing "fold-able" motifs.
These beliefs reflect the view that protein folding is very, very hard,
and that the "space" of proteins that can fold has been largely
explored by nature. I will argue that current, comprehensive sequence
and structure information supports a different model of the protein
universe, in which known proteins represent a small fraction of the
possible proteins.
2 p.m. Wednesday November 2, 2005
Title: Structural views of
disease-causing mutations in PAH gene and using genetic evolution
to understand diseases' history
By: Zhen Shi
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
In the first part, I will discuss how phenylketonuria phenotypes correlate
with the structural defects that mutations cause in the corresponding
protein, phenylalanine hydroxylase (PAH). We propose destabilization of PAH
as a major effect of most disease-causing mutations. In the second part,
which will be more like a journal club, I want to discuss how evolutionary
and population genetics can help us uncover the history of diseases such as
cystic fibrosis (CFTR gene) and HIV (CCR5), among others.
2 p.m. Wednesday November 9, 2005
Title: Power Laws for Repeat
Strings in DNA
By: Suzanne Sindi
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
While it is often remarked that genomes contain significant amounts of
repetitive sequence, what is meant by repetitive sequence is left
ambiguous and generally includes cases where two segments of the genome
are very similar but not necessarily identical. To avoid this ambiguity we
define a representation of repetitive DNA we call a repeat string.
We observe and discuss power law distributions in the size and frequency
of repeat strings in the genomes of C. elegans, A. thaliana and Human
chromosome 21. We have developed an iterative evolutionary model that may
explain one of the observed power laws.
While the cause of the power laws remains unknown, these results suggest
that there may be more statistical structure than previously thought in
repetitive DNA. This structure may suggest ways to more efficiently
assemble repetitive regions of DNA.
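A rough way to check for a power law of the form f(s) = c * s^(-alpha) is
to fit a straight line on log-log axes. The sketch below does this with
ordinary least squares on synthetic input; it is not the analysis used in
this work, the data are generated rather than measured, and
maximum-likelihood estimators are generally preferred for real data. The
function name is an assumption made for this example.

```python
import math


def fit_power_law(sizes, freqs):
    """Least-squares fit of log(freq) = log(c) - alpha * log(size).

    Returns (alpha, c).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data generated from an exact power law with alpha = 2, c = 1000.
sizes = [10, 20, 40, 80, 160]
freqs = [1000.0 * s ** -2.0 for s in sizes]
alpha, c = fit_power_law(sizes, freqs)
print(alpha, c)
```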
2:30 p.m. Thursday November 17, 2005
Title: Trends in North American
Butterfly Populations using the 4th of July Butterfly Counts
By: Leslie Ries
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
The 4th of July butterfly counts are a nation-wide program to collect data
on the distribution and abundance of butterflies. Started in 1975 with 34
counts across the country, there are currently almost 500 counts that
occur each year. In each survey, volunteers count all the butterflies
they see in a single day (usually near the 4th of July) in a set location.
Although over 6000 surveys have been conducted over the past 30 years, no
rigorous analysis of the data has ever been performed. I recently acquired this
data set, and am now analyzing it to discover species that are showing any
shifts in populations (either increasing, declining or shifting their
range). I will present some of the challenges of working with this large
data set, and also present some preliminary results on trends in monarch
populations. This analysis focuses on comparing summer breeding
abundances of the migrating monarch butterfly with population densities on
its overwintering grounds in Mexico. Our results suggest that summer
reproduction, rather than winter mortality, drives population trends.
2 p.m. Thursday December 1, 2005
Title: Trimming Vector from Reads
By: Michael Roberts, Ph.D.
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Whole genome shotgun assembly (WGSA) is a method of determining the
sequence of a genome. The genome is broken into overlapping pieces, called
reads, for which the sequence can be determined. Unfortunately, there is
often a contaminant, called vector, on the end of a read. Trimming vector
from reads is an important part of WGSA, and has a large impact on the
quality of the final assembly. Unfortunately, accurate techniques for
vector trimming are not publicly available. We will present a new,
accurate method of vector trimming.
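To make the problem concrete, the sketch below trims vector from the start
of a read by finding the longest read prefix that matches the end of a
known vector sequence, allowing a few mismatches. This is a naive
illustration and not the new method presented in the talk. The function
name, the thresholds, and the example sequences are assumptions made for
this example.

```python
def trim_leading_vector(read, vector, min_overlap=8, max_mismatches=1):
    """Remove the longest read prefix that matches a suffix of the vector.

    Overlap lengths are tried from longest to shortest. The read is
    returned unchanged if no overlap of at least min_overlap bases has
    at most max_mismatches mismatches.
    """
    longest = min(len(read), len(vector))
    for k in range(longest, min_overlap - 1, -1):
        mismatches = sum(1 for a, b in zip(vector[-k:], read[:k]) if a != b)
        if mismatches <= max_mismatches:
            return read[k:]
    return read


# Hypothetical vector sequence and a read that begins with its last 12 bases.
vector = "GGATCCTCTAGAGTCGACCTGCAGG"
read = "TCGACCTGCAGG" + "TTACGGATAACC"
print(trim_leading_vector(read, vector))
```

The min_overlap threshold guards against trimming real genomic sequence
that matches the vector over only a few bases by chance.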
2 p.m. Thursday December 8, 2005
Title: Gene Name Normalization
using Text Match with Automatically Extracted Synonym Dictionaries
By: Haw-ren Fang
Venue: Biomolecular Sciences
Building #296 Room 3118
Abstract:
Gene normalization, a relatively new and unexplored problem, has gradually
received attention in recent years. The procedure typically consists of
two stages: identifying gene mentions and normalization of gene names.
In addition, there is usually a pre-processed synonym dictionary and a
post-processing disambiguation stage.
Compared with identifying gene mentions, gene normalization is easier
because identification of textual boundaries of each mention is not
required. However, gene normalization requires that the actual gene be
detected and reported in its unique normal form. From this point of view,
it is harder than identifying gene mentions.
We have built a robot that can automatically extract human gene synonyms
from online databases to build our synonym dictionaries (300,000+
entries). In the first stage, a CRF tagger is used to automatically
annotate given abstracts or documents. For the second stage, we compiled
various string transformations that can be applied and chained in flexible
order, followed by exact string matching or approximate string matching.
Our system achieved 0.648 F-measure (0.597 precision and 0.709 recall).
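The second stage described above, chained string transformations followed
by exact matching, can be sketched as below, together with a check of the
reported F-measure. This is a minimal illustration and not the system
itself: the particular transformations, the tiny synonym dictionary, and
the identifiers ID_A and ID_B are all hypothetical, and approximate
matching is omitted.

```python
import re


def lowercase(s):
    return s.lower()


def strip_punctuation(s):
    """Replace anything that is not a letter, digit or space with a space."""
    return re.sub(r"[^A-Za-z0-9 ]", " ", s)


def drop_spaces(s):
    return s.replace(" ", "")


def normalize(name, transforms):
    """Apply the transformations in order; the chain is freely reorderable."""
    for transform in transforms:
        name = transform(name)
    return name


def build_index(synonyms, transforms):
    """Map each normalized synonym to its gene identifier."""
    return {normalize(name, transforms): gene_id for name, gene_id in synonyms.items()}


def lookup(mention, index, transforms):
    """Exact match of the normalized mention; None if nothing matches."""
    return index.get(normalize(mention, transforms))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical synonym dictionary with placeholder identifiers.
synonyms = {"BRCA1": "ID_A", "breast cancer 1": "ID_A", "TP53": "ID_B"}
chain = [lowercase, strip_punctuation, drop_spaces]
index = build_index(synonyms, chain)

print(lookup("Brca-1", index, chain))
print(round(f_measure(0.597, 0.709), 3))
```

With the reported precision of 0.597 and recall of 0.709, the harmonic
mean is 2 * 0.597 * 0.709 / (0.597 + 0.709), which rounds to 0.648 and
agrees with the F-measure stated above.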
Gene normalization has several potential applications, such as biomedical
information extraction, database curation, and further text mining. Our
first application is the relevance search and ranking in the Fable
project. Given a proper synonym dictionary, our normalization program
readily generalizes to other organisms and name normalization tasks
(e.g., malignancy, variation).
This is joint work with Peter S. White (mentor) et al. at the Children's
Hospital of Philadelphia.
11 a.m. Tuesday December 20, 2005
Title: Comparative and functional
genomic approaches to the analysis of gene function in three human
parasites
By: Najib M. El-Sayed, Ph.D. (TIGR)
Venue: HJ Patterson
Hall Room 2242
PS: This is a candidate talk.