2:00 p.m., Wednesday, April 14, 2010

Title: "Whole-Genome Sequence Analysis for Pathogen Detection and Diagnostics"

By: Adam Phillippy, CBCB

Venue: 3118 Biomolecular Sciences

Abstract: Pathogenic microbes, both natural and weaponized, pose significant dangers to human health and safety. To defend against these threats, it is essential to rapidly detect and characterize pathogens in any environmental or clinical medium with high accuracy. Now that the genome sequences of thousands of bacteria and viruses are known, it is possible to design biomolecular tests to rapidly detect and characterize pathogens based solely on their DNA. Possible applications are far-reaching and include real-time clinical diagnosis and biosurveillance. However, these tests require sophisticated computational design and analysis to operate effectively.

This dissertation presents novel computational methods for improving the accuracy of three modern diagnostic technologies: polymerase chain reaction (PCR), array comparative genomic hybridization (CGH), and whole-genome sequencing. For designing real-time PCR detection assays, an efficient search algorithm and data structure are presented for analyzing over 100 billion nucleotides of genomic DNA to identify the most distinguishing sequences of a pathogen. Laboratory validation shows that these "signature" sequences can be used to detect pathogens in complex samples and differentiate them from their non-pathogenic relatives. For CGH, pan-genome array design and analysis algorithms are presented for the characterization of microbial isolates. These methods are used to study multiple strains of the foodborne pathogen, Listeria monocytogenes, revealing new insights into the diversity and evolution of the species. Finally, multiple methods are presented for the validation of whole-genome sequence assemblies. These validated assemblies provide the ultimate diagnostic, decoding the entire DNA sequence of a genome with high confidence.

A Dissertation Defense for the degree of Ph.D. in Computer Science

2:00 p.m., Thursday, April 15, 2010

RECOMB 2010 Practice Talks

Venue: 3118 Biomolecular Sciences

Title (RECOMB 2010 Practice Talk): "Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs"

Authors: Barna Saha, Allison Hoch, Samir Khuller, Louiqa Raschid and Xiao-Ning Zhang

Speaker: Barna Saha, a third-year Computer Science graduate student working with Samir Khuller on algorithm design and analysis.

Abstract: We focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. We define a generalization of the densest subgraph problem by adding a distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider extensions in which not just one solution needs to be found, but all subgraphs of almost maximum density are to be listed as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance-restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs as we vary parameters, including the number of genes and the distance.
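
For readers unfamiliar with the underlying problem: the density of a subgraph is commonly defined as its number of edges divided by its number of nodes. The sketch below is not the authors' method (the abstract describes exact polynomial-time algorithms and an added distance restriction, neither of which is reproduced here). It is the well-known greedy peeling heuristic for the unrestricted problem, which repeatedly removes a minimum-degree node and keeps the densest intermediate subgraph. The function name and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Greedy peeling heuristic for the (unrestricted) densest subgraph.

    Illustrative only: not the exact algorithm from the talk.
    `edges` is an iterable of (u, v) pairs of an undirected graph.
    Returns (set_of_nodes, density), where density = edges / nodes.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    nodes = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes = set(nodes)
    best_density = num_edges / len(nodes) if nodes else 0.0

    while len(nodes) > 1:
        # Pick a minimum-degree node (ties broken by name for determinism).
        u = min(nodes, key=lambda n: (len(adj[n]), str(n)))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]

        density = num_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density

    return best_nodes, best_density


if __name__ == "__main__":
    # A 4-clique (nodes 1-4) with one pendant node attached.
    example = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
    print(greedy_densest_subgraph(example))
```

On the toy graph in the usage example, peeling removes the pendant node first and the 4-clique is retained as the densest subgraph found.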
-------------
Title (RECOMB 2010 Practice Talk): "Extracting between-pathway models from E-MAP interactions using expected graph compression"

Speaker: David Kelley

Abstract: Genetic interactions (such as synthetic lethal interactions) have become quantifiable on a large scale using the epistatic miniarray profile (E-MAP) method. An E-MAP allows the construction of a large, weighted network of both aggravating and alleviating genetic interactions between genes. By clustering genes into modules and establishing relationships between those modules, we can discover compensatory pathways. We introduce a general framework for applying greedy clustering heuristics to probabilistic graphs. We use this framework to apply a graph clustering method called graph summarization to an E-MAP that targets yeast chromosome biology. This results in a new method for clustering E-MAP data that we call Expected Graph Compression (EGC). We validate modules and compensatory pathways using enriched Gene Ontology annotations and a novel method based on correlated gene expression from a comprehensive collection of expression experiments. EGC finds a number of modules that are not found by any of the previous methods to cluster E-MAP data. Further, EGC uncovers core submodules contained within several previously found modules, suggesting that EGC can reveal the finer structure of E-MAP networks.

1:00 p.m., Friday, April 16, 2010

Title: "High Performance Computing for DNA Sequence Alignment and Assembly"

By: Michael C. Schatz, CBCB

Venue: 3118 Biomolecular Sciences

Abstract: We are at the dawn of a new era in computational biology. DNA sequencing projects that required years of effort and hundreds of millions of dollars of equipment just a few years ago can now be performed quickly and cheaply by individual labs. This dramatic shift is expanding the scale and scope of sequencing to previously unimaginable limits, and will ultimately lead to new discoveries about our basic biology, the diversity of life, and personalized medicine. However, these ambitious goals can only be realized if we can develop new computational methods that can effectively analyze the overwhelming volumes of data generated.

In my presentation, I'll describe my research developing efficient methods for analyzing large biological datasets, including the use of highly parallel commodity graphics processing units produced by nVidia and the parallel computing framework MapReduce developed by Google. My dissertation research demonstrates how these technologies can be applied to the critical tasks of large-scale alignment and assembly, enabling genotyping and de novo assembly of whole genomes from billions of short reads. Coupled with inexpensive cloud computing, these programs can quickly, cheaply, and accurately analyze tremendous biological datasets and have the potential to make otherwise infeasible studies practical.
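
As a rough illustration of the MapReduce programming model mentioned above, the following single-process sketch counts k-mers (length-k substrings) across a set of short reads, a common building block in sequence analysis. It is not taken from the speaker's programs: the function names, the choice of k-mer counting as the example task, and the in-memory "shuffle" are all assumptions made for illustration. A real MapReduce framework would distribute the same three phases across many machines.

```python
from collections import defaultdict


def map_read(read, k):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = [pair for read in reads for pair in map_read(read, k)]
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], 3))
```

Because each read is mapped independently and each k-mer is reduced independently, both phases can in principle be run in parallel, which is the property that makes the model attractive for very large read sets.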

A Dissertation Defense for the degree of Ph.D. in Computer Science

2:00 p.m., Thursday, April 29, 2010

Title: "Structural Assembly of Molecular Complexes Based on Residual Dipolar Couplings"

Speaker: Konstantin Berlin, a finishing PhD student in Computer Science

Venue: 3118 Biomolecular Sciences

Abstract: We present PATI, a computationally efficient and accurate ab initio predictor of the residual dipolar couplings (RDCs) from a protein structure. Building upon PATI, we develop and evaluate a rigid-body molecular docking method, called PATIDOCK, that relies solely on the three-dimensional structure of the individual components and the experimentally derived RDCs for the complex, and show that it is possible to accurately assemble a protein-protein complex by utilizing PATI to guide the docking method. The proposed docking method is robust against experimental errors in the RDCs and computationally efficient. We analyze the accuracy and efficiency of this method using experimental or synthetic RDC data for several proteins, as well as synthetic data for a large variety of protein-protein complexes. We also test our method on two protein systems for which the structure of the complex and steric-alignment data are available (Lys48-linked diubiquitin and a complex of ubiquitin and a ubiquitin-associated domain) and analyze the effect of flexible unstructured tails on the outcome of docking. The results demonstrate that it is fundamentally possible to assemble a protein-protein complex based solely on experimental RDC data and the prediction of the alignment tensor from three-dimensional structures. Additionally, we show a method for combining RDCs with other experimental data, such as ambiguous constraints from interface mapping, to further improve structure characterization of the protein complexes.