CBCB Seminar Series

Summer 2004

4 p.m. Thursday June 17, 2004

Title: Organizational meeting & research projects presentation
Venue: Computer Science Instructional Building Room 3120.
Abstract: Discussion of the schedule for Summer 2004.

4 p.m. Thursday June 24, 2004

Title: SNPs Problems, Complexity and Algorithms
Speaker: Xue Wu
Venue: Computer Science Instructional Building Room 2120.
Abstract:

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications, including medical diagnostics and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to computational SNPs validation based on genome assembly of diploid organisms. In diploid genomes, there are two copies of each chromosome. A description of the SNPs sequence information from one of the two chromosomes is called a SNPs haplotype. The basic problem addressed here is Haplotyping: given a set of SNPs prospects inferred from the assembly alignment of a genomic region of a chromosome, find the maximally consistent pair of SNPs haplotypes by removing data "errors" related to DNA sequencing errors, repeats, and paralogous recruitment.
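
As a rough illustration of the haplotyping problem in this abstract, the sketch below encodes each assembly fragment as a string over the alleles 0/1, with '-' marking SNP sites the fragment does not cover. It is not the algorithm from the paper or the talk: it tests consistency by 2-colouring a fragment conflict graph and then finds a smallest set of fragments to discard by exhaustive search, which is practical only for tiny inputs. The encoding and all function names are assumptions made for illustration.

```python
from itertools import combinations

def conflict(f, g):
    """True if two fragments cover a common SNP site with different alleles."""
    return any(a != '-' and b != '-' and a != b for a, b in zip(f, g))

def two_haplotype_split(fragments):
    """Try to assign each fragment to haplotype 0 or 1 so that conflicting
    fragments land on different haplotypes (2-colouring of the conflict graph).
    Returns the list of assignments, or None if no such assignment exists."""
    n = len(fragments)
    colour = [None] * n
    for start in range(n):
        if colour[start] is not None:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if j != i and conflict(fragments[i], fragments[j]):
                    if colour[j] is None:
                        colour[j] = 1 - colour[i]
                        stack.append(j)
                    elif colour[j] == colour[i]:
                        return None  # odd cycle: fragments cannot be split
    return colour

def min_fragment_removal(fragments):
    """Exhaustively find a smallest set of fragment indices whose removal
    leaves fragments that fit a pair of haplotypes (exponential time)."""
    n = len(fragments)
    for k in range(n + 1):
        for removed in combinations(range(n), k):
            kept = [f for i, f in enumerate(fragments) if i not in removed]
            if two_haplotype_split(kept) is not None:
                return set(removed)

# Hypothetical toy data: five fragments over four SNP sites.
fragments = ["00--", "11--", "-00-", "-11-", "0-1-"]
print(min_fragment_removal(fragments))
```

With binary alleles, fragments that agree pairwise on every shared site can be merged into one haplotype, so a successful 2-colouring corresponds to a consistent haplotype pair for the kept fragments.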

References:

Background knowledge of SNPs: NIH introduction to SNPs

Recent research papers:

Vineet Bafna, Bjarni V. Halldorsson, Russell Schwartz, Andrew G. Clark, Sorin Istrail. "Haplotypes and informative SNP selection algorithms: don't block out information." RECOMB 2003: 19-27
Eleazar Eskin, Eran Halperin, Richard M. Karp. "Large Scale Reconstruction of Haplotypes from Genotype Data." RECOMB 2003
Lei Li, Jong Hyun Kim, Michael S. Waterman. "Haplotype reconstruction from SNP alignment." RECOMB 2003: 207-216

Survey papers:

Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Jing Li, "The Haplotyping Problem: An Overview of Computational Models and Solutions." Journal of Computer Science and Technology, 18(6):675-688, 2003

4 p.m. Thursday July 1, 2004

Title: Discovering molecular pathways from protein interaction and gene expression data
Speaker: Woei-Jyh (Adam) Lee
Venue: Computer Science Instructional Building Room 2120.
Abstract:

In this paper, we describe an approach for identifying pathways from gene expression and protein interaction data. Our approach is based on the assumption that many pathways exhibit two properties: their genes exhibit a similar gene expression profile, and the protein products of the genes often interact. Our approach is based on a unified probabilistic model, which is learned from the data using the EM algorithm. We present results on two Saccharomyces cerevisiae gene expression data sets, combined with a binary protein interaction data set. Our results show that our approach is much more successful than other approaches at discovering both coherent functional groups and entire protein complexes.

References:

Segal E, Wang H, Koller D. "Discovering molecular pathways from protein interaction and gene expression data." Bioinformatics. 2003 Jul;19 Suppl 1:I264-I272.
Segal E, Yelensky R, Koller D. "Genome-wide discovery of transcriptional modules from DNA sequence and gene expression." Bioinformatics. 2003 Jul;19 Suppl 1:I273-I282.

4 p.m. Thursday July 8, 2004

Title: Exploring Deterministic Reconstruction of Repetitive DNA for Use in Genome Assembly
Speaker: Suzanne Sindi
Venue: Computer Science Instructional Building Room 2120.
Abstract:

Whole Genome Shotgun Assembly is a method for determining the sequence of a genome. The presence of highly repetitive DNA complicates this method and can impact the accuracy of the final assembled sequence. Using an approach from symbolic dynamical systems, we present a way to represent highly repetitive sequences of DNA. We discuss potential applications of these representations to Whole Genome Shotgun Assembly.

4 p.m. Thursday July 22, 2004

Title: Extracting synonymous gene and protein terms from biological literature
Speaker: Rezarta Islamaj
Venue: Computer Science Instructional Building Room 2120.
Abstract:

Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. We have explored four complementary approaches for extracting gene and protein synonyms from text, namely the unsupervised, partially supervised, and supervised machine-learning techniques, as well as the manual knowledge-based approach. We report results of a large scale evaluation of these alternatives over an archive of biological journal articles. Our evaluation shows that our extraction techniques could be a valuable supplement to resources such as SWISSPROT, as our systems were able to capture gene and protein synonyms not listed in the SWISSPROT database.

References:

Yu H, Agichtein E. "Extracting synonymous gene and protein terms from biological literature." Bioinformatics. 2003 Jul;19 Suppl 1:I340-I349.

4 p.m. Thursday July 29, 2004

Title: Splice site prediction: the general problem, the proposed methods, their characteristics and differences
Speaker: Rezarta Islamaj
Venue: Computer Science Instructional Building Room 2120.
Abstract:

Splice sites have been modeled by a variety of methods over the past twenty years. Still, the search for improvement continues, as splice site detection is a key ingredient for accurate gene finding. I will review several methods widely used in the literature, describing their characteristics and their differences, and illustrate them with several experiments and test results. The results show that we reach the best performance by combining boosted decision trees as a modeling framework with information from a larger sequence window. However, the splice site prediction problem is far from solved, and I am continuing my experiments to examine possible improvements.

4 p.m. Thursday August 12, 2004

Title: Indexing techniques in protein structural comparison
Speaker: Elena Zotenko
Venue: Computer Science Instructional Building Room 2120.
Abstract:

Given a query protein, the ability to identify all structurally similar proteins is of primary importance in the study of protein evolution and function. As the number of protein structures grows, there is a need to develop screening methods that will perform quick yet accurate filtering of the database before a more computationally expensive protein structure comparison method is applied.

The long-term objective of my research is to develop such a screening method for VAST (Vector Alignment Search Tool), a protein structure comparison method used at NCBI.

In this talk I am going to give an overview of protein structure and protein structure comparison methods. Then I will concentrate on two approaches to indexing protein structures. Finally, I will talk about my research over the past several months: investigating the possibility of a screening method that borrows from these two approaches.
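
The screening idea in this abstract follows a general filter-then-refine pattern, which can be sketched as below. This is only a generic outline, not VAST or the method presented in the talk: it assumes each structure has already been reduced to a fixed-length numeric descriptor vector, and the names `screen`, `search`, `radius`, and `expensive_compare` are hypothetical.

```python
import math

def screen(query_vec, database, radius):
    """Cheap filter: keep ids whose descriptor vector lies within `radius`
    (Euclidean distance) of the query descriptor."""
    return [pid for pid, vec in database.items()
            if math.dist(query_vec, vec) <= radius]

def search(query_vec, database, radius, expensive_compare):
    """Run the costly comparison only on structures that pass the filter.
    `expensive_compare` is a caller-supplied function taking a structure id."""
    return {pid: expensive_compare(pid)
            for pid in screen(query_vec, database, radius)}

# Hypothetical two-dimensional descriptors for three structures.
database = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (10.0, 10.0)}
print(screen((0.0, 0.0), database, radius=5.0))
```

The trade-off sits in `radius`: a tight threshold saves expensive comparisons but risks discarding true structural neighbours, which is why the abstract asks for filtering that is both quick and accurate.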

References:

L. Holm, C. Sander, "3-D lookup: fast protein structure database searches at 90% reliability", Proc Int Conf Intell Syst Mol Biol, 1995, 3:179-87
P. Rogen, B. Fain, "Automatic classification of protein structure by using Gauss integrals", PNAS, 2003, 100:119-124
P. Rogen, H. Bohr, "A new family of global protein shape descriptors", Mathematical Biosciences, 2003, 182:167-181

4 p.m. Thursday August 26, 2004

Title: Allosteric determinants in guanine nucleotide-binding
Speaker: Nozomi Sakakibara
Venue: Computer Science Instructional Building Room 2120.
Abstract:

For mapping energetic interactions in proteins, a technique was developed that uses evolutionary data for a protein family to measure statistical interactions between amino acid positions. For the PDZ domain family, this analysis predicted a set of energetically coupled positions for a binding site residue that includes unexpected long-range interactions. Mutational studies confirm these predictions, demonstrating that the statistical energy function is a good indicator of thermodynamic coupling in proteins. Sets of interacting residues form connected pathways through the protein fold that may be the basis for efficient energy conduction within proteins.

Members of the G protein superfamily contain nucleotide-dependent switches that dictate the specificity of their interactions with binding partners. Using a sequence-based method termed statistical coupling analysis (SCA), we have attempted to identify the allosteric core of these proteins, the network of amino acid residues that couples the domains responsible for nucleotide binding and protein-protein interactions. One-third of the 38 residues identified by SCA were mutated in the G protein Gsα, and the interactions of guanosine 5'-3-O-(thio)triphosphate- and GDP-bound mutant proteins were tested with both adenylyl cyclase (preferential binding to GTP-Gsα) and the G protein βγ subunit complex (preferential binding to GDP-Gsα). A two-state allosteric model predicts that mutation of residues that control the equilibrium between GDP- and GTP-bound conformations of the protein will cause the ratio of affinities of these species for adenylyl cyclase and Gβγ to vary in a reciprocal fashion. Observed results were consistent with this prediction. The network of residues identified by the SCA appears to comprise a core allosteric mechanism conferring nucleotide-dependent switching; the specific features of different G protein family members are built on this core.
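
The notion of a statistical interaction between two alignment positions can be illustrated with a much simpler quantity than the one used in the cited papers. The sketch below computes the mutual information between two columns of a multiple sequence alignment; this is a stand-in chosen for brevity, not the statistical coupling energy of SCA. The toy alignment and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def column(alignment, i):
    """Residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j,
    estimated from raw residue frequencies without any correction."""
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical alignment: columns 0 and 1 co-vary, column 2 is fully conserved.
alignment = ["AKL", "AKL", "GEL", "GEL"]
print(mutual_information(alignment, 0, 1))
print(mutual_information(alignment, 0, 2))
```

Positions that co-vary across the family score high, while a fully conserved position carries no coupling signal under this measure. Real alignments need far more sequences, plus corrections for sampling and phylogenetic bias, than this toy example suggests.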

References:

Steve W. Lockless, Rama Ranganathan, "Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families", Science, Vol 286, Issue 5438, 295-299, 8 October 1999
Mark E. Hatley, Steve W. Lockless, Scott K. Gibson, Alfred G. Gilman and Rama Ranganathan, "Allosteric determinants in guanine nucleotide-binding", PNAS, Vol. 100, No. 24, 14445-14450, 25 November 2003