Metagenomics (studying uncultured bugs)
Metagenomics is a newly emerging field that aims to
study environmental samples (bacterial or viral) directly instead of
trying to culture constituents of the sample in a lab. The advantages of
metagenomic studies are twofold: first, the notoriously difficult step of
producing pure cultures can be bypassed; second, it enables microbiologists to
understand the entire ecosystem of a sample rather than just study
individual bacterial or viral constituents. One of the popular
approaches for metagenomic studies is to isolate and sequence a
universal marker gene such as 16S rRNA (which is typically conserved
within a bacterial species but differs between species) to quantify the
composition of a sample. Since the mapping from 16S sequence similarity to species
classes is unknown (in any case the notion of species at the bacterial
level is quite subjective), researchers typically rely on various
clustering algorithms (e.g., single linkage clustering)
and ad hoc thresholds to produce a coarse approximation to the notion
of a species. In a recent work (White et al., In Preparation), we show
that these approximations can indeed be very poor and can lead to
incorrect estimates for microbial diversity. In principle,
semi-supervised clustering approaches that exploit information in
databases of known 16S sequences can alleviate some of these problems.
In another recent work (Saket et al., 2009), we propose a new and general
semi-supervised clustering algorithm and show that it can indeed
approximate the notion of species more accurately, even with sparsely
labelled input.
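
The threshold-based clustering described above can be illustrated with a minimal sketch. It is not the method of either cited work. It assumes the reads are already aligned to equal length, uses the fraction of mismatched positions as the distance, and runs on made-up toy sequences. The names `mismatch_fraction` and `single_linkage_clusters` are hypothetical. The point is only that the number of "species" recovered depends on the chosen cutoff, and that single linkage can chain together sequences that are not directly similar.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clusters(seqs: list[str], threshold: float) -> list[list[int]]:
    """Group sequence indices so that two sequences share a cluster whenever a
    chain of pairwise distances <= threshold connects them (single linkage)."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair that is closer than the threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if mismatch_fraction(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy aligned "reads" (invented for illustration, 10 bases each).
reads = [
    "ACGTACGTAC",
    "ACGTACGTAA",  # 1 mismatch from read 0
    "ACGTACGGAA",  # 1 mismatch from read 1, 2 mismatches from read 0
    "TTTTGGGGCC",  # far from all others
]

loose = single_linkage_clusters(reads, threshold=0.15)
strict = single_linkage_clusters(reads, threshold=0.05)
print(loose)
print(strict)
```

With the looser cutoff, reads 0 and 2 end up together even though their direct distance (0.2) exceeds the threshold, because read 1 links them. With the stricter cutoff every read becomes its own cluster. This sensitivity to the cutoff is what makes the resulting diversity estimates fragile.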
Metagenomic samples are increasingly being used to investigate the
correlation between the
abundance of various species and phenotypes of the sample. For example,
a recent study (Ley et al., 2006) reported that certain divisions of
bacteria were significantly more or less abundant in obese humans than
in lean humans. Such analyses require an efficient and sensitive
statistical methodology, and we recently proposed a tool for this
(White et al., 2009) that handles sparsely sampled features more
robustly and can also be applied to SAGE data.
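
As a rough illustration of the kind of comparison involved, the sketch below tests whether one taxon's relative abundance differs between two groups of samples using an exact permutation test on the difference of group means. This is a generic textbook technique chosen for illustration, not the implementation of the tool cited above. The function name `exact_permutation_pvalue` and the abundance values are invented.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Enumerates every way of splitting the pooled samples into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)

    hits = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
        splits += 1
    return hits / splits


# Hypothetical relative abundances of one taxon in each sample.
obese = [0.40, 0.45, 0.50, 0.55]
lean = [0.10, 0.15, 0.20, 0.25]

p = exact_permutation_pvalue(obese, lean)
print(f"p-value: {p:.4f}")
```

Exhaustive enumeration is only practical for small groups. With more samples, one would typically draw random permutations instead, and for very sparse count data a test designed for small counts may be more appropriate.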