Motif Finding
DNA and protein sequences often contain
short motifs that are conserved within and across organisms
due to their functional importance. Searching for these conserved
motifs (often called
motif
finding) therefore serves as a useful approach to
identifying novel functional sequence elements and several
computational tools have been designed for this purpose (such as
MEME,
GLAM and
GIMSAN).
In addition to the computational difficulty of searching for small
patterns in long, largely random strings, motif finding programs also
face the challenge of determining if a putative motif is statistically
significant (i.e. unlikely to be present by chance in the dataset).
Under a commonly used model, computing the significance of motifs
can be surprisingly hard -- in our early work (
Nagarajan et al., 2005) we showed that approximations used in two popular motif finding tools can be quite off. In response, we proposed a new,
discrete Fourier trasform
based algorithm that can accurately and efficiently compute the
significance of motifs (ungapped multiple alignments). The novel
technique proposed in this work is suprisingly general and we have
since explored its use in the context of designing efficient exact
algorithms for computing p-values for other common statistical tests as
well (see also:
Computational Statistics).
In subsequent work, we also explored other models to compute the significance of motifs (
Ng et al., 2006) as well as new scores to optimize for, while searching for the "best" motif (
Ng et al., 2006,
Nagarajan et al., 2006).
Interestingly, a measure of statistical significance of a motif, the
"E-value", can serve as an ideal score for searching for motifs, and
using some clever
memoization, this can be done with a small computational overhead (
Nagarajan et al., 2006). The algorithms described in these works are available as library functions and programs in the
FAST package (
Nagarajan and Keich, 2008).