Computational Statistics
Statistical tests play an integral role
in Bioinformatics in helping to distinguish biologically meaningful
signals from random artifacts of the experimental data. For example,
one of the commonly used tests is a log-likelihood ratio test to
compare a column of a multiple alignment against a background
distribution of nucleotide or amino acid frequencies. The applications
for this test have ranged from motif-finding to protein
structure prediction and have typically relied on a χ
2 approximation
which be can very inaccurate in the tail of the distribution. To
address this, we designed a numerically robust exact algorithm for this
problem (bagFFT) that can accurately estimate very small probabilities
while being asymptotically more efficient than existing algorithms (
Keich and Nagarajan,
2006). The bagFFT algorithm is based on a new technique to
compensate for numerical errors in FFT-based calculations with a wide
variety of possible applications in Computational Statistics. In recent
work we have also explored extensions of this approach to
compute the statistical significance of multiple alignments (
Nagarajan et
al., 2005)
as well as to compute the significance of the
Mann-Whitney test (Nagarajan and Keich, 2009). As part of these studies
we succeeded in established the failings of several popular
approximations for these tests, such as a normal or a χ
2 approximation, especially
in computational applications where large hypothesis spaces lead to the
need to accurately estimate small probabilities. In addition, we also
demonstrate that numerical instability can plague some of the
published algorithms for these tests (e.g. Baglivo's algorithm,
Harding's algorithm).