Patro Works to Improve Search Functionality for Large Repositories of Sequencing Data

Sep 02, 2020

Genomic sequencing data can often shed light on a wide array of scientific problems—from treating patients with heart disease and cancer to understanding how certain pathogens can affect plants and animals.

Public repositories of genomic data are becoming more commonplace and are growing at an exponential rate. The National Center for Biotechnology Information (NCBI), for example, runs the Sequence Read Archive (SRA), a repository that holds raw scientific data for a vast number of scientific experiments conducted using high-throughput sequencing data.

While repositories like the SRA are a boon to researchers, the ability to quickly search these large public databases for a specific sequence of interest is limited.

To that end, Rob Patro (in photo), an associate professor of computer science, is leading an effort to develop fundamentally improved data structures and algorithms to enable large-scale sequence search across public repositories of genomic data.

Patro is collaborating on this work with Mike Ferdman and Michael Bender, both at Stony Brook University, and Rob Johnson, a senior staff researcher at VMWare Research group.

Patro says that making these types of databases easier to search can help researchers get to the bottom of unique problems.

“Imagine a scientist who discovers what they imagine to be a novel environmental pathogen that, perhaps, has a negative effect on plant or animal health,” he says. “It may be very useful to know what other previously-collected data might have contained this pathogen, but its presence was not reported because the scientists working on the previous studies were focused on a different question and did not take note of this new pathogen. Combing through the repository of available data might help us learn more about this pathogen.”

Patro says that the technical challenge in addressing such a problem is simply the scale of the data. The SRA, for example, currently contains about 16 petabytes of data.

Sequencing data is quite different from other domains where similar challenges have been tackled, he adds.

“DNA and RNA do not have a well-defined vocabulary of words or ‘tokens’ into which the data can be broken down,” he says. “So it’s very different from a typical internet search query.”

Patro’s research group has recently been working on data structures and algorithms that let them scale sequence indices to increasingly large sets of experiments, while enabling very fast (approaching interactive) queries.

A recent paper they authored introduced a fundamentally better scheme for storing and retrieving a key component of this information than their previous attempts. (Note: the lead author on the paper was Fatemeh Almodaresi, who just completed her doctoral degree at UMD with Patro as her adviser.)

The researchers discovered that by examining the patterns of occurrence of small pieces of DNA in the repository—and compressing similar, rather than just identical, patterns in a relative manner—they could reduce the size of this component of the data structure by over an order of magnitude, without slowing the query speed, and, in some cases, speeding it up.

“A main technical challenge here is how one can efficiently search for ‘similar’ patterns when each pattern consists of a point in a very high dimensional space,” Patro says. “While initial attempts to solve this problem using a technique known as locality-sensitive hashing seemed not to work well, we were able to exploit the inherent structure of the data, and take advantage of a widely-used structure in genomics—known as the De Bruijn graph—to develop an efficient search structure to find ‘similar’ patterns efficiently.”

Patro’s group continues to work on this problem, providing important improvements to the indexing methodology. Additionally, they have redesigned the largest part of the index, which was previously required to be in the Random Access Memory (RAM) of the computer to be intelligently partitioned. This allows them to load only small parts of the index at a time, vastly reducing the amount of memory required to both build and query the index.

Finally, they are developing a distributed variant of this index that can be partitioned across a network of machines, allowing the query procedure to be scaled up simply by adding more machines to the network and improving robustness.

—Story by Melissa Brachfeld

***

Note: Patro was recently promoted to associate professor. In addition to his tenure appointment, he has an appointment in the University of Maryland Institute for Advanced Computer Studies, where he is a member of the Center for Bioinformatics and Computational Biology.