The Power of Protein Interaction Networks for Associating Genes with Diseases
Understanding the association between genetic diseases
and their causal genes is an important problem concerning human health. With
the recent influx of high-throughput data describing interactions between
gene-products, scientists have been provided a new avenue through which these
associations can be inferred. Despite the recent interest in this problem,
however, there is little understanding of the benefits and drawbacks underlying
the proposed techniques.
We assess the utility of physical interactions for
determining disease-gene associations by examining the performance of seven
recently developed computational methods (plus several of their variants).
We find that random-walk-based approaches individually outperform
clustering- and neighborhood-based approaches, although most methods make
predictions not made by any other method. We show how
combining these methods into a consensus method yields Pareto
optimal performance. We also quantify how a diffuse topological distribution of
proteins negatively affects the quality of predictions and are thus able to
identify diseases especially amenable to network-based predictions and others
for which additional information sources are absolutely required.
The disease-gene associations output by each tested algorithm are available for download.
Each file is named according to the following format: [METHOD]-[NETWORK]-[LOCUS]
For example, oti1-HPRD-LOC includes all predictions made by the oti1 algorithm on the HPRD interaction network using linkage intervals.
- The METHOD corresponds to the algorithm (and variant) used: neighborhood, oti1 (+ our variants oti2 and oti3, each embedded in the 'oti' file), gs1 (+ our variants gs2, gs3, and gs-all, each embedded in the 'gs' file), mcl, vicut, rw, or prop.
- The NETWORK corresponds to the protein interaction network used: HPRD (The Human Protein Reference Database) or OPHID (The Online Predicted Human Interaction Database).
- The LOCUS corresponds to whether linkage intervals were taken into account when making a prediction: LOC (assuming linkage intervals are known) or NOLOC (unknown).
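The naming scheme above can be split mechanically. The following is a minimal Python sketch, not code distributed with the paper; the names `PredictionFile` and `parse_prediction_name` are hypothetical, and it assumes that file names carry no extension and that only the NETWORK and LOCUS values listed above occur.

```python
from typing import NamedTuple

# Allowed values, taken from the naming scheme described above.
NETWORKS = {"HPRD", "OPHID"}
LOCI = {"LOC", "NOLOC"}


class PredictionFile(NamedTuple):
    """Hypothetical container for the three parts of a file name."""
    method: str
    network: str
    locus: str


def parse_prediction_name(name: str) -> PredictionFile:
    """Split a '[METHOD]-[NETWORK]-[LOCUS]' name into its three parts."""
    # Split from the right so that a method label containing a hyphen
    # (such as 'gs-all') stays in one piece.
    parts = name.rsplit("-", 2)
    if len(parts) != 3 or not parts[0]:
        raise ValueError(f"expected METHOD-NETWORK-LOCUS, got {name!r}")
    method, network, locus = parts
    if network not in NETWORKS:
        raise ValueError(f"unknown network {network!r} in {name!r}")
    if locus not in LOCI:
        raise ValueError(f"unknown locus {locus!r} in {name!r}")
    return PredictionFile(method, network, locus)
```

For example, `parse_prediction_name("oti1-HPRD-LOC")` yields the method `oti1`, the network `HPRD`, and the locus `LOC`.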
We included only the higher-confidence NOLOC predictions, those with
a score above a threshold. The full set of predictions is available upon
There are 5 tab-delimited columns in each file corresponding to:
[METHOD] [GENE] [DISEASE] [GENE] [CORRECT/INCORRECT] [A(p,d)]
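A prediction file in the tab-delimited layout described above can be read with Python's standard `csv` module. The sketch below is illustrative only; `read_rows` and `tally_outcomes` are hypothetical helper names. Because the exact column order is not assumed here, it looks for a field whose value is exactly `CORRECT` or `INCORRECT` anywhere in the row, and that literal spelling is itself an assumption about the file contents.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, List


def read_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the tab-separated fields of each non-empty line."""
    # QUOTE_NONE keeps any quote characters in names as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # csv.reader returns [] for a blank line
            yield row


def tally_outcomes(rows: Iterable[List[str]]) -> Counter:
    """Count rows flagged CORRECT or INCORRECT.

    Assumption: the flag appears as one whole field with exactly that
    spelling; rows without such a field are ignored.
    """
    counts: Counter = Counter()
    for row in rows:
        for field in row:
            label = field.strip()
            if label in ("CORRECT", "INCORRECT"):
                counts[label] += 1
                break
    return counts
```

A typical call would open one downloaded file with `open(path, newline="")` and pass the file object to `read_rows`, then hand the result to `tally_outcomes`.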
Download the HPRD and OPHID PPI networks, and the Gene-Disease OMIM associations, used in our paper.
Last modified: October 14, 2009