VI-Cut: Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information
Abstract
Hierarchical clustering is a popular method for grouping together similar
elements based on a distance measure between them. In many cases, annotations
for some elements are known beforehand, which can aid the clustering process.
We present a novel approach for decomposing a hierarchical clustering into the
clusters that optimally match a set of known annotations, as measured by the
variation of information metric. Our approach is general and does not require
the user to specify the desired number of clusters. We apply it to two
biological domains: finding protein complexes within protein interaction
networks and identifying species within metagenomic DNA samples. For these two
applications, we test the quality of our clusters by using them to predict
complex and species membership, respectively. We find that our approach
generally outperforms the commonly used heuristic methods.
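
As an illustration of the metric named in the abstract, the sketch below computes the variation of information between two clusterings of the same elements, using the standard definition VI(A, B) = H(A|B) + H(B|A). It is not the VI-Cut algorithm and is not taken from the released code. The function name, the flat label-list input format, and the choice of natural logarithm are all assumptions made for this example.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information between two clusterings of the same elements.

    labels_a[i] and labels_b[i] are the cluster labels that each clustering
    assigns to element i. The input format and the natural-log base are
    assumptions of this sketch, not details taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same elements")
    n = len(labels_a)
    if n == 0:
        return 0.0

    size_a = Counter(labels_a)               # cluster sizes in clustering A
    size_b = Counter(labels_b)               # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise overlaps

    vi = 0.0
    for (a, b), n_ab in joint.items():
        r = n_ab / n  # fraction of elements in cluster a of A and cluster b of B
        # -r * [log(r / p_a) + log(r / q_b)], where p_a = size_a[a] / n
        # and q_b = size_b[b] / n, so the factors of n cancel.
        vi -= r * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi


if __name__ == "__main__":
    annotations = ["c1", "c1", "c2", "c2"]  # hypothetical known annotations
    candidate = ["x", "x", "x", "y"]        # hypothetical candidate clustering
    print(variation_of_information(annotations, candidate))
```

A value of zero means the two clusterings are identical up to relabeling, and larger values mean greater disagreement. A candidate set of clusters cut from the tree would be scored against the known annotations in this way, with lower scores indicating a closer match.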

Figure: (A) Example network clustering. (B) VI-Cut improves the clustering by incorporating node annotations.
In Proceedings of RECOMB 2009, volume 5541, pages 400-417.
http://dx.doi.org/10.1007/978-3-642-02008-7_29
Download the paper.
Code
Download zip file. See README for instructions. Email questions to: saket@cs.umd.edu.
Last modified: February 9, 2010