[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [igraph] Compare clusterings (partitions of a set)

From: Stijn van Dongen
Subject: Re: [igraph] Compare clusterings (partitions of a set)
Date: Mon, 26 Jul 2010 00:16:19 +0100
User-agent: Mutt/

      Hi all,

On Sat, Jul 24, 2010 at 09:49:08AM -0700, Richard Bowman wrote:
> Hi Albert,
> Tamas and Stijn are correct: it is definitely not a trivial problem. I
> did this in my dissertation with the "clue" cluster ensemble R package
> and the cl_agreement(...) and cl_dissimilarity(...) functions, which
> provide a large range of choices for comparing two partitions of a
> set, which is what the membership vectors are. I chose the Jaccard and
> corrected Rand similarities, but most methods give very similar
> results, qualitatively. Review the literature referenced in the manual
> for information about what the different measures mean, because there
> are many ways to compare. I attached some code for this below.
> Code:
> library("clue")
> x <- as.hard.partition(c(1,1,1,1,2,2,2,2))
> y <- as.hard.partition(c(3,3,3,3,5,5,5,5))
> z <- as.hard.partition(c(1,2,3,4,5,6,7,8))
> cl_agreement(x,y,method = cRand)
> cl_agreement(x,z,method = cRand)
> cl_agreement(y,z,method = cRand)

This is something I am quite interested in. There are two metric dissimilarities
(also called metrics or distances) between clusterings that I think are very
useful.  One is the 'variation of information' measure by Meila, also provided
in cl_dissimilarity.

The other is the lesser known 'split/join' distance by myself, not provided by
cl_dissimilarity. What is interesting in the current context is that this
split/join distance has a very basic interpretation as the number of nodes that
need to be moved to obtain one clustering from the other.  The split/join
distance has been analysed as part of a cohort of measures by Meila in one or
two of her papers. See e.g.

   [1] Marina  Meila.  Comparing Clusterings - An Axiomatic View.  In 
   of the 22nd International Conference on Machine Learning, Bonn, Germany, 

My take (largely aligned with Meila's I believe) is this:

-  Distances between clusterings are preferable to indices and similarities. It
   is important that the triangle inequality is satsified. I have never really
   understood measures that yield zero under some null model. The set of
   clusterings forms a beautiful lattice structure, and informative metric
   distances exist that are aligned with that lattice (see [1]).  Basically,
   this leaves the Mirkin distance, the Variation of Information distance, and
   the split/join distance (again [1]).

-  Measures and distances based on counting pairs are hugely affected by the 
sizes of
   the clusters involved. This makes it very hard to use them to validate and 
   uniformly across different granularity levels, and leads to counterintuitive 
   in extreme cases. This crosses out the Mirkin distance.

-  The variation of information measure is more informative than the split/join
   distance, as it takes into account the full information present in the
   contingency table.  In comparison, the split/join distance only takes into
   account for each cluster C in the first clustering the best matching cluster
   in the second clustering (and vice versa), disregarding the presence and
   possible fragmentation of other matches.

-  The split/join distance is easily interpretable as the number of
   nodes that need changing to obtain one clustering from the other, and
   fragmentation is in practice uncommon.

In conclusion, I usually opt for either the variation of information distance
or the split/join distance.


P.S. the split/join distance was introduced in this technical report:

   [2] Stijn  van  Dongen.  Performance  criteria for graph clustering and 
Markov cluster experiments. Technical
   Report INS-R0012, National Research Institute for Mathematics and Computer 
Science  in  the  Netherlands,
   Amsterdam, May 2000.

Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn

reply via email to

[Prev in Thread] Current Thread [Next in Thread]