igraph-help
[Top][All Lists]

## Re: [igraph] Compare clusterings (partitions of a set)

 From: Stijn van Dongen Subject: Re: [igraph] Compare clusterings (partitions of a set) Date: Mon, 26 Jul 2010 00:16:19 +0100 User-agent: Mutt/1.4.2.2i

```      Hi all,

On Sat, Jul 24, 2010 at 09:49:08AM -0700, Richard Bowman wrote:
> Hi Albert,
>
> Tamas and Stijn are correct: it is definitely not a trivial problem. I
> did this in my dissertation with the "clue" cluster ensemble R package
> and the cl_agreement(...) and cl_dissimilarity(...) functions, which
> provide a large range of choices for comparing two partitions of a
> set, which is what the membership vectors are. I chose the Jaccard and
> corrected Rand similarities, but most methods give very similar
> results, qualitatively. Review the literature referenced in the manual
> for information about what the different measures mean, because there
> are many ways to compare. I attached some code for this below.
>
> Code:
> library("clue")
>
> x <- as.hard.partition(c(1,1,1,1,2,2,2,2))
> y <- as.hard.partition(c(3,3,3,3,5,5,5,5))
> z <- as.hard.partition(c(1,2,3,4,5,6,7,8))
>
> cl_agreement(x,y,method = cRand)
> cl_agreement(x,z,method = cRand)
> cl_agreement(y,z,method = cRand)

This is something I am quite interested in. There are two metric dissimilarities
(also called metrics or distances) between clusterings that I think are very
useful.  One is the 'variation of information' measure by Meila, also provided
in cl_dissimilarity.

The other is the lesser known 'split/join' distance by myself, not provided by
cl_dissimilarity. What is interesting in the current context is that this
split/join distance has a very basic interpretation as the number of nodes that
need to be moved to obtain one clustering from the other.  The split/join
distance has been analysed as part of a cohort of measures by Meila in one or
two of her papers. See e.g.

[1] Marina  Meila.  Comparing Clusterings - An Axiomatic View.  In
Proceedings
of the 22nd International Conference on Machine Learning, Bonn, Germany,
2005.

My take (largely aligned with Meila's I believe) is this:

-  Distances between clusterings are preferable to indices and similarities. It
is important that the triangle inequality is satsified. I have never really
understood measures that yield zero under some null model. The set of
clusterings forms a beautiful lattice structure, and informative metric
distances exist that are aligned with that lattice (see [1]).  Basically,
this leaves the Mirkin distance, the Variation of Information distance, and
the split/join distance (again [1]).

-  Measures and distances based on counting pairs are hugely affected by the
sizes of
the clusters involved. This makes it very hard to use them to validate and
compare
uniformly across different granularity levels, and leads to counterintuitive
results
in extreme cases. This crosses out the Mirkin distance.

-  The variation of information measure is more informative than the split/join
distance, as it takes into account the full information present in the
contingency table.  In comparison, the split/join distance only takes into
account for each cluster C in the first clustering the best matching cluster
in the second clustering (and vice versa), disregarding the presence and
possible fragmentation of other matches.

-  The split/join distance is easily interpretable as the number of
nodes that need changing to obtain one clustering from the other, and
fragmentation is in practice uncommon.

In conclusion, I usually opt for either the variation of information distance
or the split/join distance.

best,
Stijn

P.S. the split/join distance was introduced in this technical report:

[2] Stijn  van  Dongen.  Performance  criteria for graph clustering and
Markov cluster experiments. Technical
Report INS-R0012, National Research Institute for Mathematics and Computer
Science  in  the  Netherlands,
Amsterdam, May 2000.

--
Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn

```