[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [igraph] PageRank implementations give vastly different results

From: Tamás Nepusz
Subject: Re: [igraph] PageRank implementations give vastly different results
Date: Wed, 14 May 2014 00:22:55 +0200

Hi Tim,

Thanks for the graph; I'm not done with the investigations, but I already 
checked some basic things and I have a few comments to make here.

First, don't use Graph.Read_Edgelist to read your file; use Graph.Read_NCOL 
instead. The problem is that igraph assumes that the edgelist format uses 
vertex IDs from zero to |V|-1, so you end up creating thousands of isolated 
vertices. If you use Graph.Read_Ncol, igraph will consider the numbers in the 
input file as symbolic vertex names and "make up" vertex IDs on its own, 
assigning the original IDs from the file in the "name" attribute of each vertex.

The reason why I am mentioning this is that ARPACK is thrown off course very 
easily if there are lots of isolated vertices in the graph (and also this 
distorts the PageRank results anyway because random teleportations usually end 
up in one of these isolated vertices). If you use Read_Ncol instead of 
Read_Edgelist, the results are much more favourable; ARPACK and PRPACK agree in 
about 2/3 of the cases such that the maximum difference between the PageRank 
vectors calculated by ARPACK and PRPACK is in the order of 10^{-15}. (Ignore 
the power method -- it is deprecated and unsupported, it is left there for sake 
of compatibility with older versions only). In the remaining ~1/3 of the cases, 
I suspect that ARPACK fails to converge because the sum of the PageRank scores 
is not exactly 1.0 for ARPACK (sometimes it is 0.99, but I have even seen 
1.02), and the maximum difference between ARPACK and PRPACK is *huge* -- in the 
order of 10^{12}. Further investigation shows that it is probably ARPA
CK to blame because the PageRank vector in ARPACK's case contains negative 
values -- which clearly should never happen.

I think I also know the reason why ARPACK fails to converge in some cases. We 
have seen it before many times that ARPACK is unreliable if the underlying 
graph is not strongly connected. PRPACK works around this by calculating 
PageRank scores separately for each strongly connected component and then 
combining them in a clever way to get the final PageRank vector. This 
conjecture seems to be confirmed by the fact that ARPACK and PRPACK agree 
completely in 1000 out of 1000 trials if I extract the largest strongly 
connected component of your graph -- but that contains only 20 vertices so 
that's not a big thing on its own.

The only way to actually validate whether PRPACK is correct is to see if the 
vector generated by PRPACK satisfies the PageRank equation. I have high 
confidence that this is indeed the case, but I will nevertheless try to confirm 
it numerically tomorrow or the day after.

The bottom line is:

- ignore the power method completely
- if you must choose one method out of the three right now, definitely choose 
- if you can wait a bit more, I'll let you know whether I find any issues with 
PRPACK's results or not

All the best,

reply via email to

[Prev in Thread] Current Thread [Next in Thread]