
From:  Tim Althoff 
Subject:  Re: [igraph] PageRank implementations give vastly different results 
Date:  Tue, 13 May 2014 19:36:55 0700 
Hi Tim,
Thanks for the graph; I'm not done with the investigations, but I already checked some basic things and I have a few comments to make here.
First, don't use Graph.Read_Edgelist to read your file; use Graph.Read_NCOL instead. The problem is that igraph assumes that the edgelist format uses vertex IDs from zero to V1, so you end up creating thousands of isolated vertices. If you use Graph.Read_Ncol, igraph will consider the numbers in the input file as symbolic vertex names and "make up" vertex IDs on its own, assigning the original IDs from the file in the "name" attribute of each vertex.
The reason why I am mentioning this is that ARPACK is thrown off course very easily if there are lots of isolated vertices in the graph (and also this distorts the PageRank results anyway because random teleportations usually end up in one of these isolated vertices). If you use Read_Ncol instead of Read_Edgelist, the results are much more favourable; ARPACK and PRPACK agree in about 2/3 of the cases such that the maximum difference between the PageRank vectors calculated by ARPACK and PRPACK is in the order of 10^{15}. (Ignore the power method  it is deprecated and unsupported, it is left there for sake of compatibility with older versions only). In the remaining ~1/3 of the cases, I suspect that ARPACK fails to converge because the sum of the PageRank scores is not exactly 1.0 for ARPACK (sometimes it is 0.99, but I have even seen 1.02), and the maximum difference between ARPACK and PRPACK is *huge*  in the order of 10^{12}. Further investigation shows that it is probably ARPA
CK to blame because the PageRank vector in ARPACK's case contains negative values  which clearly should never happen.
I think I also know the reason why ARPACK fails to converge in some cases. We have seen it before many times that ARPACK is unreliable if the underlying graph is not strongly connected. PRPACK works around this by calculating PageRank scores separately for each strongly connected component and then combining them in a clever way to get the final PageRank vector. This conjecture seems to be confirmed by the fact that ARPACK and PRPACK agree completely in 1000 out of 1000 trials if I extract the largest strongly connected component of your graph  but that contains only 20 vertices so that's not a big thing on its own.
The only way to actually validate whether PRPACK is correct is to see if the vector generated by PRPACK satisfies the PageRank equation. I have high confidence that this is indeed the case, but I will nevertheless try to confirm it numerically tomorrow or the day after.
The bottom line is:
 ignore the power method completely
 if you must choose one method out of the three right now, definitely choose PRPACK
 if you can wait a bit more, I'll let you know whether I find any issues with PRPACK's results or not
All the best,
Tamas
[Prev in Thread]  Current Thread  [Next in Thread] 