igraph-help
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [igraph] PageRank implementations give vastly different results


From: Tim Althoff
Subject: Re: [igraph] PageRank implementations give vastly different results
Date: Tue, 13 May 2014 19:36:55 -0700

Hi Tamas,

thanks for looking into it and your comprehensive response.

Is there any way to tell PRPACK what "epsilon" I want in the optimization. I verified whether the actual values satisfy the PR equation. They are not totally off but the difference between what they should be (given their neighbors) and what it actually is is 10^-5 but the PR values themselves are of the order 10^-5. So the relative errors seem pretty high.
Can I influence that precision? It's super fast anyway (100k nodes).

Thanks,
Tim


On Tue, May 13, 2014 at 3:22 PM, Tamás Nepusz <address@hidden> wrote:
Hi Tim,

Thanks for the graph; I'm not done with the investigations, but I already checked some basic things and I have a few comments to make here.

First, don't use Graph.Read_Edgelist to read your file; use Graph.Read_NCOL instead. The problem is that igraph assumes that the edgelist format uses vertex IDs from zero to |V|-1, so you end up creating thousands of isolated vertices. If you use Graph.Read_Ncol, igraph will consider the numbers in the input file as symbolic vertex names and "make up" vertex IDs on its own, assigning the original IDs from the file in the "name" attribute of each vertex.

The reason why I am mentioning this is that ARPACK is thrown off course very easily if there are lots of isolated vertices in the graph (and also this distorts the PageRank results anyway because random teleportations usually end up in one of these isolated vertices). If you use Read_Ncol instead of Read_Edgelist, the results are much more favourable; ARPACK and PRPACK agree in about 2/3 of the cases such that the maximum difference between the PageRank vectors calculated by ARPACK and PRPACK is in the order of 10^{-15}. (Ignore the power method -- it is deprecated and unsupported, it is left there for sake of compatibility with older versions only). In the remaining ~1/3 of the cases, I suspect that ARPACK fails to converge because the sum of the PageRank scores is not exactly 1.0 for ARPACK (sometimes it is 0.99, but I have even seen 1.02), and the maximum difference between ARPACK and PRPACK is *huge* -- in the order of 10^{12}. Further investigation shows that it is probably ARPA
CK to blame because the PageRank vector in ARPACK's case contains negative values -- which clearly should never happen.

I think I also know the reason why ARPACK fails to converge in some cases. We have seen it before many times that ARPACK is unreliable if the underlying graph is not strongly connected. PRPACK works around this by calculating PageRank scores separately for each strongly connected component and then combining them in a clever way to get the final PageRank vector. This conjecture seems to be confirmed by the fact that ARPACK and PRPACK agree completely in 1000 out of 1000 trials if I extract the largest strongly connected component of your graph -- but that contains only 20 vertices so that's not a big thing on its own.

The only way to actually validate whether PRPACK is correct is to see if the vector generated by PRPACK satisfies the PageRank equation. I have high confidence that this is indeed the case, but I will nevertheless try to confirm it numerically tomorrow or the day after.

The bottom line is:

- ignore the power method completely
- if you must choose one method out of the three right now, definitely choose PRPACK
- if you can wait a bit more, I'll let you know whether I find any issues with PRPACK's results or not

All the best,
Tamas





reply via email to

[Prev in Thread] Current Thread [Next in Thread]