
Re: [Pan-devel] Replace custom data structures with standard ones?

From: Duncan
Subject: Re: [Pan-devel] Replace custom data structures with standard ones?
Date: Mon, 11 Feb 2013 12:51:55 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT 2e9e07c /usr/src/portage/src/egit-src/pan2)

Rui Maciel posted on Mon, 11 Feb 2013 10:45:00 +0000 as excerpted:

> A significant number of custom/non-standard data structures are used in
> Pan.  In some cases these non-standard data structures are apparently
> redundant and used without adding much value (i.e., Loki::AssocVector Vs
> std::map, pan::sorted_vector Vs std::set).   Also, C++11 introduced a
> couple of standard data structures (std::array, std::unordered_map).
> Considering this, and considering that these custom data structures date
> back to a time when C++ standard containers were still freshly
> implemented and therefore were notorious for their, say, non-optimal
> performance, and that including them affects the project's
> maintainability, would Pan's maintainers be open to the idea of
> replacing them with their C++ standard counterparts?

While I'm not enough of a coder to go far beyond hand-waving, I've been 
following pan for over a decade now (my first posts to the pan lists, at 
least those that gmane has archived, were in 2H2002)...

The biggest concern is by far scalability.  Pre-C++-rewrite pan (through 
0.14.x; the rewrite was introduced with 0.90) did use much more standard 
gtk widget data interfaces, but it had severe scalability problems with 
groups of over a couple hundred thousand headers.  AFAIK, with a bit of 
tweaking over the years, that rose to a couple million headers, but it 
was stuck there even as memory and CPU resources increased, because it 
simply hit a barrier at about that point and would sit for an hour or 
longer trying to digest headers above that.

With the C++ rewrite introduced with 0.90, Charles scrapped much of the 
earlier approach and introduced the current custom scheme, with its 
symbolic-shorthand approach for selected common field strings, to 
conserve memory.  That allowed pan to scale to typically tens of millions 
of headers per group, and to keep near-linear scaling up to hundreds of 
millions of headers per group given the memory to do so, I believe.  
However, someone calculated that with giganews retention, for example, 
pan would still need something like 12 GB of RAM on some of their groups 
(or was it 20?  I remember it was well above the 8 gig I had at the 
time; I've 16 gig now, but retention has continued increasing as well).  
That's obviously well beyond 32-bit (that was the context of that 
discussion: someone running 32-bit was wondering why pan was crashing; it 
was simply running out of memory!), and a bit of an issue even for 
today's typical 64-bit machine, where 16 gig is still considered quite 
high end.

Really, the sticking point is that pan constructs its entire tree in 
memory.  There has long been talk of a database backend using sqlite or 
mysql or the like, such that pan only works with a few pages' worth of 
data in memory at once while the rest stays in the db, but I think 
Charles wasn't a db expert and hesitated to go there.  Now, of course, 
Charles has moved on, and it's Heinrich who has been doing most of the 
new features and heavy coding of late.  He too has mentioned that he 
plans a db backend at some point, but I've no clue whether he has a 
branch he's hacking on to that end yet or not...
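[For what it's worth, the db-backend idea usually gets sketched along 
these lines: headers live in an indexed table on disk, and pan fetches 
only the screenful the header pane is showing.  A purely hypothetical 
sqlite schema -- table and column names are my invention, not anything 
from an actual pan branch:]

```sql
-- Hypothetical, illustrative schema: one row per header, indexed so the
-- header pane can page through a group without loading the whole tree.
CREATE TABLE headers (
    id         INTEGER PRIMARY KEY,
    group_id   INTEGER NOT NULL,
    message_id TEXT    NOT NULL UNIQUE,
    subject    TEXT    NOT NULL,
    author     TEXT    NOT NULL,
    date       INTEGER NOT NULL,                  -- unix time
    parent_id  INTEGER REFERENCES headers(id)     -- threading
);
CREATE INDEX headers_by_group_date ON headers (group_id, date);

-- The UI would then fetch one screenful at a time, e.g.:
-- SELECT subject, author, date FROM headers
--   WHERE group_id = ? ORDER BY date DESC LIMIT 50 OFFSET ?;
```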

What I'm saying is... switching to standard data structures and the like 
may have value, particularly if it scales memory even better than the 
current scheme, but be absolutely sure you keep scaling in mind, 
anticipating and testing new code against sufficiently busy groups, on 
servers with enough retention to give you groups of at least 200 million 
headers to test against.

But any really serious data-structure rewrite should almost certainly 
target a database backend of some sort, as in reality that's about the 
only way to address the whole scalability thing once and for all.  While 
I have no idea where Heinrich might be on the database backend work, I'd 
guess he at /least/ has some rough ideas about implementation, so it'd 
probably be a good idea to discuss that with him (and there are a couple 
of others who might be interested) before diving in head first.

Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
