[Pan-users] Re: Creating an local archive of subscribed groups?
From: Duncan
Subject: [Pan-users] Re: Creating an local archive of subscribed groups?
Date: Mon, 23 Aug 2010 06:48:58 +0000 (UTC)
User-agent: Pan/0.133 (House of Butterflies; GIT a971f44 branch-testing)
Jurgen Defurne posted on Sun, 22 Aug 2010 18:10:14 +0200 as excerpted:
> I am a regular user of Pan for some high technical newsgroups.
>
> What I would like is to have the contents of these groups as a local
> archive which can be searched using Pan.
>
> I have already tried two ways to do this. The first one was using 'Cache
> Article' after selecting all articles, but it seems that when the cache
> gets beyond a certain size, older cached articles disappear.
>
> I am now trying with 'Save Articles...', but this creates one file,
> which cannot be incrementally updated.
>
> What other (simple, preferably) possibilities do there exist, not
> necessarily using Pan for storage, but certainly for reading and
> searching?
You're running into pan's default cache size limit, 10 MB. That setting,
like several others, *IS* available in the config files, but is not made
available in the GUI, basically because while pan only requires gtk+, it's
a gnome family app, and gnome in general caters to the "simple" users who
are apparently afraid of too many config options, even when they'd be
seriously useful for some users! (FWIW, that's one reason that despite
all the problems with kde4, I'm still a kde user -- kde's comparable
policy is to create a generally sane default, but expose far more options
in the configuration for those who wish to use them. But knode doesn't
handle binaries as well as pan does and klibido handles binaries but not
text, and I'm not sure if it was ported to kde4, either, so pan it is.)
Anyway, desktop environment politics aside...
As you may know, pan's config and data are stored in ~/.pan2/ by default.
In that directory (or whatever one you have pan's files stored in, if
you've made use of the PAN_HOME environment variable to point pan at a
different location), find preferences.xml. As usual, if you're going to
edit config files, do so with the app you're editing the config for, pan
in this case, closed.
In preferences.xml, the preferences are grouped by type, and then
alphabetically by name. Look for type int, name "cache-size-megs".
Make it whatever integer number of megs you like.
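
As a rough sketch of that edit, the Python function below changes the entry
without hand-editing. It assumes preferences.xml parses as ordinary XML and
holds an int element named cache-size-megs; the function name is made up for
illustration and is not part of pan. ElementTree rewrites the whole file, so
quoting style and any comments may change: copy preferences.xml somewhere
safe first, and run this only while pan is closed.

```python
import xml.etree.ElementTree as ET


def set_cache_size(prefs_path, megs):
    """Set cache-size-megs in a pan preferences.xml.

    Returns True if the entry was found and rewritten, False if the file
    has no such entry (in which case the file is left untouched).
    """
    tree = ET.parse(prefs_path)
    for elem in tree.getroot().iter("int"):
        if elem.get("name") == "cache-size-megs":
            # The value is stored as a whole number of megabytes.
            elem.set("value", str(int(megs)))
            tree.write(prefs_path, encoding="utf-8", xml_declaration=True)
            return True
    return False
```

Calling set_cache_size with the expanded path to ~/.pan2/preferences.xml and
5120 would ask for a cache of roughly 5 gigs.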
Here, I make use of the PAN_HOME environment variable I mentioned to run
multiple pan "instances", each pointed at a different data dir. The way I
have it set up, I have one for text groups, one for binaries, and a third
for testing, but of course, you can split it up however you like. I
mention this by way of explaining how and why I have multiple
preferences.xml files, each with a different cache size.
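
A minimal sketch of such a launcher in Python follows. The only thing taken
from pan itself is the PAN_HOME variable; the function names, the assumption
that the binary is on the PATH as "pan", and any directory names you pass in
are illustrative.

```python
import os
import subprocess


def pan_env(data_dir, base_env=None):
    """Return a copy of the environment with PAN_HOME set to data_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = os.path.abspath(os.path.expanduser(data_dir))
    return env


def launch_pan(data_dir, pan_binary="pan"):
    """Start pan against data_dir and return without waiting for it to exit."""
    # Create the data dir up front so a new instance has somewhere to write.
    os.makedirs(os.path.expanduser(data_dir), exist_ok=True)
    return subprocess.Popen([pan_binary], env=pan_env(data_dir))
```

For example, launch_pan("~/.pan2-text") and launch_pan("~/.pan2-binaries")
would give two instances, each with its own preferences.xml and cache.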
For my text groups instance, I have:
<int name='cache-size-megs' value='5120'/>
Since those groups are mostly text and I've set the expiration to none for
the servers in that instance, I have posts going back years in some groups:
to when the pan C++ rewrite was introduced with 0.90 (it changed file
formats for a number of things), and further back than that on some
gmane.org groups/lists. (Gmane is a list2news archive and gateway,
presenting a whole bunch of mailing lists as newsgroups, with unexpiring
posts.) Even so, the cache is still only ~2 gig, so I'm a long way
from maxing it out.
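
To see how close a cache is to its limit, something like the following
Python sketch could be used. It simply totals file sizes under a directory;
where pan keeps its cached articles (assumed here to be a subdirectory of
the pan data dir, which you pass in yourself) and whether pan counts a "meg"
as 1024*1024 bytes are both assumptions, and the function names are made up.

```python
import os


def dir_size_megs(path):
    """Total size of regular files under path, in units of 1024*1024 bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Count real files only; skip symlinks so nothing is counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total / (1024 * 1024)


def cache_headroom(cache_dir, limit_megs):
    """Return (used_megs, fraction_of_limit_used) for a cache directory."""
    used = dir_size_megs(cache_dir)
    return used, used / limit_megs
```

With the numbers above, a ~2 gig cache against a 5120 meg limit would come
out at a fraction of about 0.4.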
The test instance is, I think, still at default. I use a separate test
instance so I can visit groups without subscribing, say if someone reports
a problem post that I want to try, without pan storing information in my
other instances about groups I don't really care about and am not
subscribed to.
The binaries instance has a cache on a dedicated 12 gig partition, so I've
set its cache size to an arbitrary number, a bit above 12 gigs.
<int name='cache-size-megs' value='12500'/>
And while I've not actually done binaries in some time (it seems I've just
too many other things I find interesting to do, and just never get to it),
I have actually tested that 12 gig a few times, some years ago. Pan
handles it fine, or at least did, back then.
So provided you set unexpiring for your server(s), you shouldn't have a
problem setting a cache size into the double-digit gigs if necessary, or
maintaining an archive going back as far as you can get messages, without
them expiring locally, just because they expire on whatever server you're
using.
The one caveat I have noticed is that the more data you keep around, the
longer pan takes to load up, especially from cold disk cache. My way
around that has been to assign pan its own dedicated desktop (kwin allows
you to configure specific apps to always appear on a specific desktop, and
that's what I do with pan), and to start it when I start X/KDE, keeping it
running pretty much all the time I'm in X, so it only shuts down when I
shut down X/KDE. If you like, you can put pan on its own partition, and
periodically back it up, then wipe the partition and copy everything back,
thus defragging it, speeding up initial load.
Also, I run a 4-disk kernel/md RAID-1 now, but previously ran a RAID-6,
which with four spindles, is effectively two-way striped for read access.
To my surprise, reading multiple files as is the case when pan is loading,
the kernel is good enough at scheduling parallel I/O on the RAID-1 that it
NOTICEABLY shrank my load time when I switched to that, as compared to the
RAID-6. I had thought that the RAID-6 would be faster due to the
effective two-way striping for read access, but I was wrong: the kernel's
good enough at scheduling on the RAID-1 that it apparently keeps all four
disks reading data in parallel, so pan loads faster from mirrored RAID
than from striped RAID.
....
That's one option, all-pan. The other option would be to run a personal
news-server installation, like leafnode. Leafnode would download the
messages to your local disk and store them there, then serve them locally
to pan. Doing it this way, you could leave pan's cache size untouched (or
maybe even shrink it), and point it at your local server instead of the
remote. You'd still set pan not to expire articles, so it'd keep its
article index intact, but it wouldn't need a big cache, since it's pulling
from the local leafnode (or whatever) server anyway. You'd then set
leafnode to unexpiring as well, so it continued to retain articles back as
far as you could get.
One advantage to this, if you're doing enough binaries that you're waiting
on pan to download, anyway, is that the local server would presumably be
running all the time in the background, downloading messages as they came
in, so they'd always be available virtually instantly from pan, since
they're already stored locally. No waiting on the network connection to
the server.
But that's probably not that significant an issue unless you're still on
an analog modem dialup connection. With anything much faster than that,
if you're downloading enough data that you're waiting on the network,
you'll quickly be looking for more room for your archive, which will
soon measure in terabytes, not gigabytes.
However, there's another possible advantage, as well. Pan's loadup should
be faster if it's only caching the default 10 MB.
.....
Meanwhile, pan does have one serious limitation, in terms of search (and
of scoring). It only scores and searches the message overviews --
basically, the information in the header pane, author, subject, etc (tho
message-ids are also in the overviews and form the basis for the watch/
kill/score thread feature). Pan is unable to score or search on actual
message content. If you're happy with pan's searching already, and just
need a larger cache to search on, that's fine. But do be aware that if
you do want/need to search on message content, you'll need to use
something else.
Of course, with kde's nepomuk/strigi indexing (what I'm familiar with
since I run kde), or beagle (AFAIK the gnome indexer), or google-desktop
indexing, or whatever, you can point that at either the pan cache (for
option one above) or leafnode's spool (for option two), and get full
content indexing, if that's what you want. You can then open the matching
file using whatever editor you have associated, find the subject and date
info, and use pan to view the whole thread in context, if desired.
So pan can still be used to view the thread, once you find a post that
interests you based on content. It's just that if you do want full post
content search, not just subject/author search, you'll need to use
something else for the initial search, and can then open the thread in pan
if you like. If that's a limitation you can live with, great. Otherwise,
you should probably look for a different news client, one with full post
search capability.
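
If a full desktop indexer is more than you want, a plain scan of the stored
articles can do the initial content search. The Python sketch below is one
way to do that, under the assumption that the directory you point it at (the
pan cache or a local server's spool) holds one raw article per file, headers
followed by body; that layout is an assumption, not something confirmed
here, and the function name is made up. It reports the Subject and Date of
each match so the thread can then be found in pan.

```python
import os
from email import message_from_binary_file, policy


def search_articles(cache_dir, needle):
    """Yield (path, subject, date) for articles whose text body has needle.

    The match is a case-insensitive substring test on the plain-text body.
    """
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    msg = message_from_binary_file(fh, policy=policy.default)
                body = msg.get_body(preferencelist=("plain",))
                text = body.get_content() if body is not None else ""
            except Exception:
                continue  # skip anything that does not parse as an article
            if needle in text.lower():
                yield path, str(msg.get("Subject", "")), str(msg.get("Date", ""))
```

This rereads every file on each search, so it is only practical for
archives of modest size; an indexer earns its keep beyond that.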
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman