Re: [Pan-users] Kill files


From:	Duncan
Subject:	Re: [Pan-users] Kill files
Date:	Tue, 25 Apr 2017 07:41:24 +0000 (UTC)
User-agent:	Pan/0.142 (He slipped to Sam a double gin; 505bd7027)

Dieter Britz posted on Mon, 24 Apr 2017 12:00:15 +0200 as excerpted:

> People talk about setting up a kill file for posters to news groups that
> annoy others, by off topic postings etc. Is it possible to do that with
> pan?

This repeats the same idea as the replies by HH, DG and Pedro in the 
other subthread, but with a bit more explanation of what pan's actually 
doing and why, and why it's like binary-choice killfiling (killfiled or 
not) but better. =:^)

First, let's understand the difference between a fine-grained scoring 
mechanism like pan's and a hard binary or trinary filter mechanism.

With scoring, the effects of many scoring rules can, if desired, be 
applied together to arrive at a final score for a post.  That score can 
then be used to apply some action: simply hiding the post, marking it 
read, or deleting it, or on the other end, highlighting it with various 
colors depending on how high it scores, automatically downloading the 
post to cache, or saving its attachments.

A hard filter instead acts immediately on the first filter that applies, 
to either kill the post (generally hide and mark-read, sometimes delete, 
depending on the implementation) or not.  In the trinary case there's 
also a watch flag (and perhaps auto-download, depending on 
implementation) if the post isn't killed.
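
To make the hard-filter side concrete, here's a minimal Python sketch.  
It is not pan's code or any real client's; the filter list, the example 
addresses and the name first_match_filter are all made up for 
illustration.

```python
import re

# Hypothetical filter list: (header, regex, verdict). Illustrative only.
FILTERS = [
    ("from", r"spammer@example\.com", "kill"),
    ("subject", r"pan", "watch"),
]

def first_match_filter(post, filters=FILTERS):
    """Return the verdict of the first filter that matches, else 'show'."""
    for header, pattern, verdict in filters:
        if re.search(pattern, post.get(header, ""), re.IGNORECASE):
            # First match wins; later filters are never consulted.
            return verdict
    return "show"
```

A post from the example spammer is killed even if its subject would have 
matched the watch filter, because the second filter is never reached.  
That's exactly the limitation scoring avoids.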

So in pan, a score of -9999 is defined as ignored.  That's what binary 
filters would filter out, also known as killing, thus the term killfile.

And a score of +9999 is defined as watched.

Meanwhile, FWIW, there's a number of other preset score category levels 
as well.  These can be seen under the view menu, header pane.  Here's the 
full listing, lowest to highest:

-9999 (or lower): Ignored

Either multiple scoring rules combined to result in the message being 
ignored, *OR* a single scoring rule set ignored/-9999 and stopped further 
score processing.

By default pan doesn't display these messages, but doesn't take any other 
action (marking them read, deleting them, etc).

-9998 to -1: Low

One or more scoring rules lowered the message score into negative 
territory, but not enough to make it ignored.

0: Default

Of course 0 is the default score, if no scoring rules apply, or if the 
scoring rules exactly balance each other out.

1 to 4999: Medium

The result of one or more scoring rules was a moderate scoring boost, to 
less than 5000/high, however.

There's an option to display these in a different color, but I don't 
believe it's on by default.  (FWIW I've been running pan since 2002, a 
decade and a half now, and long ago forgot what the defaults were for 
many of the options I've customized.)

5000 to 9998: High

The result of one or more scoring rules was a higher scoring boost, more 
than 4999, but less than 9999.

Again, there's an option to display these in a different color, but I 
don't believe it's on by default.

9999 (or higher): Watched

Either multiple scoring rules resulted in a score at or above 9999, *OR* 
a single scoring rule set it to watched/9999 and stopped further scoring 
rule processing.

Pan should display these in a different color, by default I believe.  
There are options (off by default) that allow auto-downloading or the 
like.
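
The category boundaries listed above can be restated as a small Python 
function.  This is only a restatement of the listing, not pan's actual 
implementation, and the function name score_category is my own.

```python
def score_category(score: int) -> str:
    """Map a final score to the category names from the listing above."""
    if score <= -9999:
        return "ignored"
    if score < 0:
        return "low"
    if score == 0:
        return "default"
    if score < 5000:
        return "medium"
    if score < 9999:
        return "high"
    return "watched"
```

So score_category(-1) gives "low", score_category(5000) gives "high", 
and anything at or beyond +/-9999 lands in watched/ignored.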


As you should already see, scoring allows a far richer and more nuanced 
setup than arbitrary binary kill/show or trinary kill/show/watch 
filters.  But by using the watched/ignored options only, which basically 
set +9999/-9999 respectively and stop further score processing, you can 
have a simpler binary or trinary setup if you wish.

It's up to you. =:^)

Meanwhile, as I already mentioned, there are choices under view, header 
pane, to match (or not) each of these scoring categories separately.  
Again under view, header pane, pan can then be set to display either 
explicitly matched posts, matched posts and their subthreads, or matched 
posts and their entire threads, as desired.
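
Here's how I read those three display choices, as a Python sketch.  The 
data layout (a dict mapping each post to its parent) and the mode names 
are assumptions of mine for illustration, and pan's real behavior may 
differ in corner cases.

```python
def visible_posts(parents, matched, mode):
    """parents maps post id -> parent id (None for a thread root).
    matched is the set of ids in the selected score categories.
    mode is 'posts', 'subthreads' or 'threads' (hypothetical names)."""
    def ancestors(pid):
        chain = []
        while pid is not None:
            chain.append(pid)
            pid = parents[pid]
        return chain  # the post itself first, its thread root last

    if mode == "posts":
        return set(matched)
    if mode == "subthreads":
        # Shown if the post itself or anything it replies to matched.
        return {p for p in parents if any(a in matched for a in ancestors(p))}
    if mode == "threads":
        # Shown if anything in the same thread matched.
        roots = {ancestors(m)[-1] for m in matched}
        return {p for p in parents if ancestors(p)[-1] in roots}
    raise ValueError(f"unknown mode: {mode}")
```

With a thread a -> b -> c, a second reply a -> d, and only b matched, 
the three modes show b alone, b and c, or all of a, b, c and d.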

It's up to you. =:^)

And in the preferences dialog (edit menu, preferences), on the colors 
tab, you can set the colors for each scoring category.

It's up to you. =:^)

(Tho do note that these days, pan only shows those colors in the score 
column, not the entire line as it used to do.  So you have to have the 
score column in your listing or you won't see the colors.  I preferred it 
coloring the entire line, but oh, well, I'm a user, not a dev... and 
unfortunately, that's NOT a user available option.  As I'm writing this, 
however, I'm wondering just how hard it might be to find that and patch 
it to whole line, tho.  I /am/ an advanced enough user that even tho I 
don't claim to be a dev, I can /sometimes/ work out patches on my own, 
and as I run gentoo, I normally build everything from sources and can and 
often do apply my own patches or those I've picked up from others to 
various packages, including pan.  So I'll have to look into patching 
this...)


OK, so you can set whether the various score categories are displayed or 
not, and if displayed, you can set the color per category, but what about 
more practical score-based actions?  In particular, for those who track 
things via marked-read, and who don't have pan set to automatically mark 
everything in the group read when they fetch headers or leave a group, 
not displaying ignored posts AND not having them automatically marked 
read is frustrating, because then they hang around, still marked unread!

Of course if you've been paying attention, you already know the answer, 
as I mentioned it above.

It is (of course) up to you! =:^)

(Noticing the trend yet? =:^)

Preferences dialog, actions tab.

One possible setup might be:

Delete articles scoring at:     -9999 or less   (ignored)

This would auto-delete ignored articles.

Mark articles read scoring at:  -9998 to -1     (low/negative)

This would auto-mark-read negative/low-scoring articles, but wouldn't 
delete them.  The idea here is to let you hide them by default (by 
showing only unread), but still keep them around in case you see a reply 
and you want to see the message it's replying to.

(I /believe/ it'll mark anything read UNDER the named category as well, 
so it would mark ignored articles read too, if they're not deleted with 
the earlier option, above.  But I'm not actually sure on this bit.)

Alternatively, if you don't delete ignored articles, you can simply mark 
them read, and still show negative/low-scoring articles that aren't 
entirely ignored.

Cache articles scoring at:      1 to 4999       (medium)

Of course you can set this to high/5000-9998 or watched/9999 instead, if 
that fits your needs better.

The idea is that if an article is sufficiently highly scored, you want it 
cached for you so it's already there when you would otherwise have to 
download it to cache.

Do be aware that pan's cache size is pretty small, 10 MB by default, and 
especially if you're doing binaries and using this setting, you'll 
probably want a larger cache.  That's set in preferences, on the behavior 
tab.

(Again, I /believe/ it'll do the same with the higher categories, high 
and watched, too, but I've not actually tested it to be sure.)

Download attachments of articles scoring at:    Disabled

If you're doing binaries, you might want to set this instead of the cache 
option.
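
The example setup above, as a Python sketch.  The function and parameter 
names are made up, and it bakes in the behavior I only /believe/ pan has: 
that each action also applies to the categories beyond the named one 
(lower for mark-read, higher for cache).  Treat that as an assumption.

```python
def actions_for(score, delete_at=-9999, mark_read_at=-1, cache_at=1):
    """Return the actions the example setup would apply to one article.

    ASSUMPTION: thresholds cascade outward (untested belief, see above).
    Attachment download is left out because it's disabled in the example.
    """
    if score <= delete_at:
        return ["delete"]  # a deleted article needs nothing further
    actions = []
    if score <= mark_read_at:
        actions.append("mark_read")
    if score >= cache_at:
        actions.append("cache")
    return actions
```

An ignored article is deleted, a -500 article is marked read but kept, a 
zero-score article is left alone, and anything positive gets cached.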

Generally, people download binaries using one of two strategies.

Here, I prefer to have pan's cache set way big, and download messages to 
cache first, so they're local.  Then when they're already cached so I 
won't be waiting for the download, I can go thru and sort out what I 
really want, saving it where I want it, and deleting what I don't really 
want.  This works best for (relatively) small binaries that you will 
download many hundreds or thousands of, like still images or audio clips 
mostly under 10 minutes in length, with the occasional longer audio clip 
or short video.  It also requires a much larger cache setting (on the 
order of gigabytes, for me), or pan will start deleting messages that 
were previously downloaded to cache but are still unread, to make room 
for the newest messages still downloading to cache.

For that binaries strategy or for text messages, the auto-download-to-
cache action exists.  Just be aware of the cache size requirements and 
adjust it accordingly.
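
Why the cache size matters can be seen in a toy size-bounded cache.  
This Python sketch is not pan's cache code; the class name and the 
strict oldest-first eviction order are assumptions chosen to mirror the 
behavior described above.

```python
from collections import OrderedDict

class ArticleCache:
    """Toy cache that evicts the oldest articles to make room for new ones."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = OrderedDict()  # message-id -> size, oldest first

    def add(self, msg_id, size):
        """Store an article and return the ids evicted to make room."""
        evicted = []
        if msg_id in self.items:
            self.used -= self.items.pop(msg_id)
        while self.items and self.used + size > self.max_bytes:
            old_id, old_size = self.items.popitem(last=False)
            self.used -= old_size
            evicted.append(old_id)
        # An article bigger than the whole cache is still stored, alone.
        self.items[msg_id] = size
        self.used += size
        return evicted
```

With a 10-unit cache, adding three 4-unit articles pushes the first one 
out before you ever got to read it, which is the problem a bigger cache 
setting avoids.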

The other strategy, which is obviously pan's default given the very small 
10 MB default cache size, is to have pan download and save off the 
binaries immediately, without caring at all about the messages they're 
attached to.  Because the attachments are saved immediately and the 
messages they were attached to don't matter, those messages can be 
deleted from cache as soon as the attachment is saved, so this requires a 
far smaller cache and pan's default 10 MB cache suffices.

This works best for very large binaries, typically half-hour or longer 
videos like TV series episodes or feature-length movies.  It works best 
if you don't care about the messages containing the attachments at all 
(no discussion of the series, etc), since unless you increase the size of 
the cache anyway, they'll be deleted effectively immediately after the 
attachment processing is completed.

It is for this binaries strategy that the auto-download-(and-save)-
attachments action exists.  Obviously this isn't going to work too well 
if your interest is primarily text groups (and people post binaries there 
too, and the messages score high enough for the action to trigger), 
because you'll end up with a bunch of random binaries that happened to be 
attached to watched or whatever level scoring messages saved off to 
wherever you have pan saving them.


OK, but what about the scoring itself?

First of all, the watch (thread) and ignore (thread or author) entries on 
the articles menu are the GUI method to create scoring rules that set the 
+/-9999 score and abort further score processing.

Next, there's the edit article's watch/ignore/score and add a scoring 
rule entries, again on the articles menu.  These bring up a dialog, 
either directly (for add) or indirectly (for edit, using the add button 
there), that lets you setup a more detailed scoring rule.  This is more 
flexible than the arbitrary watch/ignore options above, allowing you to 
match various options and if matched either set a specific score and 
abort further scoring as the above watch/ignore options do, or 
alternatively, to simply add/subtract whatever score and continue 
processing further scoring rules.  You can also set an expiry for the 
rule, if desired, or make it permanent.

It's this last option, to add/subtract some score value and continue 
processing more scoring rules, that's where the real flexibility comes 
in.  You can match on multiple subject keywords in multiple rules, adding 
or subtracting based on the match, then add/subtract based on author, 
then do some more based on references (effectively the thread, except 
that sometimes message-ids are dropped from the references header and 
the rule then no longer matches the thread), then subtract points if it's 
cross-posted/spammed to too many groups, and add or subtract more points 
based on size in bytes or line count.

As long as no match sets an arbitrary score and stops further processing, 
all these matches will result in a final score that combines the effects 
and the relative scoring weight of all the others, and pan uses that 
final score to decide what scoring category the message belongs in, and 
thus whether to show it and how, as well as what automated actions to 
apply.
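
Here's the combined-scoring idea as a minimal Python sketch.  It is not 
pan's scoring code; the rule list, the post fields (crosspost_count, 
lines) and the example addresses are all invented for illustration.

```python
import re

def header_matches(header, pattern):
    """Build a predicate doing a case-insensitive regex match on one header."""
    rx = re.compile(pattern, re.IGNORECASE)
    return lambda post: bool(rx.search(str(post.get(header, ""))))

# Made-up rules: (predicate, value, stop).
RULES = [
    (header_matches("from", r"troll@example\.net"), -9999, True),  # ignore author
    (header_matches("subject", r"\bpan\b"), 500, False),
    (header_matches("subject", r"\bscor(e|ing)\b"), 300, False),
    (lambda post: post.get("crosspost_count", 1) > 3, -400, False),
    (lambda post: post.get("lines", 0) > 500, -200, False),
]

def score_post(post, rules=RULES):
    """Add up every matching rule, unless one sets a score and stops."""
    total = 0
    for predicate, value, stop in rules:
        if not predicate(post):
            continue
        if stop:
            return value  # set the score outright, abort further processing
        total += value
    return total
```

A post titled "Pan scoring question" that was cross-posted to five 
groups gets 500 + 300 - 400 = 400, so it lands in medium: interesting 
subject, but docked for the cross-posting.  The same post from the 
example troll is simply -9999.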

See how much richer a good scoring system is, compared to arbitrary 
binary/trinary-choice filtering on just ONE match-factor?

Of course if that's too complex for you, just use the watch/ignore and be 
done with it.

It's up to you. =:^)


Meanwhile, as the others suggested, the real advanced stuff is reserved 
for those who choose to directly edit the scorefile itself.  They posted 
the link to the format description.

http://www.slrn.org/docs/score.txt

But, keep in mind that the link above is for a different news client, 
slrn, which shares a general scorefile format with pan.  Unfortunately, 
however, pan's score-processing code isn't quite as advanced as slrn's, 
so some of the more complex stuff described there doesn't work in pan.  
Pan hasn't implemented the include statement, for instance, so don't try 
to use it.  The {} grouping logic isn't implemented either, AFAIK.

And, pan hasn't implemented the score keyword's single-colon AND logic, 
so single or double colon doesn't matter, it's always interpreted as OR 
(double-colon).  This is unfortunate, but the effect can be partially 
counteracted by simply creating multiple conditions, each of which gives 
partial points.  So instead of an AND score with five conditions to meet 
and a +1000 value, you can use pan's OR scoring on each of the five 
conditions, with a +200 value on each.  The total if all match will still 
be +1000, but of course the result is less predictable when only some 
conditions match, since those partial points then combine with the 
partial points from any other would-be compound that also only partly 
matched.
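
The arithmetic of that workaround, as a Python sketch.  The five 
conditions are invented, and the two functions only model the two 
behaviors; neither is pan's or slrn's actual code.

```python
import re

# Five made-up conditions: (header, regex). Not from any real scorefile.
CONDITIONS = [
    ("subject", r"pan"),
    ("subject", r"scor"),
    ("subject", r"killfile"),
    ("from", r"alice"),
    ("from", r"example\.org"),
]

def hits(post, conditions=CONDITIONS):
    """Count how many conditions match, case-insensitively."""
    return sum(
        1 for header, pattern in conditions
        if re.search(pattern, post.get(header, ""), re.IGNORECASE)
    )

def and_score(post, value=1000, conditions=CONDITIONS):
    """AND logic: every condition must match to earn the value."""
    return value if hits(post, conditions) == len(conditions) else 0

def split_or_score(post, value=1000, conditions=CONDITIONS):
    """The workaround: each condition alone earns an equal share."""
    return hits(post, conditions) * (value // len(conditions))
```

When all five match, both give 1000.  When only two match, real AND 
logic would give 0 but the workaround gives 400, and that's the 
leftover score that can interact with other rules.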

Another difference is that pan's scoring matches are always case 
insensitive.  So don't worry about John vs. JOHN vs. john vs. JoHN, the 
same regex will match them all without any fancy regex footwork.
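
For comparison, this is what the same case-insensitive behavior looks 
like in Python's re module.  It's only an analogy, since Python's regex 
engine is a stand-in here and not what pan uses, and the helper name 
matches_author is made up.

```python
import re

def matches_author(pattern, author):
    """Case-insensitive regex search, mimicking how pan treats score regexes."""
    return re.search(pattern, author, re.IGNORECASE) is not None

names = ["John", "JOHN", "john", "JoHN"]
results = [matches_author(r"john", name) for name in names]
```

Every entry in results is True: one lowercase pattern covers all four 
spellings.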


Some additional scorefile format notes:

* Unfortunately for some, understanding regular expressions is really 
necessary to take full advantage of scoring, particularly when editing 
the scorefile itself, but it's worth it, and pan's GUI does allow simple 
scoring even if you don't know regex.

It's up to you. =:^)

* The note in section 1.1 recommending that one stick to the overview 
headers (typically subject/from/date/message-id/references/bytes/lines 
and often xref), but allowing others, most definitely applies.  
Unfortunately it's a technical limitation of the protocol, not something 
pan (or slrn or any other news client) can do anything about.

The thing is that pan can score headers in the overview without 
downloading the full message (or full headers).  For the most part, 
that's the headers needed to display the message in the headers pane, 
author, subject, date, etc, plus message-id and references for threading 
and tracking across multiple servers, etc.  But for the more exotic 
headers, pan won't get them, and thus can't score them, until the article 
is downloaded to cache.

So if you have an abuser that keeps nym-shifting and otherwise 
deliberately changing everything in the headers he has access to, in 
order to try to avoid killfiling, but who always posts thru a provider 
that adds an xtrace header with a consistent value you can score on, you 
*CAN* score on it, but you'll have to download the messages to cache 
first.

Take it from someone who was once in the position of trying to killfile 
a poster like that, before pan could score such non-overview headers: 
being able to ignore-score it only after downloading to cache sucks, but 
it definitely sucks less than having to actually show the message in 
order to see who it is and block it!
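
The overview limitation boils down to a simple check, sketched here in 
Python.  The header set is the typical list given above (servers can 
vary), and the function name can_score_now is made up.

```python
# Typical overview headers per the note above; actual servers may differ.
OVERVIEW_HEADERS = {
    "subject", "from", "date", "message-id",
    "references", "bytes", "lines", "xref",
}

def can_score_now(rule_header, article_cached):
    """A rule can be evaluated if its header is in the overview data,
    or if the full article (and so all its headers) is already cached."""
    return rule_header.lower() in OVERVIEW_HEADERS or article_cached
```

So a rule on From can be applied right after fetching headers, while a 
rule on X-Trace only takes effect once the article is in cache.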


* Note that while you can set an expiry on the score in the pan GUI, and 
at that point pan will indeed quit applying that score, it won't actually 
remove it from the scorefile.  The only way to actually remove the score 
from the scorefile is to manually edit it.

Unfortunately, this does mean that if you actively add expiring scoring 
rules and never manually remove them, eventually your scorefile will be 
cluttered with perhaps hundreds or thousands of expired rules, and 
they'll begin to affect scorefile loading performance.  Pan still has to 
process each one at least far enough to see that it's expired, and then 
to work out how much to skip before the beginning of the next possibly 
still valid rule.

So you'll probably want to either clear out the scorefile and start new 
occasionally, or manually edit it to at least clean out the expired rules 
from time to time, or simply not use expiring scores at all, living with 
the annoyance unless it's worth a permanent rule.
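
If you go the manual cleanup route, a small script can at least point 
out which rules have expired.  This Python sketch only reports line 
numbers and changes nothing.  It assumes Expires: lines use the 
MM/DD/YYYY or DD-MM-YYYY date forms from the slrn format description; 
check what your own scorefile actually contains before trusting it.  
The function name expired_lines is made up.

```python
import re
from datetime import date

# ASSUMPTION: dates are MM/DD/YYYY (slashes) or DD-MM-YYYY (dashes).
EXPIRES_RE = re.compile(
    r"^\s*Expires:\s*(\d{1,2})([/-])(\d{1,2})\2(\d{4})\s*$", re.IGNORECASE
)

def expired_lines(scorefile_text, today):
    """Return (line_number, expiry_date) for each Expires: line before today."""
    found = []
    for number, line in enumerate(scorefile_text.splitlines(), start=1):
        m = EXPIRES_RE.match(line)
        if not m:
            continue  # comments (leading %) and other lines never match
        first, sep, second = int(m.group(1)), m.group(2), int(m.group(3))
        month, day = (first, second) if sep == "/" else (second, first)
        try:
            expiry = date(int(m.group(4)), month, day)
        except ValueError:
            continue  # not a real calendar date, leave it for a human
        if expiry < today:
            found.append((number, expiry))
    return found
```

Run it over the scorefile's text with today's date, then delete the 
rules it points at by hand in your editor.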

* Yes, an initial % on a line *DOES* mean it's a comment.

By implication, most of the lines pan adds when you add a score via the 
GUI are comments and don't matter for the actual scoring at all.  They're 
only there to aid human readers.

Of course that means you can edit or delete them as you wish, without 
affecting actual operation.

Here, I tend to delete pretty much all of pan's added comments, with the 
exception of the date added comments for expiring scores, since that way 
I can see how long I had set the expiry.
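
Since comments are just lines with an initial %, separating them from 
the lines that actually matter is easy to script.  A minimal Python 
sketch, with a made-up function name and a made-up sample:

```python
def split_comments(scorefile_text):
    """Split scorefile lines into (comments, active) by a leading %."""
    comments, active = [], []
    for line in scorefile_text.splitlines():
        (comments if line.startswith("%") else active).append(line)
    return comments, active

sample = "% added by hand\n[alt.test]\nScore: -100\nSubject: foo"
comments, active = split_comments(sample)
```

Here comments holds the single % line and active holds the other three, 
which are the only ones that affect scoring.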

* If you do heavy scoring with lots of rules, using pan's GUI to set them 
up isn't particularly efficient for machine processing.  The example in 
the linked documentation is somewhat more efficient, but it's too short 
to really get the point across.  If you're planning to do a lot of manual 
scorefile editing or simply want to make your scorefile more efficient, 
either check past scoring threads for this list/group (the list is 
available as a newsgroup on news.gmane.org) where I've posted a longer 
example from my scorefile, or ask for such an example.

* Similarly, if you're not good with regular expressions and need some 
help designing a score that's more complex than you can easily do with 
the pan GUI, or if something's just not working as you expected it to, 
with scoring or something else, ask for help.  We've dealt with a number 
of such queries over the years. =:^)


OK, so hope that's of help.  Some people just want an answer to plug in 
without understanding it.  Others want to understand what's going on, so 
next time they want to do something similar but not identical, they can 
figure out how to do it themselves.  I'm certainly in this latter group, 
and my posts tend to go to the extreme in explaining things.  That 
frustrates the first group, but I've stacks of thanks from people who 
preferred the better understanding my explanatory if extremely verbose 
style gave them, and sometimes I get new insights or ideas (like possibly 
patching the score coloring to the whole line instead of just the score 
column, above) as I'm writing things down, and it's the combination of 
both of those that's my motivation to keep posting as I do. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman