
From: Chris Petersen
Subject: Re: [Pan-users] RFC: Detecting multiparts (was: .94 weirdness with detecting attachments)
Date: 08 Aug 2003 12:04:33 -0700

>  * likely_binary_group is true if the newsgroup name contains
>    any of: "binaries", "fan", "mag", "sex", false otherwise

Don't forget the plethora of misspelled ones, too: "binaires", etc.
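
A rough sketch of how misspellings could be caught, in Python rather than whatever Pan uses internally. The function name, the 0.8 cutoff, and the choice to fuzzy-match only "binaries" are my assumptions, not anything from the proposal:

```python
from difflib import get_close_matches

# Keywords from the proposal; matched as plain substrings.
KEYWORDS = ("binaries", "fan", "mag", "sex")

def likely_binary_group(group: str) -> bool:
    """Guess whether a newsgroup name looks like a binaries group."""
    name = group.lower()
    if any(keyword in name for keyword in KEYWORDS):
        return True
    # Assumption: only "binaries" is worth fuzzy-matching, and 0.8 is an
    # arbitrary similarity cutoff. Each dot-separated component is compared.
    return any(
        get_close_matches(component, ["binaries"], n=1, cutoff=0.8)
        for component in name.split(".")
    )
```

A looser cutoff would catch more typos but would also start matching unrelated words, so the threshold would need tuning against real group lists.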

>  * likely_binary_subject is true if the Subject: header contains
>    any of: "jpeg" "jpg" "gif" "tiff" "png", false otherwise

Also avi, ogm, mpe?g, mp[23], etc., etc.  How about a Perl regex (perlre) instead:  \w\.\w{2,4}
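
Something like the following, sketched in Python's `re` syntax (which agrees with Perl for these patterns). The function name and the exact extension list are illustrative only; note that the generic filename pattern is deliberately broad and will also fire on things like version numbers:

```python
import re

# Explicit extensions: the proposal's list plus the ones suggested above.
EXTENSION_RE = re.compile(
    r"\b(?:jpe?g|gif|tiff|png|avi|ogm|mpe?g|mp[23])\b", re.IGNORECASE
)
# Generic "looks like a filename" pattern: a word character, a dot,
# then a 2 to 4 character extension.
FILENAME_RE = re.compile(r"\w\.\w{2,4}")

def likely_binary_subject(subject: str) -> bool:
    """Guess whether a Subject: header names a file or a media type."""
    return bool(EXTENSION_RE.search(subject) or FILENAME_RE.search(subject))
```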

>  * part = 0, or if either "(x/y)" or "[x/y]" is in Subject:, then x.
>    (Work backwards from the end of the string, in case someone's
>    posting a set of multiparts and (x/y) appears in the Subject: 
>    twice)

Also:  "x of y"
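
A sketch of the part-number rule covering all three forms, again in Python with made-up names. It takes the last marker in the Subject:, which has the same effect as working backwards from the end of the string:

```python
import re

# Matches "(x/y)", "[x/y]" or "x of y". Exactly one of groups 1, 3, 5
# holds the part number x, depending on which form matched.
PART_RE = re.compile(
    r"\((\d+)/(\d+)\)|\[(\d+)/(\d+)\]|\b(\d+)\s+of\s+(\d+)\b",
    re.IGNORECASE,
)

def parse_part(subject: str) -> int:
    """Return x from the last part marker in the subject, or 0 if none."""
    matches = list(PART_RE.finditer(subject))
    if not matches:
        return 0
    last = matches[-1]
    return int(last.group(1) or last.group(3) or last.group(5))
```

So a subject such as "pics set (2/5) - img01.jpg (3/7)" yields part 3, not 2.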

>   4. if is_binary is true,
>      and is_reply is true,
>      and the part is 0 or 1,
>      then it's probably a follow-up to a multipart (I've never seen a
>      followup to a part > 1).
>      set is_binary to false.
>      UNLESS: once in a blue moon people will post binaries as follow-ups,
>      so hedge our bets:
>      leave is_binary as true if lines > 500.

This is problematic, since people post binaries as replies to REQ
messages (which would probably end up counting as part=0) all the time.

The number of lines is also problematic.  I often run across binary
articles that show as having 0, 2, 10, or some other small number of
lines, but have 100k+ of data in them.
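
To make the objection concrete, here is rule 4 exactly as quoted, written out in Python (the function name is mine). A binary posted as a reply to a REQ, with no part marker and a small reported line count, gets flipped to non-binary:

```python
def apply_reply_rule(is_binary: bool, is_reply: bool, part: int, lines: int) -> bool:
    """Rule 4 as quoted: demote likely follow-ups unless they are long."""
    if is_binary and is_reply and part in (0, 1) and lines <= 500:
        # Treated as a text follow-up to a multipart.
        return False
    return is_binary

# A real binary answering a REQ: no (x/y) in the subject, so part = 0,
# and the line count is reported as 10 even though the body is large.
misclassified = apply_reply_rule(is_binary=True, is_reply=True, part=0, lines=10)
```

Here `misclassified` ends up False, which is the wrong answer for that article.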

Other than that (and not knowing how Pan does it now), it all looks GREAT.
