bsf-devel

Re: last [real life] meeting's notes


From: Aldrin Martoq
Subject: Re: last [real life] meeting's notes
Date: Thu, 24 Jul 2003 02:52:22 -0400
User-agent: Mutt/1.3.28i

[Notes reordered]

On Wed, Jul 23, 2003 at 06:32:33PM -0400, Cristian Gutierrez wrote:

Rewrite issues:
> - Alvaro suggests redoing everything in C (performance-wise, the current
>   implementation sucks badly).
> - Regarding whether we should hack on one of the three available versions
>   or start a new one to test new features/approaches, there was no
>   agreement (Alvaro suggested that "#!/usr/bin/perl -w \n use strict;" was
>   a good start, but I don't even consider it :o). Hack your way out :-P

I agree with a major rewrite, but disagree about a specific language: some
parts in perl/python, others in C; everything hooked to a cleaner interface [1].



Database optimizations:
> - we're compelled to use B-Trees to store the score table. It is also
>   necessary to store frequencies in the spam and non-spam corpora
>   separately for each word (3-column table) [this is what ma~nungo
>   already did, I guess]
> - separate very common words from uncommon ones, in two B-Trees. Zipf's
>   law says that the most frequently used words are just a few, and we can
>   use that to speed up the filter (the frequent-words tree is small enough
>   to search very fast, and it's rarely necessary to search the other
>   [bloated] one). We have to draw the line to establish what's a
>   frequent word and what's not.
> - Use a trie? (fast in memory, but maybe not when the trie is stored on
>   disk?)
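
A minimal sketch of the 3-column score table described above, assuming
Python and SQLite as a stand-in store (SQLite keeps tables and indexes in
B-trees, but nothing in the notes says bsf uses it). The table name, column
names, and the helpers train() and counts() are all hypothetical:

```python
import sqlite3

# Hypothetical schema: one row per word, with separate spam and non-spam hits.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores ("
    " word TEXT PRIMARY KEY,"
    " spam_hits INTEGER NOT NULL DEFAULT 0,"
    " ham_hits INTEGER NOT NULL DEFAULT 0)"
)


def train(conn, words, is_spam):
    """Add one hit per occurrence to the spam or non-spam column."""
    column = "spam_hits" if is_spam else "ham_hits"
    for word in words:
        conn.execute("INSERT OR IGNORE INTO scores (word) VALUES (?)", (word,))
        conn.execute(
            f"UPDATE scores SET {column} = {column} + 1 WHERE word = ?", (word,)
        )


def counts(conn, word):
    """Return (spam_hits, ham_hits) for a word, or (0, 0) if never seen."""
    row = conn.execute(
        "SELECT spam_hits, ham_hits FROM scores WHERE word = ?", (word,)
    ).fetchone()
    return row if row else (0, 0)


train(conn, ["free", "viagra", "free"], is_spam=True)
train(conn, ["free", "meeting"], is_spam=False)
```

Keeping both counts in one row means a single lookup per token returns
everything needed to score it.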

About low/high frequency words: just split 20-80% according to hits. This
means a new task for the system: database maintenance.

Agree :-)
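
One possible reading of the 20-80% split, sketched in Python: rank words by
total hits and put the top 20% of distinct words in the small "frequent"
table, the rest in the large one. That interpretation, the 0.2 default, and
the names split_by_hits() and lookup() are assumptions, and plain dicts
stand in for the two B-Trees:

```python
def split_by_hits(hits, frequent_share=0.2):
    """Split a {word: total_hits} dict into (frequent, rare) dicts.

    The top `frequent_share` of distinct words, ranked by hits, go to
    the frequent table; everything else goes to the rare table.
    """
    ranked = sorted(hits, key=hits.get, reverse=True)
    cutoff = max(1, int(len(ranked) * frequent_share)) if ranked else 0
    frequent = {w: hits[w] for w in ranked[:cutoff]}
    rare = {w: hits[w] for w in ranked[cutoff:]}
    return frequent, rare


def lookup(word, frequent, rare):
    """Check the small table first; fall back to the large one."""
    if word in frequent:
        return frequent[word]
    return rare.get(word, 0)


hits = {"the": 900, "free": 120, "click": 80, "meeting": 15, "zipf": 2}
frequent, rare = split_by_hits(hits)
```

Re-running split_by_hits() periodically as the counts change would be the
database maintenance task mentioned above.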



Tokenizing:
> - headers: 
>            * smarter tokenizer (keeps IP octets, prices, phone
>              numbers, etc)

Sure!

>            * prefix tokens with header ('subject::free', for example).

I prefer more than one database: one for headers, another for body...
another for images, yet another for x-face headers.
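
A rough sketch of the tokenizer ideas above, in Python. The regular
expression and the tokenize() helper are illustrative assumptions, not the
project's code: the pattern keeps IPv4 addresses, dollar prices, and dashed
phone numbers whole, and the optional header name becomes a
'subject::free'-style prefix:

```python
import re

# Alternatives are tried left to right, so the specific shapes come first.
TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"   # IPv4 address, octets kept together
    r"|\$\d+(?:[.,]\d+)*"        # price such as $19.99
    r"|\+?\d[\d-]{5,}\d"         # phone-like run of digits and dashes
    r"|[A-Za-z0-9'-]+"           # ordinary word
)


def tokenize(text, header=None):
    """Return lowercase tokens, prefixed with 'header::' when given."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if header:
        return [f"{header.lower()}::{t}" for t in tokens]
    return tokens


subject_tokens = tokenize(
    "FREE offer $19.99 call 555-123-4567 from 192.168.0.1", header="Subject"
)
body_tokens = tokenize("Meeting moved to Thursday")
```

With separate databases instead of prefixes, the same header argument
could select which store receives the unprefixed tokens.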



Filtering magic:
> - kill some stopwords (prepositions, and whatnot), and/or short (1..3
>   letters) words.

Hmmm... I would agree if there is proof of real gain for the system, rather
than "It seems a good idea(tm)". IN ALL CASES, not just my inbox...
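
To measure whether it helps, the filter could be a switchable step so the
same corpus can be scored with and without it. A small Python sketch; the
stopword list is a placeholder, not a vetted one, and drop_noise() is a
hypothetical name:

```python
# Placeholder list for illustration only.
STOPWORDS = {"the", "and", "from", "with", "about"}


def drop_noise(tokens, stopwords=STOPWORDS, min_len=4):
    """Drop stopwords and tokens shorter than min_len (1..3 letters by default)."""
    return [t for t in tokens if len(t) >= min_len and t.lower() not in stopwords]


sample = ["the", "free", "from", "offer", "now", "with", "viagra"]
filtered = drop_noise(sample)
# Passing no stopwords and min_len=1 turns the filter off for comparison.
unfiltered = drop_noise(sample, stopwords=(), min_len=1)
```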

> - HTML: follow Graham's law and save only the most 'interesting' parts:
>   font colors and... something else that I can't remember now. Is there
>   anything else 'interesting' in html spam?
> - Do something with html comments... or even render html completely? [a
>   regexp to strip comments is easy, maybe we should try that]

Same as above.
I would render it through "lynx -dump" ...
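
For the comment-stripping option, a minimal Python sketch of the regexp
approach (the function name is made up for illustration). It only removes
<!-- ... --> comments, so words split by a comment are rejoined; it does
not render the HTML the way lynx would:

```python
import re

# Non-greedy match so each comment is removed on its own; DOTALL lets a
# comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Remove HTML comments, leaving tags and text untouched."""
    return COMMENT_RE.sub("", html)


cleaned = strip_html_comments("fr<!-- x -->ee <b>mo<!--\nsplit\n-->ney</b>")
```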


> ps: I guess that it would be a good idea to broadcast the stuff anyone
> is going to try. I'll go with comments, prefixes and the smartass
> tokenizer.


[1] http://mail.nongnu.org/archive/html/bsf-devel/2003-07/msg00010.html

I'm going to do this.



-- 
Aldrin [nunca vio television porque es muy fome]



