Re: last [real life] meeting's notes
From: Aldrin Martoq
Subject: Re: last [real life] meeting's notes
Date: Thu, 24 Jul 2003 02:52:22 -0400
User-agent: Mutt/1.3.28i
[Notes reordered]
On Wed, Jul 23, 2003 at 06:32:33PM -0400, Cristian Gutierrez wrote:
Rewrite issues:
> - Alvaro suggests redoing everything in C (performance-wise, current
> implementation sucks badly).
> - Regarding whether we should hack some of the three versions available
> or start a new one to test new features/approaches, there was no
> agreement (Alvaro suggested that "#!/usr/bin/perl -w \n use strict;" was
> a good start, but I don't even consider it :o). Hack your way out :-P
I agree with a major rewrite. I disagree about picking a specific language:
some parts in Perl/Python, others in C; everything hooked to a cleaner
interface [1].
Database optimizations:
> - we're compelled to use B-Trees to store the score table. It is also
> necessary to store frequencies in the spam and non-spam corpora separately
> for each word (3-column table) [this is what ma~nungo already did, I
> guess]
> - separate very common words from uncommon ones, in two B-Trees. Zipf's
> law says that the most frequently used words are just a few and we can
> use that to speed up the filter (frequent-words-tree is small enough to
> search through it very fast, and it's rarely ever needed to search in
> the other [bloated] one). We have to draw the line to establish what's
> a frequent word and what not.
> - Use a trie? (fast in memory, but maybe not when the trie is stored on disk?)
About low/high frequency words: just split 20-80% according to hits. This
means a new task for the system: database maintenance.
Agree :-)
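
The two-table idea above can be sketched roughly as follows. This is only an illustration under assumptions: plain dicts stand in for the B-Trees, each row is a (spam_count, ham_count) pair, and "split 20-80% according to hits" is read as "the frequent table holds the most-hit words until they cover about 80% of all hits". The names split_by_hits and TwoTierTable are made up for this example.

```python
from collections import Counter


def split_by_hits(hits, share=0.8):
    """Split words into (frequent, rare) sets.

    Words are taken in descending hit order until the frequent set
    covers about `share` of all hits (an assumed reading of the
    20-80% rule).
    """
    total = sum(hits.values())
    frequent, rare, covered = set(), set(), 0
    for word, count in hits.most_common():
        if covered < share * total:
            frequent.add(word)
            covered += count
        else:
            rare.add(word)
    return frequent, rare


class TwoTierTable:
    """Word -> (spam_count, ham_count), kept in two tables.

    Dicts are stand-ins for the two B-Trees discussed in the notes.
    """

    def __init__(self, counts, hits, share=0.8):
        frequent, _ = split_by_hits(hits, share)
        self.frequent = {w: c for w, c in counts.items() if w in frequent}
        self.rare = {w: c for w, c in counts.items() if w not in frequent}

    def lookup(self, word):
        # Try the small table first; fall back to the big one only on a miss.
        row = self.frequent.get(word)
        if row is None:
            row = self.rare.get(word, (0, 0))
        return row
```

The maintenance task would then amount to rebuilding the table from fresh hit counts every so often, since words drift between the two sets over time.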
Tokenizing:
> - headers:
> * smarter tokenizer (keeps IP octets, prices, phone
> numbers, etc)
Sure!
> * prefix tokens with header ('subject::free', for example).
I prefer more than one database: one for headers, another for body...
another for images, yet another for x-face headers.
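
A minimal sketch of such a tokenizer, assuming that "keeping" IPs, prices and phone numbers means emitting each as a single token. The regular expression is a guess at reasonable patterns, not anything agreed at the meeting, and the names TOKEN_RE and tokenize are hypothetical. The optional prefix produces tokens in the 'subject::free' style.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first.
TOKEN_RE = re.compile(r"""
    \d{1,3}(?:\.\d{1,3}){3}     # IPv4 address, kept as one token
  | \$\d+(?:[.,]\d+)*           # price such as $19.99
  | \+?\d[\d-]{5,}\d            # phone-like run of digits and dashes
  | [A-Za-z0-9'-]+              # ordinary word
""", re.VERBOSE)


def tokenize(text, prefix=None):
    """Return lowercased tokens, optionally prefixed as 'prefix::token'."""
    tokens = [m.group(0).lower() for m in TOKEN_RE.finditer(text)]
    if prefix:
        tokens = [f"{prefix}::{t}" for t in tokens]
    return tokens
```

For example, tokenize("Free offer", prefix="subject") gives ['subject::free', 'subject::offer']. With one database per message part, the same function could be called without a prefix and its output stored in the table for that part.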
Filtering magic:
> - kill some stopwords (prepositions, and whatnot), and/or short (1..3
> letters) words.
Hmmm... I would agree if there is proof of real gain for the system, rather
than "It seems a good idea(tm)". IN ALL CASES, not just my inbox...
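
To measure whether this helps at all, the filter could be kept as a separate step that is easy to switch on and off between test runs. A small sketch, assuming a hand-picked stopword list (the one below is only a placeholder) and reading "short" as 1 to 3 characters; drop_noise is a hypothetical name.

```python
# Placeholder list; a real one would have to be chosen and tested per language.
STOPWORDS = frozenset(
    {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}
)


def drop_noise(tokens, stopwords=STOPWORDS, min_len=4):
    """Drop stopwords and tokens shorter than `min_len` characters."""
    return [
        t for t in tokens
        if t.lower() not in stopwords and len(t) >= min_len
    ]
```

Running the filter over the same corpora with and without this step would give the kind of evidence asked for above.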
> - HTML: follow Graham's law and save only the most 'interesting' parts:
> font colors and... something else that I can't remember now. Is there
> anything else 'interesting' in HTML spam?
> - Do something with HTML comments... or even render HTML completely? [a
> regexp to strip comments is easy, maybe we should try that]
Same as above.
I would render it through "lynx -dump" ...
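
The comment-stripping regexp mentioned in the notes could look like the sketch below. It assumes well-formed `<!-- ... -->` comments and does nothing about other HTML tricks; strip_comments is a made-up name.

```python
import re

# Non-greedy match so each comment is removed on its own;
# DOTALL lets a comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Remove HTML comments, e.g. those used to split a word in two."""
    return COMMENT_RE.sub("", html)
```

This turns "vi<!-- x -->agra" back into "viagra" before tokenizing. Rendering through lynx needs the external binary and is not shown here.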
> ps: I guess that it would be a good idea to broadcast the stuff anyone
> is going to try. I'll go with comments, prefixes and the smartass
> tokenizer.
[1] http://mail.nongnu.org/archive/html/bsf-devel/2003-07/msg00010.html
I'm going to do this.
--
Aldrin [never watched television because it is very boring]