last [real life] meeting's notes
From: Cristian Gutierrez
Subject: last [real life] meeting's notes
Date: Wed, 23 Jul 2003 18:32:33 -0400
User-agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux)
This is what I got:
- Alvaro suggests redoing everything in C (performance-wise, the current
implementation sucks badly).
- we're compelled to use B-Trees to store the score table. It is also
necessary to store each word's frequencies in the spam and non-spam
corpora separately (3-column table: word, spam frequency, non-spam
frequency) [this is what ma~nungo already did, I guess]
- separate very common words from uncommon ones, in two B-Trees. Zipf's
law says that the most frequently used words are just a few, and we can
use that to speed up the filter (the frequent-words tree is small enough
to search very fast, and we rarely ever need to search the other
[bloated] one). We have to draw the line to establish what's a frequent
word and what's not.
- Use a trie? (fast in memory, but maybe not with a trie saved on disk?)
- kill some stopwords (prepositions, and whatnot), and/or short (1..3
letters) words.
- headers:
  * smarter tokenizer (keeps IP octets, prices, phone
    numbers, etc)
  * prefix tokens with the header name ('subject::free', for example).
- HTML: follow Graham's law and save only the most 'interesting' parts:
font colors and... something else that I can't remember now. Is there
anything else 'interesting' in HTML spam?
- Do something with HTML comments... or even render HTML completely? [a
regexp to strip comments is easy, maybe we should try that]
- Regarding whether we should hack one of the three versions available
or start a new one to test new features/approaches, there was no
agreement (Alvaro suggested that '#!/usr/bin/perl -w \n use strict;' was
a good start, but I don't even consider it :o). Hack your way out :-P
- we should bring a calculator next time in order to pay the bill
quicker :)
- pizza was good! (and so was the 'lomito' ;-)
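
A rough sketch of the two-tree idea from the notes above, in Python only for
illustration (nothing was decided about the language). Plain dicts stand in for
the two B-Trees, and the class name, method names and the promotion cutoff are
all made up here; where to draw the frequent/uncommon line is still open.

```python
class TwoTierTable:
    """Hypothetical word table: a small 'frequent' store checked first,
    and a large 'rare' store used as a fallback. Each entry is the
    3-column row: word -> [spam_count, non_spam_count]."""

    def __init__(self, frequent_cutoff=100):
        # Assumed threshold: total occurrences needed to count as 'frequent'.
        self.frequent_cutoff = frequent_cutoff
        self.frequent = {}  # stand-in for the small B-Tree
        self.rare = {}      # stand-in for the big [bloated] B-Tree

    def add(self, word, is_spam, n=1):
        """Record n occurrences of word in the spam or non-spam corpus."""
        table = self.frequent if word in self.frequent else self.rare
        counts = table.setdefault(word, [0, 0])
        counts[0 if is_spam else 1] += n
        # Promote a word once it crosses the cutoff.
        if table is self.rare and sum(counts) >= self.frequent_cutoff:
            self.frequent[word] = self.rare.pop(word)

    def lookup(self, word):
        """Return (spam_count, non_spam_count), trying the small store first."""
        counts = self.frequent.get(word)
        if counts is None:
            counts = self.rare.get(word, [0, 0])
        return tuple(counts)
```

With a cutoff of 3, for instance, a word seen twice in spam and once in
non-spam mail ends up in the frequent store and looks up as (2, 1).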
ps: I guess it would be a good idea for everyone to broadcast what they're
going to try. I'll go with comments, prefixes and the smartass
tokenizer.
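
A minimal sketch of what those three pieces could look like together, again in
Python just for illustration. The regexps, the tiny stopword list and the
function name are assumptions, not anything agreed at the meeting; the
'header::token' prefix and the 1..3 letter cutoff come from the notes.

```python
import re

# Hypothetical stopword list; a real one would be longer and per-language.
STOPWORDS = {"from", "with", "that", "this", "into", "have"}

# Non-greedy so that each comment is stripped on its own.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Order matters: IPs, prices and phone numbers are tried before plain words,
# so they survive as single tokens.
TOKEN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"   # IPv4 address, octets kept together
    r"|\$\d+(?:[.,]\d+)*"        # price such as $9.99
    r"|\+?\d[\d-]{5,}\d"         # phone-like run of digits and dashes
    r"|[A-Za-z][A-Za-z'-]*"      # ordinary word
)

def tokenize(text, header=None, min_len=4):
    """Strip HTML comments, split into tokens, drop stopwords and words
    shorter than min_len, and prefix each token with the header name."""
    text = HTML_COMMENT.sub("", text)
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group(0).lower()
        # Length and stopword filters apply to words only, not to numbers.
        if tok[0].isalpha() and (len(tok) < min_len or tok in STOPWORDS):
            continue
        tokens.append(f"{header}::{tok}" if header else tok)
    return tokens
```

Called as tokenize("FREE offer for you", header="subject"), this gives
['subject::free', 'subject::offer']. Stripping comments before tokenizing also
rejoins words that were split with a comment in the middle.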
sayonara!
(I'll be [mail-enabled] in La Serena until Monday).
--
Cristian Gutierrez Linux user #298162
address@hidden http://www.dcc.uchile.cl/~crgutier
"Computers are useless. They can only give you answers."
-- Pablo Picasso