bsf-devel

last [real life] meeting's notes


From: Cristian Gutierrez
Subject: last [real life] meeting's notes
Date: Wed, 23 Jul 2003 18:32:33 -0400
User-agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux)

This is what I got:

- Alvaro suggests redoing everything in C (performance-wise, current
  implementation sucks badly).
- we're compelled to use B-Trees to store the score table. It is also
  necessary to store each word's frequencies in the spam and non-spam
  corpora separately (3-column table) [this is what ma~nungo already
  did, I guess]
- separate very common words from uncommon ones, in two B-Trees. Zipf's
  law says that the most frequently used words are just a few, and we can
  use that to speed up the filter (the frequent-words tree is small enough
  to search very fast, and we'd rarely need to search the other [bloated]
  one). We have to draw the line to establish what's a frequent word and
  what's not.
- Use a trie? (fast when held in memory, but maybe not when the trie is
  stored on disk?)
- kill some stopwords (prepositions, and whatnot), and/or short (1..3
  letters) words.
- headers:
           * smarter tokenizer (keeps IP octets, prices, phone
             numbers, etc. as single tokens)
           * prefix tokens with header ('subject::free', for example).
- HTML: follow Graham's law and save only the most 'interesting' parts:
  font colors and... something else that I can't remember now. Is there
  anything else 'interesting' in HTML spam?
- Do something with HTML comments... or even render HTML completely? [a
  regexp to strip comments is easy, maybe we should try that]
- Regarding whether we should hack one of the three available versions
  or start a new one to test new features/approaches, there was no
  agreement (Alvaro suggested that '#!/usr/bin/perl -w \n use strict;' was
  a good start, but I wouldn't even consider it :o). Hack your way out :-P
- we should bring a calculator next time in order to pay the bill
  quicker :)
- pizza was good! (and so was the 'lomito' ;-)
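
A rough sketch of the two-tree idea from the notes above, in Python only to
keep it short (not the C rewrite that was suggested). Everything named here
is made up for illustration: ScoreTable, frequent_cutoff and lookup are
hypothetical, plain dicts stand in for the two B-Trees, and the cutoff of
100 words is a guess, since where to draw the line is still open.

```python
from collections import Counter


class ScoreTable:
    """Word -> (spam_count, ham_count), split into a small and a large tier.

    Plain dicts stand in for the two B-Trees; the split is the point here,
    not the storage structure.
    """

    def __init__(self, spam_counts, ham_counts, frequent_cutoff=100):
        # Overall frequency of each word across both corpora.
        total = Counter(spam_counts) + Counter(ham_counts)
        # Assumed rule: the N most common words go into the small tier.
        hot_words = {word for word, _ in total.most_common(frequent_cutoff)}
        self.frequent = {}
        self.rare = {}
        for word in total:
            # The "3-column table": word, spam frequency, non-spam frequency.
            row = (spam_counts.get(word, 0), ham_counts.get(word, 0))
            tier = self.frequent if word in hot_words else self.rare
            tier[word] = row

    def lookup(self, word):
        # Try the small tier first; fall back to the big one only on a miss.
        row = self.frequent.get(word)
        if row is None:
            row = self.rare.get(word, (0, 0))
        return row
```

With real B-Trees on disk, the small tier could simply stay cached in
memory, so most lookups would never touch the bloated one.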

ps: I guess it would be a good idea for everyone to announce what they
are going to try. I'll go with comments, prefixes and the smartass
tokenizer.
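
To make those three items concrete, here is a minimal sketch in Python. It is
not code from any of the existing versions: STOPWORDS, TOKEN_RE,
strip_comments and tokenize are hypothetical names, the stopword list is a
placeholder, and the patterns for IPs, prices and phone numbers are
simplified guesses at what "smarter" could mean.

```python
import re

# Placeholder stopword list; the notes do not say which words to drop.
STOPWORDS = {"the", "and", "for", "with", "from", "that", "this"}

# Non-greedy match so that each HTML comment is removed separately.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"       # IPv4 address kept as one token
    r"|\$\d+(?:[.,]\d+)*"            # price such as $19.99
    r"|\d{3}-\d{3,4}(?:-\d{4})?"     # phone number such as 555-1234
    r"|[A-Za-z]+(?:'[A-Za-z]+)?"     # ordinary word, optional apostrophe
)


def strip_comments(html):
    """Remove HTML comments, which can be used to split words apart."""
    return COMMENT_RE.sub("", html)


def tokenize(text, header=None, min_len=4):
    """Return tokens from text, optionally prefixed with a header name.

    Words shorter than min_len (1..3 letters by default) and stopwords are
    dropped; IPs, prices and phone numbers are always kept. Other bare
    numbers are ignored by this sketch.
    """
    tokens = []
    for match in TOKEN_RE.finditer(strip_comments(text)):
        tok = match.group(0)
        if tok[0].isalpha():
            tok = tok.lower()
            if len(tok) < min_len or tok in STOPWORDS:
                continue
        if header:
            # Same shape as the 'subject::free' example in the notes.
            tok = f"{header.lower()}::{tok}"
        tokens.append(tok)
    return tokens
```

For example, tokenize("FREE pills for $19.99", header="Subject") gives
['subject::free', 'subject::pills', 'subject::$19.99'], and
tokenize("vi<!-- x -->agra") gives ['viagra'].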

sayonara!
(I'll be [mail-enabled] in La Serena until Monday).

-- 
Cristian Gutierrez                                 Linux user #298162
address@hidden           http://www.dcc.uchile.cl/~crgutier

"Computers are useless. They can only give you answers."
-- Pablo Picasso




