From MAILER-DAEMON Sun Mar 06 09:20:04 2005
To: ifile-discuss@nongnu.org
From: ino-qc@spotteswoode.dnsalias.org (C. Fischer)
Date: Sun, 06 Mar 2005 13:58:59 +0100
Subject: [Ifile-discuss] ifile + MIME

these days, with complicated MIME messages and the applicability of
ifile to the anti-spam domain, it seems ifile should grok MIME. qsf has
quite sensible MIME handling: only MIME types "text/*" are classified,
with HTML tags stripped and proper qp and base64 decoding.

question: would it suffice to delete matching "<...>" pairs, or do they
sometimes get escaped in some way, or is it legal to qp/base64 encode
them?

i'm thinking of taking qsf's MIME parser and adding it to ifile's lexer.

clemens

From MAILER-DAEMON Mon Mar 07 07:51:47 2005
To: ifile-discuss@nongnu.org
From: ino-qc@spotteswoode.dnsalias.org (C. Fischer)
Date: Mon, 07 Mar 2005 12:47:21 +0100
Message-ID: <3bv77k4g.fsf@ID-23066.news.dfncis.de>
Subject: [Ifile-discuss] naive bayes algorithm in ifile?

another idea i'm toying with is making a (portable) standard-prolog
implementation of naive bayes for (email/usenet) text classification.
the free prologs have improved much over the years, and i want to know
if a prolog implementation is fast enough.

given n categories, t[i]; i {1..n} tokens per category, m[i]; i {1..n}
messages per category and for every token a record (age, c:i); i {1..n},
could somebody please give a simple, english description of the
algorithm needed to classify a message?

i need to understand how token ageing can be used to keep the database
small, containing only the tokens that contribute the most to
classification and dropping the rest.

do i really need floating point operations or can i get away with
integer arithmetic? could rational numbers be a better solution?

clemens

From MAILER-DAEMON Mon Mar 07 07:52:57 2005
To: ifile-discuss@nongnu.org
From: ino-qc@spotteswoode.dnsalias.org (C. Fischer)
Date: Mon, 07 Mar 2005 12:48:13 +0100
Subject: [Ifile-discuss] usage of ifiles threshold option?

could somebody please give an example of using ifile's `-T' (--threshold)
option? i want to know how to derive a specific number for it.

clemens

From MAILER-DAEMON Mon Mar 07 07:55:33 2005
To: ifile-discuss@nongnu.org
From: ino-qc@spotteswoode.dnsalias.org (C. Fischer)
Date: Mon, 07 Mar 2005 12:46:17 +0100
Message-ID: <7jkj7k68.fsf@ID-23066.news.dfncis.de>
Subject: [Ifile-discuss] anon cvs access?

what happened to ifile's CVS repository?

,----
| /src/ifile
| 0 p2 # cvs -z0 up
|  -> main loop with CVSROOT=:pserver:anoncvs@subversions.gnu.org:/cvsroot/ifile
|  -> Connecting to subversions.gnu.org(199.232.41.3):2401
| cvs [update aborted]: connect to subversions.gnu.org(199.232.41.3):2401 failed: Operation timed out
|  -> Lock_Cleanup()
`----

has something like the host or the organization of the repo changed?

clemens

From MAILER-DAEMON Tue Mar 08 09:02:01 2005
Date: Tue, 8 Mar 2005 14:44:16 +0100
To: ifile-discuss@nongnu.org
Subject: Re: [Ifile-discuss] usage of ifiles threshold option?
Message-ID: <20050308134416.GA10036@pp>
From: Paolo

On Mon, Mar 07, 2005 at 12:48:13PM +0100, C. Fischer wrote:
> could somebody please give an example of using ifiles `-T' (--threshold)
> option? i want to know how to derive a specific number for it.

hello Clemens,

-T was introduced to allow for a 'grey zone' between the 2 winning
categories (among 2 or more in the database). I.e., in a sense, it makes
1 further bin 'on the fly', into which the test item is thrown whenever
the 2 topmost ranks are closer than the threshold, in relative terms,
according to the formula you get with --help:

    R=(r0-r1)/(r0+r1), R*1000 < THRESH if THRESH > 0.

Actually, you get 2 'grey zones', as you'd get a response like cat1,cat2
or cat2,cat1 according to which rank is the absolute max.

In spam filtering, e.g., you can do a coarse classification with a large
threshold and less computationally expensive preprocessing, then, on a
2nd pass, reprocess whatever made it into the 'unsure' bin with a
narrower threshold, better preprocessing, MIME decoding etc.

- In your previous msg you mentioned MIME processing: AFAICT, that's not
  very effective WRT spam/ham classification - see reports in other
  projects, e.g. CRM114 (crm114.sf.net) - see there as well for a link
  to 'normalizemime', a tool to mangle/sanitize an RFC [2]822 msg in
  UTF-*.

- For possible algos/how to implement BCR, besides ifile itself and
  related papers, see the comments in the crm114 code, and you may also
  want to have a look at dbacl / L.Breyer sw/site:
  http://www.lbreyer.com/emailtut.html

hope this helps - if you come up with anything new/interesting pls
report back :)

-- 
paolo
GPG/PGP id:0x21426690 kfp:EDFB 0103 A8D8 4180 8AB5 D59E 9771 0F28 2142 6690
"Indeed, it does come with warranty: it *will* fail, sometimes, somehow..." - software vendor
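
To make the --help formula quoted in the reply above concrete, here is a minimal Python sketch of the grey-zone test. It is not ifile's code: the function names, the dict of ratings, and the assumption that ratings are positive numbers where higher means a better match are all hypothetical, chosen only for illustration. How ifile actually scales or signs its ratings is not stated in this thread.

```python
def relative_gap(r0, r1):
    """R = (r0 - r1) / (r0 + r1) for the two topmost ratings (r0 >= r1).

    Assumes both ratings are positive; this is an illustrative assumption,
    not something the thread states about ifile's internal ratings.
    """
    return (r0 - r1) / (r0 + r1)


def classify_with_threshold(ratings, thresh):
    """Return [winner], or [winner, runner_up] if the result is in the grey zone.

    `ratings` is a hypothetical dict mapping category name -> rating and
    must hold at least two categories. `thresh` plays the role of THRESH.
    """
    # Rank categories from best to worst rating.
    ranked = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
    (cat0, r0), (cat1, r1) = ranked[0], ranked[1]

    # Grey zone: the top two ranks are closer than the threshold,
    # in relative terms, and thresholding is enabled (THRESH > 0).
    if thresh > 0 and relative_gap(r0, r1) * 1000 < thresh:
        return [cat0, cat1]
    return [cat0]
```

With ratings of 510 and 490, R*1000 comes to 20, so under these assumptions any threshold above 20 reports both categories (winner first) and any threshold of 20 or below reports only the winner. Deriving a number for -T then amounts to deciding how small a relative gap between the top two ranks should still count as 'unsure'.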