From MAILER-DAEMON Sun Jul 06 22:22:11 2003
Received: from list by monty-python.gnu.org with archive (Exim 4.20)
    id 19ZLdi-0001Jg-CG
    for mharc-ifile-discuss@gnu.org; Sun, 06 Jul 2003 22:22:02 -0400
Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.20)
    id 19ZLdZ-0000nq-Vj
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:21:54 -0400
Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.20)
    id 19ZLdU-0000Vb-F7
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:21:49 -0400
Received: from rhombus.bright.net ([209.143.0.75])
    by monty-python.gnu.org with esmtp (Exim 4.20)
    id 19ZLdQ-0000M1-N4
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:21:44 -0400
Received: from raptor1.jonadab.homeip.net.bright.net
    (craw-cas3-cs-10.dial.bright.net [209.143.57.147] (may be forged))
    by rhombus.bright.net (8.12.9/8.12.9) with ESMTP id h672LbA9021631
    for ; Sun, 6 Jul 2003 22:21:40 -0400 (EDT)
Sender: root@jonadab.homeip.net
To: ifile-discuss@nongnu.org
Subject: Re: [Ifile-discuss] Re: html tag stripping
References: <20030625184805.GA23935@bushong.net>
    <20030625210856.GA28111@bushong.net>
    <20030626170129.GD28111@bushong.net>
Mail-Copies-To: never
X-Face: H&pa\VUTt$a$JF\\/7ZattR4}wR#!rbu?UnpF(ecZ2Y}ah3O+Y3i,y{P0FB3QuS)5W-e`a=.$=lJHsVA-HrF=[C#x|}J~Bvkm!,/t>Uqu-i*P-
    !n*f[\a/i(pIGO0-hW~b+R~Sm2mq1+0%ecgvl[21tIf+BZs-i>qu-\?1tia5pw6bWuXIZ.vLy}kE0m
    \mx)~/dx_?
Date: 06 Jul 2003 22:21:26 -0400
In-Reply-To: <20030626170129.GD28111@bushong.net>
Message-ID: <65mfjcux.fsf@jonadab.homeip.net>
Lines: 48
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2.93
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-BeenThere: ifile-discuss@nongnu.org
X-Mailman-Version: 2.1b5
Precedence: list
List-Id:
List-Help:
List-Post:
List-Subscribe: ,
List-Archive:
List-Unsubscribe: ,
X-List-Received-Date: Mon, 07 Jul 2003 02:22:00 -0000

David Bushong writes:

> It's not too hard to do with a preprocessor, but you want to do
> things like skip headers, not go too far into large MIME files, etc.

You don't just want to skip headers; I posit that you want to use
information from the headers to determine what (if any) preprocessing
to do.

For example, you almost certainly do NOT want to strip what you think
are HTML tags from text/plain content. You'd end up stripping out all
sorts of things you didn't intend to: email addresses in angle
brackets, words that are marked up in POD (any word that's bold, any
word that's in italics (e.g., the name of the module probably), and so
forth), pseudo-HTML intended to be read by humans as part of the
message, code snippets in discussions of XML data or similar items,
things being compared (inequalities) in pseudocode or math, things
between certain types of smilies, and who knows what.

Perhaps more significantly, you don't generally see a spammer sending
HTML/SGML/XML/XSLT/RDF/XUL/etc as text/plain, because if they did the
user would see all the ugly illegible markup, which isn't what the
spammer wants, normally.

However, stripping or in some way processing tags from text/html
content might have significant merit.

This raises the question of whether you also want to plaintext-ise
other common non-plaintext mail formats -- text/ms-rtf, text/enriched,
base64, uuencoding, and the like.
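
The point about letting the headers decide can be illustrated with a small sketch. This is not ifile code and not the attached ifilepp script; it is a hypothetical Python example that assumes the standard-library email parser and a deliberately naive tag pattern. The names TAG_RE and preprocess are made up for illustration. Tags are removed only from parts declared as text/html, so angle-bracketed addresses and inequalities in text/plain survive.

```python
import re
from email import message_from_string

# Naive tag pattern, for illustration only; real HTML needs a real parser.
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(raw):
    """Return the message text, stripping tags only from text/html parts.

    Hypothetical helper: the Content-Type header of each part decides
    whether any stripping happens at all.
    """
    msg = message_from_string(raw)
    out = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        if part.get_content_type() == "text/html":
            text = TAG_RE.sub("", text)
        out.append(text)
    return "".join(out)
```

Applied blindly to text/plain, the same pattern would turn "x<y or y>z" into "xz", which is exactly the kind of damage described above.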

Perhaps a plugin architecture is in order -- ifile could parse the
message into sections, each section having a given content-type and
encoding, and then for each section see if there is a preprocessor
plugin installed for that encoding (if so use it) and content-type (if
so, use it) before proceeding.

By "plugin" here I don't mean necessarily a dynamic library; a call to
an external program could work if the interface were well-defined.
Frankly, the interface could be as simple as ifile passing the raw
data on standard input to the preprocessor and using its standard
output as the decoded/preprocessed content. That might be considered
inefficient, but it would work, and it would establish a low-bar entry
level for people writing preprocessor plugins, and the performance hit
would only be taken when the preprocessors were being used,
presumably. What preprocessor command (if any) to use for various
encodings and types of content could just be specified in the ifile
configuration.

Am I making any sense?
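
A minimal sketch of the stdin/stdout plugin idea follows. ifile has no such configuration as far as this thread shows, so everything here is an assumption: the ENCODING_FILTERS and TYPE_FILTERS tables stand in for hypothetical configuration entries, and the "plugins" are tiny Python one-liners invoked as external commands so that the example is self-contained.

```python
import subprocess
import sys

# Hypothetical configuration: external command per transfer encoding.
ENCODING_FILTERS = {
    "base64": [sys.executable, "-c",
               "import sys, base64; "
               "sys.stdout.buffer.write("
               "base64.b64decode(sys.stdin.buffer.read()))"],
}

# Hypothetical configuration: external command per content type.
TYPE_FILTERS = {
    "text/html": [sys.executable, "-c",
                  "import sys, re; "
                  "sys.stdout.buffer.write("
                  "re.sub(rb'<[^>]+>', b'', sys.stdin.buffer.read()))"],
}


def run_filter(cmd, data):
    """Pipe raw bytes to the command's stdin and return its stdout."""
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE,
                            check=True)
    return result.stdout


def preprocess_section(content_type, encoding, data):
    """Apply the encoding filter first, then the content-type filter.

    A section with no configured filter passes through unchanged, so the
    cost of spawning a process is paid only when a plugin is in use.
    """
    for table, key in ((ENCODING_FILTERS, encoding),
                       (TYPE_FILTERS, content_type)):
        cmd = table.get(key)
        if cmd:
            data = run_filter(cmd, data)
    return data
```

For example, preprocess_section("text/html", "base64", b"PGI+aGk8L2I+") first decodes the base64 payload and then strips the tags, while a 7bit text/plain section never leaves the calling process.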

From MAILER-DAEMON Sun Jul 06 22:51:34 2003
Received: from list by monty-python.gnu.org with archive (Exim 4.20)
    id 19ZM6H-0004o2-LW
    for mharc-ifile-discuss@gnu.org; Sun, 06 Jul 2003 22:51:33 -0400
Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.20)
    id 19ZM6E-0004gZ-8p
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:51:30 -0400
Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.20)
    id 19ZM63-0004Hm-SA
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:51:21 -0400
Received: from bushong.net ([216.36.66.245])
    by monty-python.gnu.org with esmtp (Exim 4.20)
    id 19ZM5o-0003Yv-0K
    for ifile-discuss@nongnu.org; Sun, 06 Jul 2003 22:51:04 -0400
Received: from firebat.davedawn.net (dbushong@localhost [127.0.0.1])
    by bushong.net (8.12.6p2/8.12.7) with ESMTP id h672p2if016357
    for ; Sun, 6 Jul 2003 19:51:02 -0700 (PDT)
    (envelope-from dbushong@firebat.davedawn.net)
Received: (from dbushong@localhost)
    by firebat.davedawn.net (8.12.6p2/8.12.7/Submit) id h672p1B8016352
    for ifile-discuss@nongnu.org; Sun, 6 Jul 2003 19:51:02 -0700 (PDT)
Date: Sun, 6 Jul 2003 19:51:01 -0700
From: David Bushong
To: ifile-discuss@nongnu.org
Subject: Re: [Ifile-discuss] Re: html tag stripping
Message-ID: <20030707025101.GW28111@bushong.net>
References: <20030625184805.GA23935@bushong.net>
    <20030625210856.GA28111@bushong.net>
    <20030626170129.GD28111@bushong.net>
    <65mfjcux.fsf@jonadab.homeip.net>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="3lcZGd9BuhuYXNfi"
Content-Disposition: inline
In-Reply-To: <65mfjcux.fsf@jonadab.homeip.net>
User-Agent: Mutt/1.4i
X-PGP-Key: http://bushong.net/dave/text/gpg-key.asc
X-BeenThere: ifile-discuss@nongnu.org
X-Mailman-Version: 2.1b5
Precedence: list
List-Id:
List-Help:
List-Post:
List-Subscribe: ,
List-Archive:
List-Unsubscribe: ,
X-List-Received-Date: Mon, 07 Jul 2003 02:51:31 -0000

--3lcZGd9BuhuYXNfi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Looks good to me!

Until such a beast arrives, I'm trying out a quick perl preprocessor I
wrote using some ideas suggested by various people. I'm not actually
sure (yet) how much better results I'm getting; I'll probably go back
and run it on my old spam to see how much it improves/degrades
accuracy, but just in case you'd like to give it a try, I've attached
it.

Usage is just basically :

  cat message | ifilepp | ifile -this -that -other

Don't bother reporting bugs; I'm still hacking.

--David Bushong

On Sun, Jul 06, 2003 at 10:21:26PM -0400, Jonadab the Unsightly One wrote:
>
> ...
>
> Perhaps a plugin architecture is in order -- ifile could parse the
> message into sections, each section having a given content-type and
> encoding, and then for each section see if there is a preprocessor
> plugin installed for that encoding (if so use it) and content-type (if
> so, use it) before proceeding.
>
> By "plugin" here I don't mean necessarily a dynamic library; a call to
> an external program could work if the interface were well-defined.
> Frankly, the interface could be as simple as ifile passing the raw
> data on standard input to the preprocessor and using its standard
> output as the decoded/preprocessed content. That might be considered
> inefficient, but it would work, and it would establish a low-bar entry
> level for people writing preprocessor plugins, and the performance hit
> would only be taken when the preprocessors were being used,
> presumably. What preprocessor command (if any) to use for various
> encodings and types of content could just be specified in the ifile
> configuration.
>
> Am I making any sense?
>
>
>
> _______________________________________________
> Ifile-discuss mailing list
> Ifile-discuss@nongnu.org
> http://mail.nongnu.org/mailman/listinfo/ifile-discuss

--3lcZGd9BuhuYXNfi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=ifilepp

#!/usr/bin/perl -w
#
# ifilepp -- ifile preprocessor: ifilepp -h for usage
#

use strict;
use Getopt::Std;
use vars qw($defmax);

## variable declaration
my (%opt, $bytes, $maxBytes, %validTag, $tagBuf, $lineBuf, $base64Printed);

## configuration
$defmax = 4096;
for (qw(
  html head body table td tr tbody script form input frame p ul ol dl b i a li
  center title img meta div span select option optgroup
)) {
  $validTag{uc()}++;
  $validTag{lc()}++;
  $validTag{'/' . uc()}++;
  $validTag{'/' . lc()}++;
}

## startup and options
$0 =~ s,.*/,,;
getopts('hHBMNm:', \%opt);
usage() if $opt{h};

## initialization
$maxBytes      = $opt{m} || $defmax;
$tagBuf        = '';
$lineBuf       = '';
$bytes         = 0;
$base64Printed = 0;

## main loop
while (<>) {
  ## skip header
  if (1 .. /^\r?$/) {
    print;
    next;
  }

  ## max size
  if (($bytes += length()) > $maxBytes) {
    print($tagBuf, $lineBuf, <>);
    exit;
  }

  ## skip base64 data
  unless ($opt{M}) {
    if (/^Content-Transfer-Encoding:\s+base64/i../^--/) {
      print "BaseSixtyFour\n" unless $base64Printed++;
      next;
    }
    $base64Printed = 0;
  }

  ## join ='ed lines
  if (/=\r?$/) {
    $lineBuf .= $`;
    next;
  } elsif (length($lineBuf)) {
    $_ = $lineBuf . $_;
    $lineBuf = '';
  }

  ## Limited Entity Substitution
  s/&nbsp;?/ /gi;

  ## BadHTML
  s/(\S*)<([^\s>]*)>(\S*)/$validTag{$2} ? $& : " $1$3 BadHTML "/ge
    unless $opt{B};

  ## NonEnglish
  s/\b\w*[\x80-\xff]+\w*\b/ NonEnglish /g unless $opt{N};

  ## HTML tags
  unless ($opt{H}) {
    ## complete HTML tags
    s/<[^>]+>//g;

    ## close HTML tags
    if (/^[^<]+>/ && length($tagBuf)) {
      $tagBuf = '';
      $_ = $';
    }

    ## open HTML tags
    if (/<[^>]+$/) {
      if (length($tagBuf)) {
        print $tagBuf;
        $tagBuf = '';
      }
      $tagBuf = $&;
      $_ = $`;
    } elsif (length($tagBuf)) {
      $tagBuf .= $_;
      next;
    }
  }

  print;
}

## subroutines
sub usage {
  die <