ifile-discuss

Re: [Ifile-discuss] Re: html tag stripping


From: Jonadab the Unsightly One
Subject: Re: [Ifile-discuss] Re: html tag stripping
Date: 06 Jul 2003 22:21:26 -0400
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2.93

David Bushong <address@hidden> writes:

> It's not too hard to do with a preprocessor, but you want to do
> things like skip headers, not go too far into large MIME files, etc.

You don't just want to skip headers; I posit that you want to use
information from the headers to determine what (if any) preprocessing
to do.  For example, you almost certainly do NOT want to strip what
you think are HTML tags from text/plain content.  You'd end up
stripping out all sorts of things you didn't intend to:  email
addresses in angle brackets; words that are marked up in POD (B<bold>
words, I<italic> words such as, probably, the name of the module, and
so forth); pseudo-HTML intended to be read by humans as part of the
message; code snippets in discussions of XML data or similar items;
inequalities in pseudocode or math; text that falls between certain
types of smilies; and who knows what else.

Perhaps more significantly, you don't generally see a spammer sending
HTML/SGML/XML/XSLT/RDF/XUL/etc. as text/plain, because if they did the
user would see all the ugly, illegible markup, which normally isn't
what the spammer wants.

However, stripping or in some way processing tags from text/html
content might have significant merit.
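
A minimal sketch of that header-driven decision, written in Python
purely for illustration (it is not ifile code, and the function names
are made up for this example).  It assumes a single-part message and
strips tags only when the Content-Type header says text/html:

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string, without its tags."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def text_for_classifier(raw_message):
    """Hypothetical helper: body text, tag-stripped only for text/html.

    Assumes a non-multipart message whose payload is a plain string.
    """
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if msg.get_content_type() == "text/html":
        return strip_tags(body)
    return body  # text/plain and everything else is left untouched
```

With this, an address such as <joe@example.com> in a text/plain body
survives, while <b>...</b> in a text/html body is removed.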

This raises the question of whether you also want to plaintext-ise
other common non-plaintext mail content -- formats such as
text/ms-rtf and text/enriched, and encodings such as base64,
uuencoding, and the like.

Perhaps a plugin architecture is in order -- ifile could parse the
message into sections, each section having a given content-type and
encoding, and then for each section check whether a preprocessor
plugin is installed for that encoding (if so, use it) and for that
content-type (if so, use it) before proceeding.
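
The two-stage lookup could be sketched roughly as follows.  This is
Python for illustration only, not a proposal for ifile's actual
internals.  DECODERS, CONTENT_FILTERS and preprocess are hypothetical
names, and the regex is a deliberately crude stand-in for a real HTML
preprocessor:

```python
import base64
import quopri
import re
from email import message_from_string

# Hypothetical "plugins" keyed by transfer encoding: bytes in, bytes out.
DECODERS = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
}

# Hypothetical "plugins" keyed by content type: text in, text out.
CONTENT_FILTERS = {
    "text/html": lambda text: re.sub(r"<[^>]+>", "", text),  # crude stand-in
}


def preprocess(raw_message):
    """Return a list of (content_type, text), one entry per leaf section."""
    results = []
    for part in message_from_string(raw_message).walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        ctype = part.get_content_type()
        encoding = (part.get("Content-Transfer-Encoding") or "7bit").strip().lower()
        data = part.get_payload().encode("utf-8", errors="replace")

        decode = DECODERS.get(encoding)  # stage 1: by encoding
        if decode is not None:
            data = decode(data)
        # Assumes the declared charset, if any, is one Python knows.
        text = data.decode(part.get_content_charset() or "utf-8", errors="replace")

        content_filter = CONTENT_FILTERS.get(ctype)  # stage 2: by content type
        if content_filter is not None:
            text = content_filter(text)
        results.append((ctype, text))
    return results
```

A section with no matching entry in either table passes through
unchanged, so text/plain is never touched by the HTML filter.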

By "plugin" here I don't mean necessarily a dynamic library; a call to
an external program could work if the interface were well-defined.
Frankly, the interface could be as simple as ifile passing the raw
data on standard input to the preprocessor and using its standard
output as the decoded/preprocessed content.  That might be considered
inefficient, but it would work, it would set a low barrier to entry
for people writing preprocessor plugins, and the performance hit
would presumably only be taken when preprocessors were actually being
used.  What preprocessor command (if any) to use for various
encodings and types of content could just be specified in the ifile
configuration.
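
As a rough illustration of how small that interface is, here is a
Python sketch of the calling side.  FILTER_COMMANDS stands in for
whatever the ifile configuration would hold; the table, the
run_filter name, and the inline tag-stripping command are all
assumptions made up for this example, not existing ifile options:

```python
import subprocess
import sys

# Hypothetical configuration: content type -> argv of a filter command.
# The command here is a stand-in chosen so the sketch runs anywhere
# Python does; a real setup would name an installed program instead.
FILTER_COMMANDS = {
    "text/html": [
        sys.executable,
        "-c",
        "import re, sys; "
        "sys.stdout.write(re.sub(r'<[^>]+>', '', sys.stdin.read()))",
    ],
}


def run_filter(content_type, data):
    """Pipe data (bytes) through the configured command, if there is one."""
    argv = FILTER_COMMANDS.get(content_type)
    if argv is None:
        return data  # nothing configured: no extra process, no cost
    result = subprocess.run(
        argv,
        input=data,              # raw section on the filter's standard input
        stdout=subprocess.PIPE,  # preprocessed content on its standard output
        check=True,              # a failing filter raises CalledProcessError
    )
    return result.stdout
```

Any program that reads standard input and writes standard output
would fit this contract, whatever language it is written in.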

Am I making any sense?




