lynx-dev LYNX: just noticed this perl html-parser notice

From: David Combs
Subject: lynx-dev LYNX: just noticed this perl html-parser notice
Date: Sat, 22 Jan 2000 13:55:12 -0800 (PST)

> From address@hidden Sat Jan 22 13:52:54 PST 2000
> Article: 1346 of comp.lang.perl.announce
> From: Gisle Aas <address@hidden>
> Newsgroups: comp.lang.perl.announce,comp.lang.perl.modules
> Subject: HTML-Parser-3.01
> Date: 24 Dec 1999 17:50:35 GMT
> Organization: Aas Software
> Lines: 52
> Approved: address@hidden (comp.lang.perl.announce)
> Message-ID: <address@hidden>
> NNTP-Posting-Host:
> X-Disclaimer: The "Approved" header verifies header information for article 
> transmission and does not imply approval of content.
> Xref: mindspring comp.lang.perl.announce:1346 comp.lang.perl.modules:30435

HTML-Parser-3.01 is now available on CPAN.  HTML-Parser-3 is a
complete rewrite of the HTML::Parser core in C with XS bindings.  The
new parser is significantly faster and has a few new features.  The
speedup when compared to HTML-Parser-2.25 is between 3x and 50x
depending on what you are doing.

The new parser interface is completely compatible with
HTML-Parser-2.25, but some parts of an HTML document are
recognized differently:

  - Anything inside <script> and <style> is returned as cdata text.
    HTML-Parser-2.25 recognizes markup within these sections.
    The same is true for the deprecated <xmp> and <plaintext> tags.

  - Nearly any characters are allowed in tag and attribute names.
    Previously, strange characters in names caused tags to be
    parsed as text.  This behaviour can be overridden to enforce strict
    tag and attribute naming.

  - Processing instructions (<?...> or <?...?>) are reported via the
    process event handler.
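
The first and third differences above can be seen with a short script.
This is only a sketch: it uses the api_version => 3 handler interface
as documented for later HTML-Parser-3.x releases, so details may differ
in 3.01. The @events array and the sample markup are made up for
illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;    # hypothetical log of [event name, data] pairs

# Each handler is a [callback, argspec] pair; the argspec string selects
# which arguments the callback receives.
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @events, [ start   => $_[0] ] }, "tagname" ],
    text_h      => [ sub { push @events, [ text    => $_[0] ] }, "text" ],
    process_h   => [ sub { push @events, [ process => $_[0] ] }, "text" ],
);

# Sample input: a processing instruction, then a <script> element whose
# content contains something that looks like markup.
$p->parse('<?php echo 1 ?><script>if (a<b) document.write("<b>x</b>");</script>');
$p->eof;

printf "%-8s %s\n", @$_ for @events;
```

With a 3.x parser the <b> inside the script body should show up as part
of a text event, not as a start event of its own.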

New features include:

  - Direct callbacks to avoid Perl's slower method calls.

  - Array storage of element information to avoid callbacks completely.

  - The arguments passed to callbacks or arrays are separately
    selectable for each element type.
    This allows more flexibility and faster argument preparation.
    It also allows more argument types to be added later without
    interfering with existing programs.

  - The byte positions of tokens within an element can be reported.
    This allows direct editing of the token with substr() instead of
    having to guess where the token is located.

  - Callbacks can abort parsing.

  - Marked sections.

  - XML mode.

  - Working examples are provided to demonstrate the new features.
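
As a rough illustration of several of the features listed above (array
storage, per-event argument selection, token positions and aborting
from a callback), here is a minimal sketch. It assumes the handler(),
argspec and eof() interface documented for later HTML-Parser-3.x
releases; 3.01 may differ in detail. The sample markup and variable
names are invented for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $html = '<a href="a.html">one</a><hr><a href="b.html">two</a>';

my $p = HTML::Parser->new(api_version => 3);

# Array storage: for every start tag an array ref holding the selected
# arguments is pushed onto @starts, with no Perl callback involved.
my @starts;
$p->handler(start => \@starts, "tagname, text, tokenpos");

# Direct callback with its own argument list; calling eof() from inside
# a handler aborts the parse.
$p->handler(end => sub { $_[0]->eof if $_[1] eq "a" }, "self, tagname");

$p->parse($html);    # stops at the first </a>

# tokenpos holds (offset, length) pairs relative to the tag's text:
# the tag name first, then each attribute name and value in order.
my ($tagname, $text, $pos) = @{ $starts[0] };
my ($off, $len) = @{$pos}[ 4, 5 ];    # first attribute value, quotes included
substr($text, $off, $len) = '"index.html"';

print scalar(@starts), " start tag(s) seen before the abort\n";
print "$text\n";
```

Because the end handler stops the parse at the first </a>, the <hr>
and the second link should never reach @starts.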


Michael A. Chase and Gisle Aas
