lynx-dev

Re: LYNX-DEV Re: new Lynx SGML.c parser


From: Christopher R. Maden
Subject: Re: LYNX-DEV Re: new Lynx SGML.c parser
Date: Fri, 25 Apr 1997 18:11:15 GMT

[Klaus Weide]
> Well I was thinking of you when I started this "new parser"[*]
> project.  I remember you made the claim that a structured parser
> with error recovery heuristics could improve handling of invalid
> markup (or similar wording; I hope I didn't get your meaning too
> wrong).

Hmm - did I say that?  Well, it depends on the class of error.
Keeping a tree structure can help you notice when tags are
mismatched and tell you which elements probably need to be closed.
This will not match Netscape's behavior, though.  Netscape keeps a
couple of stacks: a start-tag for an element of a given class (like
the font-changing class) pushes formatting onto that class's stack,
and any end-tag for that class pops it.
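
A rough sketch of the difference between the two approaches above, in
C.  Everything here (tag_stack, push_tag, close_by_tree, pop_any) is
made up for illustration; it is neither Lynx's nor Netscape's actual
code, just one way the two recovery rules could look.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64

/* A stack of open element names.  Used two ways below: as the path from
 * the root of the tree to the current node, and as a per-class stack. */
struct tag_stack {
    const char *names[MAX_DEPTH];
    int depth;
};

void push_tag(struct tag_stack *s, const char *name)
{
    assert(s->depth < MAX_DEPTH);
    s->names[s->depth++] = name;
}

/* Tree-style recovery: find the nearest open element with this name and
 * close it along with everything opened inside it.  Returns how many
 * elements were closed, or 0 for a stray end tag (which is ignored). */
int close_by_tree(struct tag_stack *s, const char *name)
{
    int i;
    for (i = s->depth - 1; i >= 0; i--) {
        if (strcmp(s->names[i], name) == 0) {
            int closed = s->depth - i;
            s->depth = i;
            return closed;
        }
    }
    return 0;
}

/* Class-stack behavior: any end tag of the class pops the most recent
 * entry, whatever its name.  Returns the popped name, or NULL if the
 * stack is empty. */
const char *pop_any(struct tag_stack *s)
{
    if (s->depth == 0)
        return NULL;
    return s->names[--s->depth];
}
```

Given <B><I>text</B>, close_by_tree() closes both I and B, while
pop_any() on the font-class stack pops only I and leaves B open.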

OTOH, real SGML parsing can be a limitation for the stuff on the Web;
our DynaBase Web Management system had real problems with <table
border> as SGML.  (If you want to know why this sucks from an SGML
point of view, ask me off-line.)  We had to add special cases for some
HTML crap introduced by certain vendors.

> So there is now some way to test that claim...  This of course is
> not doing real SGML parsing, just trying to resemble it a bit
> better.  (Not that I really understand all the things a real SGML
> parser is supposed to do...)

No one does. d-:  That's one of the reasons for XML - it *should* be
possible to understand everything an XML parser is supposed to do.

> [*] It is also not really a "new" parser, just the old one, with
> some exceptions taken out, and some (crude) heuristics and some more
> per-element information added in.  All changes only refer to the
> content models and nesting aspects.  The added "DTD" information is
> hardwired and looks like this:
> 
> #define T_ABBREV        0x0002,0x8B04F,0x8FFFF,0xA778F,0xF7FBF,0x00003,0x00000
> 
> which is rather unreadable but fits the info on one line per element
> :) and there's still some unused bits left in that...

This is (probably) a much better HTML parser, but it's still wired to
one DTD.
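
The message doesn't say what the individual fields of T_ABBREV mean,
so the sketch below does not try to decode them.  It only illustrates
the general idea of hardwiring content-model information as bitmasks,
one table row per element.  The class bits, the elem_info fields, and
the three sample elements are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical content classes, one bit each. */
#define CL_FONT   0x01u
#define CL_PHRASE 0x02u
#define CL_BLOCK  0x04u

struct elem_info {
    const char *name;
    unsigned own_class;   /* classes this element belongs to */
    unsigned may_contain; /* classes allowed as direct children */
};

/* One row per element; this table plays the role of the "DTD". */
const struct elem_info elems[] = {
    { "B",  CL_FONT,   CL_FONT | CL_PHRASE },
    { "EM", CL_PHRASE, CL_FONT | CL_PHRASE },
    { "P",  CL_BLOCK,  CL_FONT | CL_PHRASE },
};

/* Look an element up by name; returns NULL if it is not in the table. */
const struct elem_info *find_elem(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof elems / sizeof elems[0]; i++) {
        if (strcmp(elems[i].name, name) == 0)
            return &elems[i];
    }
    return NULL;
}

/* Nonzero if child may appear directly inside parent. */
int can_contain(const struct elem_info *parent,
                const struct elem_info *child)
{
    return (parent->may_contain & child->own_class) != 0;
}
```

With this table, can_contain(P, B) is true and can_contain(B, P) is
false.  The table is also the limitation: supporting a different DTD
means recompiling with different rows.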

> Why not keep it on the list?  At least I would be interested to hear
> what you have in mind.

Sure.  What I had in mind may not work, but I was thinking of storing
the whole parsed document in memory (or putting part of it on disk as
virtual memory).  At first I pictured a made-up pointer structure of
our own, but I now think that the Document Object Model would be a
good way to do it.  This was talked about a lot at WWW6; the XML folks
and the WAI folks are very excited about it.  It provides a standard
interface to a parsed document, and together with XML that gives a lot
of power.  This would not replace Lynx's HTML parser, but would
provide a new internal MIME type for handling XML.
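
For the "made-up pointer structure" variant, a minimal sketch in C of
holding a parsed document in memory as a tree.  The struct node layout
and the node_* function names are assumptions for illustration only;
this is not the DOM interface and not anything that exists in Lynx.

```c
#include <stdlib.h>

struct node {
    const char *name;           /* element name; not owned by the node */
    struct node *parent;
    struct node *first_child;
    struct node *last_child;
    struct node *next_sibling;
};

/* Allocate a node with no parent or children; NULL if out of memory. */
struct node *node_new(const char *name)
{
    struct node *n = calloc(1, sizeof *n);
    if (n != NULL)
        n->name = name;
    return n;
}

/* Attach child as the last child of parent. */
void node_append(struct node *parent, struct node *child)
{
    child->parent = parent;
    if (parent->last_child != NULL)
        parent->last_child->next_sibling = child;
    else
        parent->first_child = child;
    parent->last_child = child;
}

/* Count n and all of its descendants. */
size_t node_count(const struct node *n)
{
    size_t total = 1;
    const struct node *c;
    for (c = n->first_child; c != NULL; c = c->next_sibling)
        total += node_count(c);
    return total;
}

/* Free n and its whole subtree. */
void node_free(struct node *n)
{
    struct node *c = n->first_child;
    while (c != NULL) {
        struct node *next = c->next_sibling;
        node_free(c);
        c = next;
    }
    free(n);
}
```

A parser would call node_new() on each start-tag and node_append() to
hang it under the current open element.  Links and styles would then
walk this tree instead of the raw markup.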

The XML-link interpreter would be built on top of (or partly sunk into
the top layer of) the DOM, and provide hyperlinks; when XML-style is
done, there would need to be a parser for the stylesheets, and then
the parsed styles would be used as a layer between the DOM and the
eventual output device.  The style interpreter wouldn't need to be
full XML-style; there are some things (like font size or family) that
wouldn't be applicable to Lynx.

-Chris
-- 
Christopher R. Maden                  One Richmond Square
DynaText SIT Technical Support        Providence, RI 02906 USA
Inso Corporation                      +1.401.421.9550 (voice)
Electronic Publishing Solutions       +1.401.521.2030 (facsimile)