Re: LYNX-DEV Re: new Lynx SGML.c parser
From: Christopher R. Maden
Subject: Re: LYNX-DEV Re: new Lynx SGML.c parser
Date: Fri, 25 Apr 1997 18:11:15 GMT

[Klaus Weide]
> Well I was thinking of you when I started this "new parser"[*]
> project. I remember you made the claim that a structured parser
> with error recovery heuristics could improve handling of invalid
> markup (or similar wording; I hope I didn't get your meaning too
> wrong).

Hmm - did I say that? Well, it depends on the class of error.
Keeping a tree structure can help you realize when tags are
mis-matched and tell you which ones probably need to be closed. This
will not match Netscape's behavior, which is to keep a couple of
stacks; start-tags for elements of a class (like the font-changing
class) will push formatting on the stack, and any end tag for that
class will pop the stack.
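
To make the contrast concrete, here is a minimal sketch in C of the two recovery strategies described above. It is not Lynx's or Netscape's actual code; the names `elem_stack`, `push_elem`, `close_by_name`, and `pop_any` are invented for illustration, and the fixed stack depth is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64   /* arbitrary limit, assumed for this sketch */

/* A stack of open element names (the names are not owned by the stack). */
struct elem_stack {
    const char *names[MAX_DEPTH];
    int depth;
};

static void push_elem(struct elem_stack *s, const char *name)
{
    assert(s->depth < MAX_DEPTH);
    s->names[s->depth++] = name;
}

/*
 * Structured recovery: look for the nearest open element with this name
 * and close it together with everything opened after it.
 * Returns the number of elements closed, or 0 if the end tag matches
 * nothing and is ignored.
 */
static int close_by_name(struct elem_stack *s, const char *name)
{
    int i;

    for (i = s->depth - 1; i >= 0; i--) {
        if (strcmp(s->names[i], name) == 0) {
            int closed = s->depth - i;
            s->depth = i;
            return closed;
        }
    }
    return 0;
}

/*
 * Class-stack behavior: the stack holds only elements of one class
 * (say, font changes), and any end tag of that class pops the top
 * entry, whatever its name. Returns the popped name, or NULL if empty.
 */
static const char *pop_any(struct elem_stack *s)
{
    if (s->depth == 0)
        return NULL;
    return s->names[--s->depth];
}
```

With `<b><i>` open, an incoming `</b>` closes both elements under `close_by_name`, while `pop_any` on a font-class stack removes only the `<i>` entry and leaves `<b>` in effect.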
OTOH, real SGML parsing can be a limitation for the stuff on the Web;
our DynaBase Web Management system had real problems with <table
border> as SGML. (If you want to know why this sucks from an SGML
point of view, ask me off-line.) We had to add special cases for some
HTML crap introduced by certain vendors.

> So there is now some way to test that claim... This of course is
> not doing real SGML parsing, just trying to resemble it a bit
> better. (Not that I really understand all the things a real SGML
> parser is supposed to do...)

No one does. d-: That's one of the reasons for XML - it *should* be
possible to understand everything an XML parser is supposed to do.

> [*] It is also not really a "new" parser, just the old one, with
> some exceptions taken out, and some (crude) heuristics and some more
> per-element information added in. All changes only refer to the
> content models and nesting aspects. The added "DTD" information is
> hardwired and looks like this:
>
> #define T_ABBREV 0x0002,0x8B04F,0x8FFFF,0xA778F,0xF7FBF,0x00003,0x00000
>
> which is rather unreadable but fits the info on one line per element
> :) and there's still some unused bits left in that...

This is (probably) a much better HTML parser, but it's still wired to
one DTD.
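
As an illustration of how per-element bit masks of this general kind can carry content-model information, here is a minimal sketch in C. The message does not say what the seven values in `T_ABBREV` mean, so nothing below reflects the real layout; the class bits, the `elem_info` fields, and the element table are all assumptions made up for this example.

```c
#include <assert.h>

/* Hypothetical class bits, one per group of elements. */
#define CLS_FONT     0x0001u
#define CLS_PHRASE   0x0002u
#define CLS_BLOCK    0x0004u
#define CLS_LISTITEM 0x0008u

/* Hypothetical per-element record: own class plus allowed child classes. */
struct elem_info {
    const char *name;
    unsigned    own_class;  /* the class this element belongs to */
    unsigned    contains;   /* classes allowed as direct children */
};

/* A tiny hardwired "DTD"; the masks are illustrative, not taken from HTML. */
static const struct elem_info ELEM_B  = { "B",  CLS_FONT,     CLS_FONT | CLS_PHRASE };
static const struct elem_info ELEM_EM = { "EM", CLS_PHRASE,   CLS_FONT | CLS_PHRASE };
static const struct elem_info ELEM_P  = { "P",  CLS_BLOCK,    CLS_FONT | CLS_PHRASE };
static const struct elem_info ELEM_UL = { "UL", CLS_BLOCK,    CLS_LISTITEM };
static const struct elem_info ELEM_LI = { "LI", CLS_LISTITEM,
                                          CLS_FONT | CLS_PHRASE | CLS_BLOCK };

/* Nonzero if child may appear directly inside parent: a single AND. */
static int may_contain(const struct elem_info *parent,
                       const struct elem_info *child)
{
    return (parent->contains & child->own_class) != 0;
}
```

A parser built this way can answer "does this start tag fit here?" with one mask test, but the table is compiled in, which is the sense in which such a parser stays wired to one DTD.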
> Why not keep it on the list? At least I would be interested to hear
> what you have in mind.

Sure. What I had in mind may not work, but I was thinking of storing
the whole parsed document in memory (or putting part of it on disk as
virtual memory). I was just thinking of our own made-up pointer
structure, but I think that the Document Object Model would be a good
way to do it. This was talked about a lot at WWW6; the XML folks and
the WAI folks are very excited about it. It provides a standard
interface to a parsed document; together with XML, it gives a lot of
power. This would not replace Lynx's HTML parser, but would provide a
new internal MIME type for handling XML.
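
For a sense of what a made-up pointer structure for a fully parsed document might look like, here is a minimal sketch in C. It is not the Document Object Model interface and not anything in Lynx; the `node` layout and the helpers `node_new`, `node_append`, `node_count`, and `node_free` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>

enum node_kind { NODE_ELEMENT, NODE_TEXT };

/* One node of the in-memory document tree. */
struct node {
    enum node_kind kind;
    const char    *value;        /* element name or text; not owned */
    struct node   *parent;
    struct node   *first_child;
    struct node   *last_child;
    struct node   *next_sibling;
};

/* Allocate a detached node; returns NULL if allocation fails. */
static struct node *node_new(enum node_kind kind, const char *value)
{
    struct node *n = calloc(1, sizeof *n);

    if (n != NULL) {
        n->kind = kind;
        n->value = value;
    }
    return n;
}

/* Attach a detached child as the last child of parent. */
static void node_append(struct node *parent, struct node *child)
{
    assert(parent != NULL && child != NULL && child->parent == NULL);
    child->parent = parent;
    if (parent->last_child != NULL)
        parent->last_child->next_sibling = child;
    else
        parent->first_child = child;
    parent->last_child = child;
}

/* Count this node and all of its descendants (depth-first). */
static int node_count(const struct node *n)
{
    const struct node *c;
    int total = 1;

    for (c = n->first_child; c != NULL; c = c->next_sibling)
        total += node_count(c);
    return total;
}

/* Free a node and its whole subtree. */
static void node_free(struct node *n)
{
    struct node *c = n->first_child;

    while (c != NULL) {
        struct node *next = c->next_sibling;
        node_free(c);
        c = next;
    }
    free(n);
}
```

Keeping parent and sibling pointers in every node costs memory, but it lets later layers (links, styles) walk the document in any direction without re-parsing.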
The XML-link interpreter would be built on top of (or partly sunk into
the top layer of) the DOM, and provide hyperlinks; when XML-style is
done, there would need to be a parser for the stylesheets, and then
the parsed styles would be used as a layer between the DOM and the
eventual output device. The style interpreter wouldn't need to be
full XML-style; there are some things (like font size or family) that
wouldn't be applicable to Lynx.

-Chris
--
Christopher R. Maden One Richmond Square
DynaText SIT Technical Support Providence, RI 02906 USA
Inso Corporation +1.401.421.9550 (voice)
Electronic Publishing Solutions +1.401.521.2030 (facsimile)