LYNX-DEV pre-announcing a new Lynx SGML.c parser

From: Klaus Weide
Subject: LYNX-DEV pre-announcing a new Lynx SGML.c parser
Date: Mon, 21 Apr 1997 07:54:31 -0600 (MDT)

Exciting news (well, you may disagree...),

 I have finished modifying the first stage "SGML" parsing in lynx
to be somewhat closer to a real SGML parser.  Essentially, I have
extended the per-tag information (given in HTMLDTD.c) to include
more of the content model info of a real DTD, and done away with the
special treatment of some tags in SGML.c.  end_element and also
start_element in SGML.c now do partial stack wind-downs, depending
on whether an element is "allowed" to close another one (and, in some
cases, whether the other element's end tag can be legally omitted).
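
 The wind-down idea can be sketched roughly as follows.  This is only a
minimal illustration, not the actual code: the TagInfo fields, the
Parser struct, the '+'/'-' event log and the tiny five-element table are
all hypothetical stand-ins, and the real per-tag data in HTMLDTD.c and
the real start_element/end_element in SGML.c look different.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature "DTD"; not the layout used in HTMLDTD.c. */
typedef enum { EL_BODY, EL_UL, EL_LI, EL_P, EL_B, EL_COUNT } Element;

#define BIT(e) (1u << (e))

typedef struct {
    const char *name;
    int end_optional;          /* end tag may legally be omitted */
    unsigned closed_by_start;  /* start tags that implicitly close this element */
} TagInfo;

static const TagInfo dtd[EL_COUNT] = {
    { "BODY", 1, 0 },
    { "UL",   0, 0 },
    { "LI",   1, BIT(EL_LI) },
    { "P",    1, BIT(EL_P) | BIT(EL_UL) | BIT(EL_LI) },
    { "B",    0, 0 },
};

#define MAX_DEPTH 32

typedef struct {
    Element stack[MAX_DEPTH];  /* currently open elements */
    int depth;
    char log[256];             /* events handed to the next stage */
} Parser;

void parser_init(Parser *p)
{
    memset(p, 0, sizeof *p);
}

/* Record a start ('+') or end ('-') event for the next stage. */
static void emit(Parser *p, char kind, Element e)
{
    size_t used = strlen(p->log);
    snprintf(p->log + used, sizeof p->log - used, "%c%s ", kind, dtd[e].name);
}

void start_element(Parser *p, Element e)
{
    /* Partial wind-down: pop every open element that this start tag
       is allowed to close, emitting an end event for each. */
    while (p->depth > 0 &&
           (dtd[p->stack[p->depth - 1]].closed_by_start & BIT(e)))
        emit(p, '-', p->stack[--p->depth]);

    if (p->depth < MAX_DEPTH) {
        p->stack[p->depth++] = e;
        emit(p, '+', e);
    }
}

void end_element(Parser *p, Element e)
{
    int i, target = -1;

    /* Search down the stack for the matching open element, but only
       look past elements whose end tag may be omitted. */
    for (i = p->depth - 1; i >= 0; i--) {
        if (p->stack[i] == e) { target = i; break; }
        if (!dtd[p->stack[i]].end_optional) break;
    }
    if (target < 0)
        return;  /* stray end tag: simply ignored in this sketch */

    while (p->depth > target)
        emit(p, '-', p->stack[--p->depth]);
}
```

 With this table, feeding <UL><LI><P><LI></UL> produces the event log
"+UL +LI +P -P -LI +LI -LI -UL ", so every start event is matched by a
correctly nested end event even though no </P> or </LI> appeared in the
input.  A </P> arriving while a B is still open is ignored here, since
B's end tag is not omissible in the table; what to do in such cases is
exactly the kind of recovery heuristic that would need tuning.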

 This delivers to the next stage (HTML.c) a series of
HTML_{start,end}_element which are always correctly ordered, for all
elements which are not declared as SGML_EMPTY, and I have removed the 
SGML_EMPTY flags from a number of tags that were specially treated
before, including P and (recently) FORM.  Note that I haven't made
*any* changes to HTML.c to accommodate the changes in SGML.c and
HTMLDTD.c.  It works with the unchanged HTML.c, which is great and
shows that these modules have remained reasonably independent of
each other; however, it does not always give identical results
(screen appearance) even for valid HTML, which shows that sometimes
HTML.c is relying on specific hacks for specific elements in SGML.c
and the old "DTD" (for example, declaring P as SGML_EMPTY *and*
converting </P> to <P>).

 I would like to have HTML.c in a form where it could deal equally well
with being called from the modified SGML.c parser and from the
old-style parser (with a, possibly increasing, number of hacks).  This
would allow testing of recovery heuristics with the new parser and
comparison with the old way, without each time having to modify HTML.c.

 Fote, I would appreciate your help here :).  It would help if you at
least did not make changes to HTML.c that depend on new hacks introduced
in SGML.c and the HTMLDTD.  (I am not saying that you *did* make such
changes recently; this is just a just-in-case request.  I still haven't
checked whether the recent me->inUnderline changes fall in this
category.  Your clarification sounded a bit like it, but I am not sure,
so I will have to check the code.)

 (Also, I know and accept that you don't want to be considered an
"active developer" at this point.  However, as long as you are often
the first to make required and/or useful changes, and make them 
available, you'll have to accept that your mods continue to be at least
an important source of input for our development code :).  Given that,
my request above could help cut down on my [not your] time.) 

 I will make the code available as a more-experimental-than-usual update
to the devel code, as soon as I have considered some other misc. unrelated
changes.  Still without adapting HTML.c, 'cause I want this to get out
the door now, and would like people to test it... The first goal then is
to reproduce Lynx's current behavior (as far as it is correct :) ) for
valid HTML; tweaking the recovery heuristics for invalid HTML will come
later.  I am not sure whether there is any screwed-up HTML out there where
my approach *already* gives better results, or whether it finally can be
made sophisticated enough to generally improve treatment of bad HTML (over
that already done by Fote's latest hacks).  Maybe a combination of 
approaches will finally give best results.


