lynx-dev

Re: lynx-dev strict SortaSGML rules for <PRE>...</PRE> content


From: Klaus Weide
Subject: Re: lynx-dev strict SortaSGML rules for <PRE>...</PRE> content
Date: Fri, 6 Aug 1999 22:36:45 -0500 (CDT)

On Thu, 5 Aug 1999, Vlad Harchev wrote:
> On Fri, 6 Aug 1999, Klaus Weide wrote:
> > On Fri, 6 Aug 1999, Leonid Pauzner wrote:
> > 
> > > HTML 4.0 says the following (and <Hx> is not listed in the exclusions):
> > > ========
> > > 
> > > 
> > >   9.3.4 Preformatted text: The PRE element
> > > 
> > > <!ENTITY % pre.exclusion 
> > > "IMG|OBJECT|APPLET|BIG|SMALL|SUB|SUP|FONT|BASEFONT">
> > > 
> > > <!ELEMENT PRE - - (%inline;)* -(%pre.exclusion;) -- preformatted text -->
> >                     \---------/ \----------------/         
> >                          |              |
> >                          |            These are _additionally_ forbidden,
> >  Everything that matches %inline;     and the exclusion applies at any
> >  is allowed as a direct child.        level (not just for direct children).
> >  Look for
> >  <!ENTITY % inline .....>
> >  somewhere.  Since there is no
> >  other allowed content specified,
> >  everything that does not match 
> >  %inline; is not allowed.
> > 
> > H2 is forbidden because it is not covered by %inline;.
> 
>  As for me, I'd like to have a switch that modifies the contents of
> tags[HTML_PRE] to allow H[1-6] inside <PRE> in SortaSGML mode. (The
> modification should take place in HTSwitchDTD and depend on a variable
> newly added to LYMain.c.) Some docs (on several Russian sites) are
> converted to HTML from plain text with a sed script; they enclose all
> text in <PRE> but use H[1-6] to mark sections and paragraphs. When viewed
> in SortaSGML mode, the <PRE> </PRE> that surround everything are ignored,
> the original formatting is lost, and the document becomes completely
> unreadable.
> 
>  Here is a tiny patch to do this (this is tested and works fine, and don't
> think that I'm hiding my patches - I wrote this patch after reading this
> message). I'm very confused by the  need of adding  another commandline option
> and lynx.cfg setting (but this should be done)  - Leonid, can
> you extend the patch to "production level" (add lynx.cfg setting, commandline
> option, few lines to lynx.cfg with comments, document it in lynx.man)?
> 
>  This functionality depends on the newly added variable
> 'allow_headers_in_pre'. This code should be added to HTMLDTD.c:HTSwitchDTD
> after the array is copied.
> 
>     /* Tgc_Plike is the class of P-like block elements, which includes H1-H6 */
>     if (allow_headers_in_pre) {
>         tags[HTML_PRE].contains |= Tgc_Plike;
>         tags[HTML_PRE].icontains |= Tgc_Plike;
>     }

I think adding options for this kind of micro-change to parsing details
is a bad idea.  Where does it lead, and where will it end?
Today you note that there are "several Russian sites" that are broken,
which happen to look better with one little change.  So you add a flag
to make that change optional.  In two weeks somebody else notices that
some (or just one) of his/her favorite sites "benefit" from another
little change, and adds a flag for that.  And so on.  Soon we'll have
twenty or fifty or a hundred new options
  -allow_headers_in_pre
  -allow_pre_in_table
  -allow_foo_in_blah
  -allow_anchors_to_span_table_cells
  -force_empty_font_tag
and so on.  Or maybe we have only five or ten.  How many, and which,
will depend on what kinds of pages some folks close to the lynx
development process and able to create patches happen to read.

The proper way to achieve this kind of detailed configurability of
parsing would be to make lynx use (and parse) a real DTD, which could
then be taken from a user-specified file.  Nobody is thinking of doing
that, afaik.  But if detailed control as above is really desired, this
should be the way to go.

Besides, your -allow_headers_in_pre as suggested would not just affect
headers, but also other block elements that happen to be in the class
of P-like elements (Tgc_Plike), including P.  Not only is the option
name misleading, but things like "which elements are treated like P"
should remain internal details of the "SortaSGML" "DTD" (which isn't
anything like a real DTD, only "sort of" like it).  These matters
don't deserve to be carved in stone by introducing specific options.
Once specific options like -allow_headers_in_pre exist, there isn't
much freedom left - if compatibility between versions is of any
importance - to change the "DTD" (perhaps to make it just a bit less
"sorta", or bring it more in line with HTML 4.0, or just fiddle with
some bits, or really change from using a "DTD" to using a DTD).
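
To make the point about the misleading option name concrete, here is a
minimal sketch of how a class-based 'contains' mask behaves.  It is not
Lynx code: the names CLS_INLINE and CLS_PLIKE, the bit values, and the
helper functions are all made up for illustration, with CLS_PLIKE
standing in for Tgc_Plike.  The only thing it is meant to show is that
OR-ing a class bit into PRE's mask admits every element of that class.

```c
#include <stdbool.h>

/* Hypothetical element classes; names and bit values are illustrative
 * only and are not taken from the Lynx sources. */
enum tag_class {
    CLS_INLINE = 1 << 0,    /* e.g. EM */
    CLS_PLIKE  = 1 << 1     /* stands in for Tgc_Plike: Hn, P, ... */
};

struct elem {
    const char *name;
    unsigned    cls;        /* class this element belongs to */
};

static const struct elem H2 = { "H2", CLS_PLIKE };
static const struct elem P  = { "P",  CLS_PLIKE };
static const struct elem EM = { "EM", CLS_INLINE };

/* True if a parent whose mask is `contains` accepts `child`. */
static bool accepts(unsigned contains, const struct elem *child)
{
    return (contains & child->cls) != 0;
}

/* PRE's mask without and with the proposed switch. */
static unsigned pre_contains(bool allow_headers_in_pre)
{
    unsigned mask = CLS_INLINE;
    if (allow_headers_in_pre)
        mask |= CLS_PLIKE;  /* admits the whole class, not just Hn */
    return mask;
}
```

With the switch on, accepts(pre_contains(true), &P) is just as true as
the same call for &H2, even though the option only talks about headers.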

Given that no plans for a clean mechanism for detailed parsing
modifications exist, and that the proposed approach would lead
straight into a big mess, the best we can do is to tell folks to use
"TagSoup" mode for some kinds of broken sites, or try to find a better
fixed set of bits to use in the "SortaSGML" "DTD" without letting it
degenerate into a copy of "TagSoup", or both.  Don't be afraid to
change the bits of the "DTD" if you determine that it would improve
things for some broken sites without breaking other things.  Don't
burden the user with having to make *every* little decision because
you (we) could not decide. :)

As a more specific criticism, I feel that changing the 'contains' and
'icontains' as you suggest - whether dynamically at runtime or fixed -
leads straight back to "TagSoup".  It is changing the info about tags
to not express their "real" content model, but something made up, in
order to achieve some specific error recovery behavior.  (More or less
the same as saying an element is empty when it isn't.  I invented
"SortaSGML" parsing to get rid of some of this [and therefore get rid
of the need to handle mis-ordered tags at the HTML.c level].)

Instead I would strongly prefer - as long as we are just thinking about
the "DTD" structures such as they are and not a fundamental redesign -
that 'contains', 'icontains', 'contained', and 'icontained' continue[*]
to reflect the "real", official DTD, and that changes in behavior are
done by changing only the 'behavioral' fields ('canclose', 'flags').

Coming back to the Hn vs. PRE case, the effect desired by Vlad for
<Hn> tags can be achieved by changing a bit in 'canclose' in each of the
T_Hn macros.  That will also affect <Hn>'s interaction with some other
tags, though, which may not be desirable (while Vlad's approach affects
<PRE>'s interaction with some tags other than <Hn>, which may also not
be desirable).
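
As a rough illustration of the split between descriptive and behavioral
fields, here is a simplified sketch.  It is an assumption-laden model,
not the actual parser: the struct layout, the class names, the three
actions and on_start_tag() are all hypothetical, and only the field
names 'contains' and 'canclose' are borrowed from the discussion above.

```c
#include <stdbool.h>

/* Hypothetical classes; not the real Tgc_* values. */
enum { CLS_INLINE = 1 << 0, CLS_PLIKE = 1 << 1, CLS_PRELIKE = 1 << 2 };

struct tag_info {
    const char *name;
    unsigned    cls;       /* class of this element */
    unsigned    contains;  /* descriptive: classes allowed as children */
    unsigned    canclose;  /* behavioral: classes this tag may close */
};

enum action { NEST, CLOSE_PARENT, KEEP_PARENT_OPEN };

/* Decide what to do when `child` starts while `parent` is open. */
static enum action on_start_tag(const struct tag_info *parent,
                                const struct tag_info *child)
{
    if (parent->contains & child->cls)
        return NEST;              /* valid per the content model */
    if (child->canclose & parent->cls)
        return CLOSE_PARENT;      /* recover by ending the parent */
    return KEEP_PARENT_OPEN;      /* tolerate the violation */
}

/* PRE's content model stays as the DTD has it: inline content only. */
static const struct tag_info PRE = { "PRE", CLS_PRELIKE, CLS_INLINE, 0 };

/* Strict variant: H2 may implicitly close an open PRE. */
static const struct tag_info H2_STRICT =
    { "H2", CLS_PLIKE, CLS_INLINE, CLS_PLIKE | CLS_PRELIKE };

/* Lenient variant: only the canclose bit for PRE-like tags is cleared. */
static const struct tag_info H2_LENIENT =
    { "H2", CLS_PLIKE, CLS_INLINE, CLS_PLIKE };
```

In this model the lenient H2 leaves PRE open while PRE's 'contains'
still says "inline only".  The side effect mentioned above shows up as
well: clearing that bit changes how H2 treats every element in the
PRE-like class, not just PRE.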

[*] No, the fields don't really reflect the official DTD very well
    currently.  First, they can only approximate it anyway, since
    the 'language' used can only express some aspects of a real DTD,
    and very summarily.  It's only "sorta".  Second, the classification
    of elements doesn't reflect any official HTML version very well.
    Third, there are lots of elements that don't even exist in, say,
    HTML 4.0.  Fourth, tags_new[] was all derived from the original
    (~= TagSoup) info, and info for some elements was left "wrong"
    because that's the only way they "work" (without code changes).
    Fifth, there are probably lots of errors in the details.
    Still, in principle 'contents' (the content type like SGML_MIXED
    etc.) and the four 'descriptive' fields listed above
    ({,i}contain{s,ed}) should try to describe the "real" model,
    and behavior tweaks should be expressed elsewhere; at least that
    was the original intention.  To put it yet another way, one
    set of fields says what is allowed; another set of fields says
    what to do (or whether to do anything) in case those rules are
    violated.


  Klaus

