lynx-dev

Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)


From: Klaus Weide
Subject: Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
Date: Mon, 22 Mar 1999 18:21:24 -0600 (CST)

On Mon, 22 Mar 1999, Leonid Pauzner wrote:
[kw:]
> > [...]  The current implementation doesn't do the
> > to-UTF-8-then-to-hex in all cases (e.g. it cannot if input is CJK),
> > and it doesn't do the raw-hex in all cases (it cannot if the HREF
> > attribute contains &entity; for character > U+0100 from arbitrary
> iso-8859-1 is not the only charset ^^^^^^^^^^^^^^^^^
> > character set).

Ok then, to be more exact: [The current implementation cannot do the
raw-hex if] the HREF attribute contains an &entity; for a character that
we cannot translate to the desired charset.  (And which would that
desired charset be?  Presumably the document charset and not the display
character set.  But 'the current implementation', i.e.
LYUCFullyTranslateString and the way its input is prepared, isn't
currently set up at all to translate entities back _to_ the document
charset.)

> >                  So the new setting would either work unreliably (not
> > in all cases), or its explanation would have to have all sorts of
> > disclaimers, or the whole mechanism would have to be substantially
> > revised.
> 
> > 'Fixing' the behavior (in both senses of the word) also introduces
> > yet another aspect that could easily be broken when modifying the
> > way character translation works.  That could be an unwelcome
> > limitation for making otherwise useful changes.
> 
> > So I wouldn't want to add a new Option for this, or make the behavior
> > in any way more 'official'.  The benefits are limited, and arguably
> > such links are broken and _should_ be broken if it costs us too much
> > to attempt to 'fix' them.  But I may be biased, because I have never
> > encountered such URLs in reality.
> 
> Seems you overestimate the cost of such changes -

That may well be the case - and I don't want to suppress your enterprising
spirit too much. :) 

> we may add a flag and use a simple HTEscape() in HTML.c

But please not an HTEscape() on the whole string - that will certainly
do the wrong thing, for example if the string already contains '%'.
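
To illustrate the point about '%', here is a minimal sketch of escaping
only the raw 8-bit bytes and copying everything else through untouched,
so that an existing "%20" is not escaped a second time.  The function
name escape_8bit_only() and its buffer interface are assumptions made up
for this example; this is not HTEscape() and not existing Lynx code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Lynx: hex-escape only bytes >= 0x80.
 * All 7-bit bytes, including '%' of already-escaped sequences, are
 * copied unchanged.  Returns the number of bytes written (without the
 * terminating NUL), or (size_t)-1 if the output buffer is too small. */
size_t escape_8bit_only(const char *in, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return (size_t) -1;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char) *in;
        size_t need = (c >= 0x80) ? 3 : 1;

        /* Always keep one byte in reserve for the terminating NUL. */
        if (n + need + 1 > outsize)
            return (size_t) -1;
        if (c >= 0x80) {
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        } else {
            out[n++] = (char) c;
        }
    }
    out[n] = '\0';
    return n;
}
```

With this, an input containing the raw byte 0xE9 followed by "%20" keeps
the "%20" as it is and only turns the 0xE9 into "%E9".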

And please not in HTML.c - this logic should be in a separate place
(like, hmm, UCLYFullyTranslate() - whenever it is called via the
TRANSLATE_AND_UNESCAPE_TO_STD macro).

Actually, it seems to me that the only way to preserve the original
bytes (in hex-escaped form) in _all_ cases would be to protect them
from translation, i.e. hex-escape them, before SGML_character does
_any_ messing around with the character.  That means that SGML.c
would need knowledge about what kind of attribute this is (HREF-type
vs. ALT-type); currently all attributes look the same to SGML.c.
But overall such a change may be less bothersome than first (potentially)
translating a URL and then (potentially) translating back.
(SGML_character() should always have the last 'raw' character available
in c_in - probably.)
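
A rough sketch of that idea, under stated assumptions: the attr_kind
enum, translate_stub() and process_attr_value() are all hypothetical
names invented for this example, and the "translation" is a stand-in
that simply turns every 8-bit byte into '?'.  None of this is taken from
SGML.c; it only shows raw bytes of a URL-type attribute being
hex-escaped before any translation can touch them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute classes; SGML.c makes no such distinction today. */
enum attr_kind { ATTR_TEXT, ATTR_URL };

/* Stand-in for charset translation (assumption): maps every 8-bit byte
 * to '?', which is enough to show that the original byte would be lost. */
char translate_stub(unsigned char c)
{
    return (c >= 0x80) ? '?' : (char) c;
}

/* Process one attribute value byte by byte.  For URL-type attributes,
 * 8-bit bytes are hex-escaped from the raw input and never reach the
 * translation step; everything else goes through translate_stub().
 * Returns the output length, or (size_t)-1 if out is too small. */
size_t process_attr_value(enum attr_kind kind, const char *in,
                          char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return (size_t) -1;
    for (; *in != '\0'; in++) {
        unsigned char c_in = (unsigned char) *in;   /* the raw byte */
        int protect = (kind == ATTR_URL && c_in >= 0x80);
        size_t need = protect ? 3 : 1;

        /* Always keep one byte in reserve for the terminating NUL. */
        if (n + need + 1 > outsize)
            return (size_t) -1;
        if (protect) {
            out[n++] = '%';
            out[n++] = hex[c_in >> 4];
            out[n++] = hex[c_in & 0x0F];
        } else {
            out[n++] = translate_stub(c_in);
        }
    }
    out[n] = '\0';
    return n;
}
```

The same input byte thus survives as "%E9" in an HREF-type value, while
in an ALT-type value it is handed to the translation step as before.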

> instead of UCLYFullyTranslate() call
> (and I have not seen character entities in HREF= in the reality).

And I have not seen either that or raw characters in HREF= in reality -
can you offer an example of the latter?


   Klaus
