lynx-dev

Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)


From: Klaus Weide
Subject: Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
Date: Mon, 22 Mar 1999 10:03:02 -0600 (CST)

On Mon, 22 Mar 1999, Leonid Pauzner wrote:

> 21-Mar-99 12:38 Klaus Weide wrote:
> > On Sun, 21 Mar 1999, Leonid Pauzner wrote:
> 
> >> One certain "problem" I personally run into is a utf-8 URL encoding:
> >> when HREF= has *open 8-bit text* the remote server (script)
> >> may (1) expect such bytes %xx-encoded,
> >> but lynx now (2) translates URLs from the document charset to utf-8
> >> and then sends each byte %xx-encoded (an obvious check:
> >> the number of %xx-encoded bytes increases).
> 
> > But URLs should never *have* unencoded 8-bit chars - and lynx
> Right.
> > never generates such URLs as a result of form submission (I hope).
> Right (we generate %xx encoded bytes (1), including local file names)
> 
> HTML4.0 on syntax of anchor names:
> http://www.w3.org/TR/PR-html40/struct/links.html#h-12.2.1
> says:
> 
>    Anchor names should be restricted to ASCII characters. Please consult
>    the section on representing non-ASCII characters in URLs for more
>    information.
> 
> and that section is under
> http://www.w3.org/TR/PR-html40/appendix/notes.html#urls
> (below)
> 
> So both (1) and (2) should be considered as a recovery from a broken document.

Yes, definitely.
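
To make the two interpretations (1) and (2) concrete, here is a minimal
Python sketch. It is not Lynx code: the function names, the SAFE set and
the example URL are made up for illustration, and ISO-8859-1 is only
assumed as the document charset.

```python
from urllib.parse import quote

# Characters left unescaped in this sketch (an assumption, not Lynx's rule).
SAFE = "/:?#&=;@+$,%"

def escape_raw(href_bytes: bytes) -> str:
    """Interpretation (1): %xx-encode the 8-bit bytes exactly as they
    appear in the document."""
    return quote(href_bytes, safe=SAFE)

def escape_via_utf8(href_bytes: bytes, doc_charset: str) -> str:
    """Interpretation (2): translate from the document charset to UTF-8
    first, then %xx-encode each resulting byte."""
    text = href_bytes.decode(doc_charset)
    return quote(text.encode("utf-8"), safe=SAFE)

# An HREF containing one raw 8-bit byte (0xE9, e-acute in ISO-8859-1).
href = b"/cgi-bin/find?q=caf\xe9"
print(escape_raw(href))                    # one escaped byte
print(escape_via_utf8(href, "iso-8859-1")) # two escaped bytes
```

A script that expects the byte as written only matches the first form;
the second form is where the increased number of %xx bytes comes from.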

> We usually bypass the problem when Lynx processes both a broken #fragment
> link and a broken NAME= target (they get resolved in a consistent way),
> but the problem occurs when we deal with one end only
> (say, a link to a CGI script).
> 
> 
> 22-Mar-99 12:42 I wrote:
> > 21-Mar-99 20:37 Klaus Weide wrote:
> 
> >> This means that the user can usually toggle between the two interpretations
> >> with -raw / '@'.   It's not completely logical that the interpretation
> >> of URLs should depend on this.  OTOH there's the ease of switching, and
> >> it's more likely that encoding the raw value is the right thing (or even
> >> possible) when the user's environment is consistent with the server's.
> 
> > Completely wrong to overload -raw mode here (to ask user
> > to get the document unreadable in order to follow a link),

I see your point.

But those links are really broken.  Lynx tries to do _something_ with
them, which might or might not work.  It could instead refuse to accept
such links at all, instead of hiding the fact that it's basically just
guessing what the author meant.

> > it may be switchable like "dsoft-quotes" instead.
> 
> Now I think we may overload "dsoft-quotes" to toggle between
> two interpretations, the original meaning of this key is a work around
> the bug in HTML anchor which is very close to discussed problem.
> (One should decide which "interpretation" is "standard"
> and which is a workaround).

Overloading soft-dquotes doesn't make much sense to me - if soft-dquotes
is still a useful setting, a specific setting (ON or OFF) is sometimes
_needed_ to get the right error recovery.  This should be independent
of how 8-bit URLs are treated, or we end up having another form of
'ask the user to get the document unreadable' (in rare cases).

The straightforward way, from the user's perspective, would be to introduce
Yet Another Option for this, independent of other settings.
Once this becomes an independent option, a third value could be allowed:
'don't hex-escape at all', which was (mostly) the situation before the
CHANGES entries you cited (and had problems - that's why it was changed -
but also may sometimes be the only thing that works).

But making this a separate, 'official' setting implies a promise that
it works as advertised.  The current implementation doesn't do the
to-UTF-8-then-to-hex in all cases (e.g. it cannot if input is CJK),
and it doesn't do the raw-hex in all cases (it cannot if the HREF
attribute contains an &entity; for a character > U+0100 from an arbitrary
character set).  So the new setting would either work unreliably (not
in all cases), or its explanation would have to have all sorts of
disclaimers, or the whole mechanism would have to be substantially
revised.
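
The raw-hex limitation can be illustrated with a small Python sketch.
Again this is not Lynx code; the helper name and SAFE set are
hypothetical, and ISO-8859-1 is just an assumed document charset.  Once
an entity has been expanded to a character that the document charset
cannot represent, there are no "raw" bytes to escape:

```python
from typing import Optional
from urllib.parse import quote

# Characters left unescaped in this sketch (an assumption, not Lynx's rule).
SAFE = "/:?#&=;@+$,%"

def raw_hex_escape(href: str, doc_charset: str) -> Optional[str]:
    """%xx-encode href as bytes of doc_charset, or return None when some
    character has no representation in that charset."""
    try:
        raw = href.encode(doc_charset)
    except UnicodeEncodeError:
        # e.g. the HREF contained an entity for a character outside the charset
        return None
    return quote(raw, safe=SAFE)

print(raw_hex_escape("caf\u00e9", "iso-8859-1"))   # representable in the charset
print(raw_hex_escape("\u0142ab", "iso-8859-1"))    # U+0142 is not, so None
```

An 'official' option would have to say what happens in the None case,
which is exactly the kind of disclaimer meant above.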

'Fixing' the behavior (in both senses of the word) also introduces
yet another aspect that could easily be broken when modifying the
way character translation works.  That could be an unwelcome
limitation for making otherwise useful changes.

So I wouldn't want to add a new Option for this, or make the behavior
in any way more 'official'.  The benefits are limited, and arguably
such links are broken and _should_ be broken if it costs us too much
to attempt to 'fix' them.  But I may be biased, because I have never
encountered such URLs in reality.

According to the CHANGES you quoted (thanks for refreshing my memory),
2.7.2 chose to always do the UTF-8 thing.  That would be more consistent
and predictable (although it can't work for CJK etc.), but even less
what you want.  With our current code, you at least have a workaround
(undocumented, and probably 'making the document unreadable' as you
say) to get the link interpreted so that it works.


   Klaus
