bug-wget

Re: [Bug-wget] Problem with ÅÄÖ and wget


From: Ángel González
Subject: Re: [Bug-wget] Problem with ÅÄÖ and wget
Date: Tue, 17 Sep 2013 00:17:21 +0200
User-agent: Thunderbird

On 16/09/13 12:50, Tim Ruehsen wrote:
> Just to have it mentioned:
> Your download (wget -r http://bmit.se/wget) succeeds, but it shouldn't!
> IMHO, Wget has a bug here, and your test case succeeds only because of
> this bug.
>
> Why?
> Your wget/index.html holds the UTF-8 encoded URL 'teståäöÅÄÖ', but neither the
> server header (Content-Type: text/html) nor the document itself (META
> http-equiv ...) defines the charset. That means the charset encoding of
> index.html should be taken as ISO-8859-1. See [1].
> Wget should have taken the URL 'teståäöÅÄÖ' as ISO-8859-1 and converted it
> into UTF-8, which would fail to download.
>
> Conclusion
> 1. Be prepared that Wget will change its behaviour sooner or later (make
> sure you specify / deliver the charset encoding of your documents).
> 2. Wget will/does have problems with ISO-8859-1 text/html pages if the charset
> is not specified AND special chars are used.
>
> Someone proving me wrong?

I think that in the past, if the document was in ISO-8859-1, it would have
been legal to give the server the URL *encoded in ISO-8859-1*, thus resulting
in the same %-encoded URL. However, RFC 3986 and RFC 3987 already specify that
URLs shall be in UTF-8.
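
To make the difference concrete, here is a minimal Python sketch (not Wget's actual code) that percent-encodes the file name from the test case under both encodings. It also shows what happens when UTF-8 bytes are misread as ISO-8859-1 and then converted to UTF-8, which is the failure Tim describes. The variable names are only illustrative.

```python
from urllib.parse import quote

name = "teståäöÅÄÖ"

# Percent-encoding of the UTF-8 bytes (what RFC 3987 expects for IRIs).
utf8_url = quote(name, encoding="utf-8")

# Percent-encoding of the ISO-8859-1 bytes (the "legacy" form).
latin1_url = quote(name, encoding="iso-8859-1")

# UTF-8 bytes wrongly interpreted as ISO-8859-1, then converted to UTF-8:
# every non-ASCII character ends up encoded twice.
misread = name.encode("utf-8").decode("iso-8859-1")
double_encoded = quote(misread, encoding="utf-8")

print(utf8_url)        # test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96
print(latin1_url)      # test%E5%E4%F6%C5%C4%D6
print(double_encoded)  # starts with test%C3%83%C2%A5...
```

The three strings name three different resources as far as the server is concerned, so only the one matching the bytes of the file name on disk will be found.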

> [1] http://nikitathespider.com/articles/EncodingDivination.html

Note that these steps are outdated now (that article was written in 2008 at
the latest).
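
For illustration only, the fallback order discussed in this thread (HTTP header first, then a META declaration, then ISO-8859-1 as the default) could be sketched in Python as below. This is a simplified assumption, not what Wget or any browser actually implements: the function names are hypothetical, the META detection is a rough regular expression, and real clients apply additional sniffing.

```python
import re
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # the legacy default discussed in this thread

# Rough pattern for <meta ... charset=...>; a real HTML parser is more careful.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def charset_from_meta(body):
    """Return the charset declared in a META tag near the top, or None."""
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def guess_charset(content_type, body):
    """Header wins, then META, then the ISO-8859-1 default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(body)
        or DEFAULT_CHARSET
    )
```

With the headers from the test case (`Content-Type: text/html` and no META declaration), `guess_charset` falls through to ISO-8859-1 even though the document bytes are really UTF-8.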


On 16/09/13 16:29, Tony Lewis wrote:
> Neither Firefox nor Internet Explorer can navigate that link. Both fail
> trying to retrieve teståäöÅÄÖ.

That's strange. I can browse it in Firefox 23. Perhaps its guessing is better.



