bug-wget

Re: [Bug-wget] Problem with ÅÄÖ and wget


From: Ángel González
Subject: Re: [Bug-wget] Problem with ÅÄÖ and wget
Date: Thu, 03 Oct 2013 02:04:05 +0200
User-agent: Thunderbird

On 24/09/13 10:38, Tim Ruehsen wrote:
> Just for completeness: these guessing steps, called the "encoding sniffing
> algorithm", are described in 12.2.2.2.
> But only "In some cases, it might be impractical to unambiguously determine
> the encoding before parsing the document.".
Yes, it allows a parser to start with one encoding, then abort and switch to a
different one.
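
To illustrate the "start, then abort and change" idea, here is a rough Python sketch. It is not the spec's algorithm and not anything wget does today: the function name `decode_with_restart`, the simplified `<meta>` regex and the windows-1252 starting point are all assumptions made for the example. It decodes with a tentative encoding, looks for a declared charset in the first 1024 bytes, and re-decodes if the declaration disagrees.

```python
import re
from typing import Tuple

# Simplified pattern for <meta charset=...> and <meta http-equiv ... charset=...>.
# A real prescan is more careful; this is only an approximation.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def decode_with_restart(body: bytes, tentative: str = "windows-1252") -> Tuple[str, str]:
    """Decode with a tentative encoding, then restart if the page declares another."""
    text = body.decode(tentative, errors="replace")
    match = META_CHARSET.search(body[:1024])
    if match:
        declared = match.group(1).decode("ascii").lower()
        if declared != tentative:
            try:
                # Abort the first attempt and decode again from the start.
                return body.decode(declared, errors="replace"), declared
            except LookupError:
                pass  # unknown label: keep the tentative result
    return text, tentative

page = '<html><head><meta charset="utf-8"></head><a href="/åäö">x</a></html>'.encode("utf-8")
text, used = decode_with_restart(page)
print(used, "/åäö" in text)
```

Re-decoding the whole buffer is the simplest way to "change" encodings; a streaming parser would have to throw away its partial state instead.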

> I found this iso-8859-1 / windows-1252 issue mentioned on the Wikipedia
> 'windows-1252' page, but couldn't find it on the HTML Living Standard pages.
> Could you give me a pointer, please?
It's at the beginning of the HTML parsing section. It lists several encodings that a page may declare, together with the encoding you should actually use to parse each of them, and says this is a willful violation.


> What do you think, how can we address the iso / windows encoding issue (should
> we?)? As I understood, it is only valid for HTML5...
It's just a matter of comparing the declared input encoding against a well-known list and replacing it.
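
As a sketch of what I mean, in Python for brevity: the table below is only an illustrative subset of the overrides, and the names `ENCODING_OVERRIDES` and `effective_encoding` are made up for this example, not taken from wget or from the spec.

```python
# Illustrative subset of "declared label -> encoding to actually use".
# The full list lives in the spec; these few entries are just for the example.
ENCODING_OVERRIDES = {
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
    "iso-8859-9": "windows-1254",
    "gb2312": "gbk",
}

def effective_encoding(declared: str) -> str:
    """Return the encoding to parse with, given the label the page declared."""
    label = declared.strip().lower()
    return ENCODING_OVERRIDES.get(label, label)

# Byte 0x80 is a control character in ISO-8859-1 but the euro sign in
# windows-1252, so the replacement changes what the user sees.
raw = b"\x80"
print(raw.decode(effective_encoding("ISO-8859-1")))
```

Labels not in the table pass through unchanged (only lowercased and trimmed), so the lookup is safe to apply to every declared charset.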

> Is there a practical need for the sniffing algorithm?
If we want to deal with the "ÅÄÖ links" properly, we should do encoding detection.
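
A very crude form of detection, much simpler than the sniffing algorithm and shown only as an assumption of how one might start, is to try strict UTF-8 first and fall back to a legacy encoding. The name `guess_encoding` and the windows-1252 fallback are choices made for this sketch.

```python
def guess_encoding(body: bytes, fallback: str = "windows-1252") -> str:
    """Guess UTF-8 if the bytes decode strictly as UTF-8, else use the fallback."""
    try:
        body.decode("utf-8")  # strict: raises on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return fallback

# The same link text in two encodings decodes to the same string
# once the guess is applied.
for raw in ("ÅÄÖ".encode("utf-8"), "ÅÄÖ".encode("iso-8859-1")):
    enc = guess_encoding(raw)
    print(enc, raw.decode(enc))
```

This works for "ÅÄÖ" because the ISO-8859-1 bytes C5 C4 D6 are not a valid UTF-8 sequence. It cannot tell one legacy 8-bit encoding from another, so it is only a first step.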

> Do you know any real web sites / pages where the encoding is ambiguous?
I consider those web sites broken, but I don't have numbers.




