Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)

wget-dev

From:	Tim Rühsen
Subject:	Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)
Date:	Wed, 22 Apr 2020 10:41:17 +0000



Tim Rühsen commented:


Wget simply ignores <meta charset=...> while wget2 takes it (correctly) into 
account.

"test%E4.jpg" contains a byte (0xE4) that is invalid utf-8 at this position.
Wget2 has a short circuit: if the source and destination charsets are the 
same, in this case "utf-8", the conversion is skipped. That's why 
'charset=utf-8' continues without a conversion error.

But "utf8" differs from "utf-8", so a conversion is applied, which fails 
with errno 84 (EILSEQ: Invalid or incomplete multibyte or wide character).
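
To illustrate the difference between the two cases, here is a small Python sketch. It is not wget2's code: `convert_url` and its name-based short circuit are hypothetical stand-ins, and Python's strict decoder plays the role of the charset conversion.

```python
from urllib.parse import unquote_to_bytes

def convert_url(raw: bytes, src: str, dst: str) -> bytes:
    """Hypothetical stand-in for the conversion step (not wget2's real code)."""
    # Short circuit (assumed to compare names only): same name, no conversion,
    # and therefore no validation of the bytes either.
    if src.lower() == dst.lower():
        return raw
    # Names differ: decode strictly from src, then re-encode to dst.
    return raw.decode(src, errors="strict").encode(dst)

# "%E4" becomes the single byte 0xE4. In UTF-8 that byte opens a three-byte
# sequence, but the next byte '.' is not a continuation byte.
raw = unquote_to_bytes("test%E4.jpg")

# Same charset name on both sides: the invalid byte passes through unnoticed.
same = convert_url(raw, "utf-8", "utf-8")

# "utf8" vs "utf-8": the names differ, a real conversion runs, and it fails.
try:
    convert_url(raw, "utf8", "utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```

In this sketch the `UnicodeDecodeError` corresponds to the EILSEQ error mentioned above.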

#### What can we do?

Of course we can treat "utf8" as "utf-8". That would help in some situations, 
but it disguises the real problem.
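
A rough Python sketch of such alias handling, assuming a hypothetical helper `canonical_charset` (wget2 has no function of that name as far as this message states); it relies on Python's codec registry, which already knows "utf8" as an alias of "utf-8".

```python
import codecs

def canonical_charset(name: str) -> str:
    """Map charset aliases such as "utf8" or "UTF-8" to one canonical name."""
    label = name.strip()
    try:
        # The codec registry resolves known aliases to a single codec name.
        return codecs.lookup(label).name
    except LookupError:
        # Unknown label: fall back to a lowercased copy for comparison.
        return label.lower()

# With canonical names, both spellings take the same "no conversion" path.
same_charset = canonical_charset("utf8") == canonical_charset("utf-8")
```

Note that this only makes the short circuit apply to both spellings. The invalid byte in the URL is still there; it is simply no longer checked.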

We can use the URL as-is whenever a conversion error occurs (maybe converting 
to percent-encoded ASCII, if needed). That is a 'best effort' strategy and is 
not guaranteed to succeed.
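
One possible shape of that fallback, again as a Python sketch under assumptions: `url_path_or_fallback` is a hypothetical name, the input is taken to be the already percent-decoded path bytes, and the target charset is fixed to UTF-8.

```python
from urllib.parse import quote

def url_path_or_fallback(raw: bytes, charset: str) -> str:
    """Convert raw path bytes to UTF-8 if possible, else keep them as-is.

    Either way the result is percent-encoded ASCII.
    """
    try:
        converted = raw.decode(charset, errors="strict").encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # Conversion failed (or the charset is unknown): use the bytes untouched.
        converted = raw
    # quote() accepts bytes and percent-encodes everything outside the safe set.
    return quote(converted, safe="/")

# Declared as "utf8" but not valid UTF-8: the original byte is kept.
kept = url_path_or_fallback(b"test\xe4.jpg", "utf8")

# Declared as Latin-1: conversion succeeds and yields the UTF-8 form of U+00E4.
converted = url_path_or_fallback(b"test\xe4.jpg", "latin-1")
```

Whether the server actually answers for the kept form depends on what it expects, which is why this can only be a best-effort strategy.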

The real issue is broken page content, but we can never fix that for the 
whole web.

-- 
Reply to this email directly or view it on GitLab: 
https://gitlab.com/gnuwget/wget2/-/issues/523#note_328897404
