Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)
From: Tim Rühsen
Subject: Re: wget2 | "utf8" charset breaks urls with invalid utf-8 (#523)
Date: Wed, 22 Apr 2020 10:41:17 +0000
Tim Rühsen commented:
Wget simply ignores <meta charset=...>, while wget2 (correctly) takes it into
account.
"test%E4.jpg" contains a byte sequence that is invalid UTF-8: %E4 stands for
the byte 0xE4, which is not followed by the continuation bytes UTF-8 requires.
Wget2 has a short circuit: if the source and destination charsets are the
same, in this case "utf-8", the conversion is skipped. That's why
'charset=utf-8' continues without a conversion error.
But "utf8" differs from "utf-8", so a conversion is applied, which fails
with errno 84 (EILSEQ: Invalid or incomplete multibyte or wide character).
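
The behaviour described above can be imitated with a minimal Python sketch. This is not wget2's code: the `convert` helper is hypothetical, the plain string comparison of charset names is an assumption about how the short circuit decides, and Python raises `UnicodeDecodeError` where iconv would report EILSEQ.

```python
from urllib.parse import unquote_to_bytes

def convert(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical stand-in for a charset conversion step."""
    # Short circuit: identical charset *names* skip the conversion entirely,
    # so invalid input passes through unnoticed.
    if src == dst:
        return data
    # Otherwise the bytes are really decoded, and invalid input is detected.
    return data.decode(src).encode(dst)

raw = unquote_to_bytes("test%E4.jpg")      # b"test\xe4.jpg"

same_name = convert(raw, "utf-8", "utf-8")  # skipped, bytes unchanged

try:
    convert(raw, "utf8", "utf-8")           # names differ, conversion runs
    failed = False
except UnicodeDecodeError:
    failed = True                           # analogous to EILSEQ
```

With `charset=utf-8` the invalid byte is never inspected; with `charset=utf8` it is, and the conversion fails.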
#### What can we do?
Of course we can treat "utf8" as "utf-8". That would help in some situations,
but it would disguise the real problem.
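
As a rough illustration of treating "utf8" as "utf-8", the Python sketch below compares canonical codec names instead of raw strings. The `same_charset` helper is hypothetical, and Python's `codecs.lookup` is used only as a convenient alias table; wget2 would need its own normalization.

```python
import codecs

def same_charset(a: str, b: str) -> bool:
    """Hypothetical helper: compare charsets by canonical codec name."""
    return codecs.lookup(a).name == codecs.lookup(b).name

# "utf8" and "utf-8" now count as the same charset, so a short circuit
# based on this comparison would skip the conversion for both spellings.
aliases_match = same_charset("utf8", "utf-8")

# The underlying problem is still there: the bytes are not valid UTF-8,
# they are just no longer checked.
try:
    b"test\xe4.jpg".decode("utf-8")
    still_invalid = False
except UnicodeDecodeError:
    still_invalid = True
```

This shows why the alias fix only hides the symptom: the comparison succeeds, but the URL bytes remain invalid.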
We can use the URL as-is whenever a conversion error occurs (maybe converting
to percent-encoded ASCII, if needed). That is a best-effort strategy and is
not guaranteed to succeed.
The real issue is broken page content, but we can never fix that for the
whole web.
--
Reply to this email directly or view it on GitLab:
https://gitlab.com/gnuwget/wget2/-/issues/523#note_328897404