Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)

From: Eli Zaretskii
Subject: Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date: Mon, 14 Dec 2015 21:58:59 +0200

> From: Tim Rühsen <address@hidden>
> Date: Mon, 14 Dec 2015 20:22:41 +0100
> >  1. The functions that call 'iconv' (in iri.c) don't make a point of
> >     flushing the last portion of the converted URL after 'iconv'
> >     returns successfully having converted the input string in its
> >     entirety.  IME, you need then to call 'iconv' one last time with
> >     either the 2nd or the 3rd argument set to NULL, otherwise
> >     sometimes the last converted character doesn't get output.  In my
> >     case, some URLs converted from CP1255 to UTF-8 lost their last
> >     character.  It sounds like no one has actually used this
> >     conversion in iri.c, except for trivially converting UTF-8 to
> >     itself.  Is that possible/reasonable?
> Possibly. 
> Could you please give an example string? I would like to test it on
> GNU/Linux, BSD and Solaris to see if the output is always the same.

This is what gave me trouble:


This is the https://he.wikipedia.org/wiki/ש._שפרה URL that Andries was
using in his tests, but encoded in CP1255 (and hex-encoded after that).
Try converting it into UTF-8, and you will find the last character
chopped off after 'iconv' returns.  Or at least that's what happens
for me.
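
To make the flushing step concrete, here is a minimal sketch of a
conversion that issues the extra 'iconv' call with a NULL input
argument.  The helper name 'convert_with_flush' and the buffer sizing
are my own assumptions for illustration; this is not the code in iri.c.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT wget's iri.c code: convert the NUL-terminated
   string IN from encoding FROM to encoding TO.  Returns a malloc'ed
   NUL-terminated string, or NULL on any failure. */
char *convert_with_flush(const char *from, const char *to, const char *in)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return NULL;

    size_t inleft = strlen(in);
    size_t cap = 4 * inleft + 16;      /* generous bound, enough for a sketch */
    char *out = malloc(cap + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *) in;           /* iconv wants char **; input is not modified */
    char *outp = out;
    size_t outleft = cap;

    /* First call converts the input.  Second call, with a NULL input
       buffer, asks the converter to emit anything it is still holding. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
        || iconv(cd, NULL, NULL, &outp, &outleft) == (size_t) -1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

Calling convert_with_flush ("CP1255", "UTF-8", s) on the decoded bytes
of the URL should then give the complete string, assuming the local
'iconv' knows the CP1255 name.  The caller frees the result.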

> >  2. Wget assumes that the URL given on its command line is encoded in
> >     the locale's encoding.  This is a good assumption when the user
> >     herself types the URL at the shell prompt, but not when the URL is
> >     copy-pasted from a browser's address bar.  In the latter case, the
> >     URL tends to be in UTF-8 (sometimes hex-encoded).  At least that's
> >     what I get from Firefox.  We don't seem to have in wget any
> >     facilities to specify a separate (3rd) encoding for the URLs on
> >     the command line, do we?
> I stumbled upon this a while ago when thinking about the design of wget2. And 
> wget2 already has a working --input-encoding option for such cases.
> AFAIK, nobody asked for such an option during the last years - so I assume 
> this to be a somewhat 'expert' or 'fancy' option, at least a low priority one.
> It is an optional goodie.

IMO, it's a sorely missing feature, since copy/pasting URLs from a
browser is something people do very often.  I do it all the time,
because wget is often much better at downloading large files than
a browser.
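
For the hex-encoded form that a browser hands out, the bytes have to be
recovered before any charset conversion can be attempted.  A rough
sketch of that decoding step follows; 'percent_decode' is a hypothetical
name of mine, not an existing wget function, and it deliberately leaves
malformed escapes untouched.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if C is not a hex digit. */
static int hexval(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (unsigned char) tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper, not part of wget: replace each %XX escape in S
   with the byte it names.  Malformed escapes are copied through as-is.
   Returns a malloc'ed NUL-terminated string, or NULL if out of memory. */
char *percent_decode(const char *s)
{
    size_t n = strlen(s);
    char *out = malloc(n + 1);         /* decoding never grows the string */
    if (out == NULL)
        return NULL;

    char *p = out;
    for (size_t i = 0; i < n; i++) {
        int hi, lo;
        /* Short-circuit evaluation keeps us from reading past the NUL. */
        if (s[i] == '%'
            && (hi = hexval((unsigned char) s[i + 1])) >= 0
            && (lo = hexval((unsigned char) s[i + 2])) >= 0) {
            *p++ = (char) (hi * 16 + lo);
            i += 2;
        } else {
            *p++ = s[i];
        }
    }
    *p = '\0';
    return out;
}
```

The result is only a byte string; which encoding those bytes are in
(UTF-8 from the browser, or the locale's encoding) is exactly the
information wget has no way to be told on the command line.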

