
Re: [Bug-wget] bad filename


From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filename
Date: Wed, 23 Apr 2014 14:43:21 +0200
User-agent: KMail/4.11.5 (Linux/3.13-1-amd64; KDE/4.11.5; x86_64; ; )

On Wednesday 23 April 2014 13:57:15 Andries E. Brouwer wrote:
> On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> > On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
> >> If I ask wget to download the wikipedia page
> >> 
> >> http://he.wikipedia.org/wiki/ש._שפרה
> >> 
> >> then I hope for a resulting file ש._שפרה.
> >> Instead, wget gives me ש._שפר\327%94, where the \327
> >> is an unpronounceable byte that cannot be typed.
> >> (This is a UTF-8 system and the filename
> >> that wget produces is not valid UTF-8.)
> >> 
> >> Maybe it would be better if wget by default used the original filename.
> >> This name mangling is a vestige of old times, it seems to me.
> > 
> > This is a commonly reported grievance and as you correctly mention a
> > vestige of old times. With UTF-8 supported filesystems, Wget should
> > simply write the correct characters.
> > 
> > I sincerely hope this issue is resolved as fast as possible, but I
> > know not how to. Those who understand i18n should work on this.
> 
> It is very easy to resolve the issue, but I don't know how backwards
> compatible the wget developers want to be.

I guess this is the #1 question ;-)

> The easiest solution is to change the line (in init.c:defaults())
>       opt.restrict_files_ctrl = true;
> into
>       opt.restrict_files_ctrl = false;
> 
> That is what I would like to see:
> the default should be to preserve the name as-is,
> and there should be options "escape_control" or so
> to force the current default behaviour.

You know that you can override the default behaviour in ~/.wgetrc (or globally
in /etc/wgetrc)? Normally, the distributions' package maintainers should take
care of reasonable defaults in /etc/wgetrc. E.g. they could set
restrictfilenames=nocontrol for UTF-8 environments.
But I understand them being conservative with changes.

See also 'man wget':
"If you specify nocontrol, then the escaping of the control characters is also 
switched off. This option may make sense when you are downloading URLs whose 
names contain UTF-8 characters, on a system which can save and display 
filenames in UTF-8 (some possible byte values used in UTF-8 byte sequences 
fall in the range of values designated by Wget as "controls")."
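
To illustrate why the reported name ends in \327%94: the Hebrew letter ה
(U+05D4) is the two bytes 0xD7 0x94 in UTF-8, and 0x94 lies in the 128-159
range that the Wget manual lists as "control" characters for
--restrict-file-names. The following is only a minimal sketch of that effect,
not Wget's actual code; the names is_ctrl_byte and escape_ctrl are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: bytes 0-31 and 128-159 count as "control" characters,
   the ranges the Wget manual gives for --restrict-file-names. */
static bool is_ctrl_byte(unsigned char c)
{
    return c < 32 || (c >= 128 && c <= 159);
}

/* Hypothetical helper, not taken from Wget: copy `in` to `out` and
   replace every control byte by %XX. Returns 0 on success, -1 if
   `out` is too small to hold the result. */
static int escape_ctrl(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (is_ctrl_byte(c)) {
            if (n + 4 > outsize)        /* "%XX" plus terminating NUL */
                return -1;
            n += (size_t)snprintf(out + n, outsize - n, "%%%02X", (unsigned)c);
        } else {
            if (n + 2 > outsize)        /* one byte plus terminating NUL */
                return -1;
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return 0;
}
```

Feeding the bytes 0xD7 0x94 through escape_ctrl leaves 0xD7 untouched and
turns 0x94 into the three characters %94, which is no longer valid UTF-8.
With nocontrol, this escaping step is simply skipped.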


> There are also more complicated solutions.
> One can ask for LC_CTYPE or LANG or some such thing,
> and try to find out whether the current system is UTF-8,
> and only in that case set restrict_files_ctrl to false.
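
The locale check suggested in the quoted text could look roughly like the
sketch below on a POSIX system. It is an illustration under assumptions, not
code from Wget: locale_is_utf8 and codeset_is_utf8 are hypothetical names, and
nl_langinfo(CODESET) is not available on Windows.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: true if the codeset name denotes UTF-8.
   Some systems report "UTF-8", others "utf8". */
static bool codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Hypothetical helper: adopt LC_CTYPE from the environment
   (LC_ALL, LC_CTYPE, LANG) and ask which codeset it uses. */
static bool locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return false;               /* unknown locale: stay conservative */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A default along the lines of opt.restrict_files_ctrl = !locale_is_utf8();
would then keep the escaping everywhere except in UTF-8 locales. Whether that
is safe on every platform is exactly the open question.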

You can also use
        --local-encoding
and
        --remote-encoding
for more control over the character encoding.

But what if you have a UTF-8 environment and want to use --input-file,
reading URLs with an ISO-whatever encoding?
--remote-encoding is not the right one...
Yes, we would need an --input-encoding=...

OT:
Talking about i18n, another point is the punycode representation. By now there
are IDNA2003, IDNA2008 and the newest TR46 (which mainly deals with some
incompatibilities between IDNA2003 and IDNA2008). Wget only supports IDNA2003.

> I don't know anything about the Windows environment.

That is a damn good argument not to change the default behaviour... who knows
exactly all the environments where Wget is installed, and who is able to code a
"works everywhere" routine for i18n?
Back to the top... let the user and/or maintainer configure it - they know
best what they want.

Can you live with that answer ?

BTW:
We have been talking about Wget2 for years... having a second tool named 'wget2'
would allow us to change defaults or correct historically imposed glitches.
I would like to transfer lots of code from my project Mget
(https://github.com/rockdaboot/mget) into it (I'm tired of maintaining it ;-)

Tim



