[Bug-wget] New feature: --restrict-file-names=ascii


From: Micah Cowan
Subject: [Bug-wget] New feature: --restrict-file-names=ascii
Date: Tue, 28 Jul 2009 19:02:49 -0700
User-agent: Thunderbird 2.0.0.22 (X11/20090608)

A new parameter has been added to the --restrict-file-names option,
"ascii", which forces the percent-encoding of any byte values outside
the range of ASCII characters (that is, greater than 127).
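As a rough illustration of the rule described above (not Wget's actual
code), here is a minimal Python sketch. The function name
escape_non_ascii is hypothetical, and the use of uppercase hex digits
is an assumption.

```python
def escape_non_ascii(name: bytes) -> str:
    """Percent-encode every byte above 127; leave ASCII bytes untouched.

    Illustrative only: the function name and the uppercase hex digits
    are assumptions, not taken from Wget's source.
    """
    return "".join(
        f"%{b:02X}" if b > 127 else chr(b)
        for b in name
    )


# "café.html" encoded as UTF-8: the "é" is the two bytes C3 A9.
print(escape_non_ascii("café.html".encode("utf-8")))  # caf%C3%A9.html
```

Because every byte of a multi-byte character is above 127, all of them
are escaped together, and the result is plain ASCII regardless of the
local encoding.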

This addresses a shortcoming: there was previously no appropriate way to
invoke wget on URLs encoded in (say) UTF-8 if the native system encoding
(or user's locale, etc.) was not UTF-8. Wget's default behavior leaves
most of the high bytes intact, but _not_ the ones that correspond to
Wget's notion of a control character (in this case, values in the
hexadecimal range 80-9F). This means that some bytes of a single UTF-8
character may be percent-encoded while others are left intact, resulting
in a garbled URL (this is unfortunately still the current default
behavior).
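To make the garbling concrete, here is a small Python sketch of the
partial escaping described above. It models only the 80-9F rule from
this message; the function name escape_c1_only is hypothetical, and
Wget's real default handles other characters as well.

```python
def escape_c1_only(name: bytes) -> bytes:
    """Percent-encode only bytes in the range 0x80-0x9F.

    A simplified model of the default behavior described in the text,
    not Wget's actual implementation.
    """
    out = bytearray()
    for b in name:
        if 0x80 <= b <= 0x9F:
            out += f"%{b:02X}".encode("ascii")  # escaped "control" byte
        else:
            out.append(b)                       # everything else kept as-is
    return bytes(out)


# "Ä" (U+00C4) is the two UTF-8 bytes C3 84. Only the second byte falls
# in 80-9F, so the character is split: one raw byte, one escaped byte.
raw = "Ä.html".encode("utf-8")
print(escape_c1_only(raw))  # b'\xc3%84.html'
```

The lone C3 byte left behind is no longer valid UTF-8 on its own, which
is the garbling in question.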

One can prevent this behavior by using --restrict-file-names=nocontrol,
which is great if the remote file name encoding matches your local one;
not so great if it doesn't. Thus, the Powerpuff Girls^W^W "ascii"
parameter was born.

While I was working on this, I also discovered that the new-to-Wget-1.11
values "lowercase" and "uppercase" had not been documented, so I
rectified that as well.

Note, I'm not particularly happy with --restrict-file-names; it tries to
be too many things (especially, now that it has "lowercase" and
"uppercase", it doesn't even necessarily, um, restrict... file names),
and it's somewhat complicated (some of the possible parameters are
mutually exclusive with some others, but not the rest). At some point in
the future, I'll probably want to provide a more general solution that
gives you finer control over exactly what characters get escaped; and
even better, Wget will hopefully be able to transcode file names to the
current locale settings, which should be somewhat more agreeable.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/