Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded

From:	Eli Zaretskii
Subject:	Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded
Date:	Mon, 13 Nov 2017 18:32:46 +0200

> Cc: address@hidden, address@hidden
> From: Tim Rühsen <address@hidden>
> Date: Mon, 13 Nov 2017 16:36:39 +0100
> 
> > I don't think it's a Gnulib issue.  The problem is that on Windows,
> > the implicit call at the beginning of Wget
> > 
> >   setlocale (LC_ALL, "C");
> 
> Why is there an explicit call with "C" ? There is an explicit call with "".

I said "implicit", not "explicit".  Such an implicit call is made at
the beginning of every C program, per ANSI C Standard.  Right?

The MSDN documentation says it clearly:

  At program startup, the equivalent of the following statement is executed:

    setlocale( LC_ALL, "C" );

> From the man page:
> "If locale is an empty string, "", each part of the locale that should
> be modified is set according to the environment variables."

The call with a locale of "" is only done in a build that has
ENABLE_NLS defined.  I was talking about a build which didn't define
ENABLE_NLS.
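To make the difference between the two calls concrete, here is a minimal sketch in standard C.  The helper names (in_startup_c_locale, adopt_environment_locale) are made up for illustration and are not taken from the Wget sources; the sketch only assumes the standard setlocale semantics quoted above.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the program is still running in the
   "C" locale that the runtime selects implicitly at startup. */
static int in_startup_c_locale(void)
{
    /* A NULL locale argument only queries the current setting. */
    const char *current = setlocale(LC_ALL, NULL);
    return current != NULL && strcmp(current, "C") == 0;
}

/* Hypothetical helper: the explicit call with "", which a build with
   ENABLE_NLS makes.  Returns the new locale name, or NULL if the
   environment names a locale the runtime does not know. */
static const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

A program that never makes the second call stays in the "C" locale for its whole lifetime, which is the situation of a build without ENABLE_NLS.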

> > is not good enough to work in multibyte locales of the Far East,
> > because the Windows runtime assumes a single-byte locale after that
> > call.  And since Wget happens to need to display text and create files
> > with non-ASCII characters, it gets hit more than other programs.
> 
> I (hopefully) can understand why this doesn't work. NTFS uses UTF-16 for
> the filenames. If your environment specifies a single-character encoding
> (e.g. C) and we use at some point a multi-character encoding (e.g.
> utf-8), then any automatic conversion to UTF-16 filenames is likely to
> fail. For me the question is: a) does wget have a bug (e.g. creating a
> filename with a wrongly encoded name string) or b) does the Windows API
> have a problem.
> 
> > The proposed solution is to add a special call to setlocale which gets
> > this right on Windows.
> 
> Why can't we just convert the filename string into the correct encoding
> and then create the file ? What do I miss ?

I guess you are missing a short introduction to the Windows l10n/i18n
mess.  Let me try.

First, the fact that NTFS uses UTF-16 is not really relevant.  Wget
uses 'char *' strings, not 'wchar_t *' strings, to store file names and
to call C library functions that accept file names.  So we cannot use
the UTF-16 encoding of non-ASCII file names directly.  Instead, we use
the locale's codepage (the C library and the OS APIs then convert to
UTF-16 before hitting the disk, but that's not important now).

Next, creating and opening files is not the only problem: we also need
to display these file names and URLs, and that must use the encoding
expected by the Windows console.

Now, in any locale which uses single-byte encoding of non-ASCII
characters, the C locale will support those characters, both for I/O
and for functions like strcmp, strlen, strcoll, etc.  But not in
double-byte locales of the Far East: there, you must explicitly call
setlocale with the correct codepage, to have the local character set
supported.  This support includes manipulating file names, calling C
library functions to access files, and displaying non-ASCII text, such
as file names and URLs, on the console.

IOW, this is a Windows runtime subtlety that unfortunately needs to be
fixed in the application code.
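As a rough illustration of what such an explicit call could look like, here is a sketch; it is not the patch under discussion.  It assumes the ".code-page" form of locale name that the Microsoft runtime accepts, takes the codepage from GetACP, and falls back to the plain "" call on other systems.  The function names are hypothetical.

```c
#include <locale.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>   /* GetACP */
#endif

/* Hypothetical helper: formats a codepage number as a Windows-style
   locale name such as ".932".  Returns 0 on success, -1 if BUF is too
   small to hold the result. */
static int format_codepage_locale(unsigned codepage, char *buf, size_t size)
{
    int n = snprintf(buf, size, ".%u", codepage);
    return (n > 0 && (size_t) n < size) ? 0 : -1;
}

/* Hypothetical helper: selects the system's native character set.
   Returns the new locale name, or NULL on failure. */
static const char *set_native_locale(void)
{
#ifdef _WIN32
    char name[16];
    /* Assumption: the system ANSI codepage is the one we want. */
    if (format_codepage_locale(GetACP(), name, sizeof name) != 0)
        return NULL;
    return setlocale(LC_ALL, name);
#else
    /* Elsewhere the environment variables decide. */
    return setlocale(LC_ALL, "");
#endif
}
```

Whether LC_ALL or only LC_CTYPE should be set this way is a separate decision that the sketch does not try to settle.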

(UTF-8 is not relevant at all here, because Windows doesn't support
UTF-8 as the locale's codeset; if you try to call setlocale to set
UTF-8 as the codeset, setlocale will simply fail.  So if we have a
UTF-8 encoded URL or file name inside wget, we must convert it to the
current codepage by calling libiconv functions.)
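For the conversion step, something along these lines should do.  This is only a sketch built on the standard iconv interface: the helper name utf8_to_codeset is hypothetical, and the target codeset name (for example "CP932" on a Japanese Windows system) is an assumption that the caller must supply.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts the NUL-terminated UTF-8 string IN to the
   codeset TOCODE and stores the NUL-terminated result in OUT, which has
   room for OUTSIZE bytes.  Returns 0 on success, -1 on any failure
   (unknown codeset, unconvertible input, or OUT too small). */
static int utf8_to_codeset(const char *tocode, const char *in,
                           char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *) in;      /* iconv wants a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;        /* reserve room for the terminator */

    cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t) -1)
        /* Flush any pending shift sequence of stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;
    *outp = '\0';
    return 0;
}
```

A caller would pass the name of the current codepage as TOCODE and treat a -1 return as "keep the original bytes or escape them", whichever policy the program prefers.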

Does the above make sense?  Let me know if I have to explain some
more.
