Re: [Bug-wget] Patch: Make url_file_name also convert remote path to loc

From: Tim Rühsen
Subject: Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded
Date: Wed, 15 Nov 2017 20:28:17 +0100
User-agent: KMail/5.2.3 (Linux/4.13.0-1-amd64; KDE/5.37.0; x86_64; ; )

On Monday, 13 November 2017 18:32:46 CET Eli Zaretskii wrote:
> > Cc: address@hidden, address@hidden
> > From: Tim Rühsen <address@hidden>
> > Date: Mon, 13 Nov 2017 16:36:39 +0100
> > 
> > > I don't think it's a Gnulib issue.  The problem is that on Windows,
> > > the implicit call at the beginning of Wget
> > > 
> > >   setlocale (LC_ALL, "C");
> > 
> > Why is there an explicit call with "C" ? There is an explicit call with
> > "".
>
> I said "implicit", not "explicit".  Such an implicit call is made at
> the beginning of every C program, per ANSI C Standard.  Right?
>
> The MSDN documentation says it clearly:
>
>   At program startup, the equivalent of the following statement is executed:
>     setlocale( LC_ALL, "C" );
>
> > From the man page:
> > "If locale is an empty string, "", each part of the locale that should
> > be modified is set according to the environment variables."
> The call with a locale of "" is only done in a build that has
> ENABLE_NLS defined.  I was talking about a build which didn't define
> > > is not good enough to work in multibyte locales of the Far East,
> > > because the Windows runtime assumes a single-byte locale after that
> > > call.  And since Wget happens to need to display text and create files
> > > with non-ASCII characters, it gets hit more than other programs.
> > 
> > I (hopefully) can understand why this doesn't work. NTFS uses UTF-16 for
> > the filenames. If your environment specifies a single-character encoding
> > (e.g. C) and we use at some point a multi-character encoding (e.g.
> > utf-8), then any automatic conversion to UTF-16 filenames is likely to
> > fail. For me the question is: a) does wget have a bug (e.g. creating a
> > filename from a wrongly encoded name string) or b) does the Windows API
> > have a problem?
> > 
> > > The proposed solution is to add a special call to setlocale which gets
> > > this right on Windows.
> > 
> > Why can't we just convert the filename string into the correct encoding
> > and then create the file? What am I missing?
>
> I guess you are missing a short introduction to the Windows l10n/i18n
> mess.  Let me try.
>
> First, the fact that NTFS uses UTF-16 is not really relevant.  Wget
> uses 'char *' strings, not 'wchar_t *' strings, to store file names and
> call C library functions that accept file names.  So we cannot use the
> UTF-16 encoding of non-ASCII file names directly.  Instead, we use the
> locale's codepage (the C library and the OS APIs then convert to
> UTF-16 before hitting the disk, but that's not important now).
>
> Next, creating and opening file names is not the only problem: we also
> need to display these file names and URLs, and that also needs to use
> the encoding expected by the Windows console.
>
> Now, in any locale which uses single-byte encoding of non-ASCII
> characters, the C locale will support those characters, both for I/O
> and for functions like strcmp, strlen, strcoll, etc.  But not in
> double-byte locales of the Far East: there, you must explicitly call
> setlocale with the correct codepage, to have the local character set
> supported.  This support includes manipulating file names, calling C
> library functions to access files, and displaying non-ASCII text, such
> as file names and URLs, on the console.
>
> IOW, this is a Windows runtime subtlety that unfortunately needs to be
> fixed in the application code.
>
> (UTF-8 is not relevant at all here, because Windows doesn't support
> UTF-8 as the locale's codeset; if you try to call setlocale to set
> UTF-8 as the codeset, setlocale will simply fail.  So if we have a
> UTF-8 encoded URL or file name inside wget, we must convert it to the
> current codepage by calling libiconv functions.)
>
> Does the above make sense?  Let me know if I have to explain some
> more.

Thank you, Eli.
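
To make the last quoted point concrete (a UTF-8 encoded file name has to be
converted to the current codepage through iconv before it is handed to the
C library), here is a minimal sketch. It is not the code from the patch.
The helper name utf8_to_codeset and the fixed output buffer are assumptions
made for illustration, and the target encoding is passed in by the caller
rather than looked up from the locale.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of wget: convert the NUL-terminated
   UTF-8 string IN to the encoding named TO_CODE, storing a
   NUL-terminated result in OUT (capacity OUTSIZE bytes).
   Returns 0 on success, -1 on any failure. */
int
utf8_to_codeset (const char *to_code, const char *in,
                 char *out, size_t outsize)
{
  iconv_t cd;
  char *inptr = (char *) in;   /* iconv takes char **, it does not modify the input */
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft;
  size_t rc;

  if (outsize == 0)
    return -1;
  outleft = outsize - 1;       /* keep one byte for the terminating NUL */

  cd = iconv_open (to_code, "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;                 /* conversion pair not supported */

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);  /* flush any shift state */
  iconv_close (cd);

  if (rc == (size_t) -1)
    return -1;                 /* invalid input or output buffer too small */

  *outptr = '\0';
  return 0;
}
```

On Windows the caller would presumably pass the name of the current
codepage as TO_CODE; which name that is, and how it is obtained, is left
open here.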

I just wonder whether we have the same problem on the Linux console as well.
I mean, *not* calling setlocale(LC_ALL, "") (when ENABLE_NLS is undefined)
would leave the program in the C locale, even if the console/environment is
configured for a different one.
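
A minimal sketch of that situation, assuming a hosted C environment. The
helper names locale_is_c and adopt_environment_locale are made up for
illustration and are not wget functions. It only shows that a program
starts in the "C" locale and stays there until setlocale(LC_ALL, "") is
called.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: return 1 if the active LC_ALL locale is "C".
   Passing NULL only queries the locale, it does not change it. */
int
locale_is_c (void)
{
  const char *cur = setlocale (LC_ALL, NULL);
  return cur != NULL && strcmp (cur, "C") == 0;
}

/* Hypothetical helper: switch to the locale named by the environment
   (LC_ALL, LC_*, LANG on POSIX systems).  Returns the new locale name,
   or NULL if the request could not be honored, in which case the
   current locale is left unchanged. */
const char *
adopt_environment_locale (void)
{
  return setlocale (LC_ALL, "");
}
```

Calling locale_is_c() first thing in main() returns 1 whatever LANG or
LC_ALL say. What adopt_environment_locale() returns depends on the
environment and on which locales are installed.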

But no one has complained so far... so my question:
did you test the patch, and does it work for you?

If yes, I am going to apply it.

Regards, Tim
