Re: [Bug-wget] bad filenames (again)

From:	Andries E. Brouwer
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Sun, 16 Aug 2015 22:21:20 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote:

(i)

>> #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
>>   /* insert some test for Windows */
>> #else
>>  ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ...
>> #endif

> I'm not sure this is the right way to fix this.  First, relying on
> UTF-8 locale to be announced in the environment is less portable than
> it could be: it's better to call 'setlocale' with the 2nd argument
> NULL to glean the same information.  Then the ugly #ifdef above could
> be dropped, and at least Cygwin will not be excluded from this
> feature.

I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
because I do not know anything about these platforms. It is quite
possible that the #ifdef is unneeded.

Are you saying that the #ifdef is in fact needed when getenv() is used,
but unneeded when setlocale() is used? And then what about LANG?
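
To make the two alternatives concrete, here is a minimal sketch of both
tests side by side. It is an illustration, not the code from the patch:
the function names are hypothetical, and the parsing assumes locale names
of the usual "language_TERRITORY.codeset@modifier" shape, which is
implementation-defined. As far as LANG goes, setlocale(LC_CTYPE, "")
itself consults LC_ALL, LC_CTYPE and LANG in that order, so the second
variant covers it without naming it.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: does a locale name such as "en_US.UTF-8" or
   "nl_NL.utf8@euro" carry a UTF-8 codeset suffix? */
int locale_name_is_utf8(const char *name)
{
    const char *cs = name ? strchr(name, '.') : NULL;
    size_t len;

    if (cs == NULL)
        return 0;                 /* no codeset part, e.g. "C" or "POSIX" */
    cs++;                         /* skip the dot */
    len = strcspn(cs, "@");       /* codeset ends at an optional @modifier */
    return (len == 5 && strncasecmp(cs, "UTF-8", 5) == 0)
        || (len == 4 && strncasecmp(cs, "UTF8", 4) == 0);
}

/* Environment-only test: the first of LC_ALL, LC_CTYPE, LANG that is
   set and non-empty decides, which is the POSIX precedence. */
int env_locale_is_utf8(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && *v != '\0')
            return locale_name_is_utf8(v);
    }
    return 0;
}

/* Library-based test: query the current LC_CTYPE locale without
   changing it.  This only reflects the user's environment if
   setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "") was called earlier;
   before that the program is still in the "C" locale. */
int libc_locale_is_utf8(void)
{
    return locale_name_is_utf8(setlocale(LC_CTYPE, NULL));
}
```

The library-based variant needs no knowledge of the variable names, but
it still parses the string that setlocale() returns, and the format of
that string differs between platforms.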

(ii)

> Moreover, even if the locale is not UTF-8, wget should attempt to
> convert the file names to the current locale using iconv (which I
> believe was what Tim suggested).  This will DTRT in much more cases
> than the above UTF-8 centric approach, IMO.

Hmm. My own point of view is almost the opposite. In my life I have
spent countless hours trying to repair the damage done by software
that helpfully modified my data.
I prefer my data as-is, unless I explicitly ask for conversion.

I think Tim suggested something else (namely, just checking whether
the filename was valid UTF-8), but never mind.
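
For reference, a check of that kind could look roughly like the sketch
below. This is my reading of the idea, not code from wget or from Tim;
the name is_valid_utf8 is hypothetical. It accepts only well-formed
UTF-8, so overlong forms, surrogates and code points above U+10FFFF
are rejected.

```c
#include <stddef.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, else 0. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t need, k;       /* need = number of continuation bytes */
        unsigned long cp;     /* decoded code point */
        unsigned long min;    /* smallest code point allowed at this length */

        if (c < 0x80) {       /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;         /* stray continuation byte, or 0xF8..0xFF */
        }

        if (n - i - 1 < need)
            return 0;         /* sequence truncated at end of input */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;     /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;         /* overlong, out of range, or surrogate */

        i += need + 1;
    }
    return 1;
}
```

A name such as "caf\xc3\xa9" passes, while the Latin-1 bytes "caf\xe9"
do not. Note that passing the check only shows the bytes could be
UTF-8; it does not prove that the sender meant them as UTF-8.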

The patch enlarges the number of cases where the original data
is preserved. Yes, I am all in favour of enlarging that number of
cases even further. This is only a first step. But in my eyes
applying iconv would be a step back. It can be really tricky to
decode the mojibake obtained by converting from A to C when
the original really was in B.
How do you guess the original character set?
What should happen when iconv() returns EILSEQ?
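
To show where the second question bites, here is a sketch of one
possible policy: try the conversion, and on any failure (EILSEQ
included) keep the original bytes. This is only an illustration of the
"data as-is" fallback, not something wget does; convert_or_keep and its
parameters are hypothetical, and the source codeset must still be
guessed by the caller.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated name `in` from codeset `from` to codeset
   `to`, writing a NUL-terminated result into out[outsize].
   Returns 0 on success.  On failure returns the errno value and leaves
   an unconverted copy of `in` in `out` (truncated if it does not fit). */
int convert_or_keep(const char *from, const char *to,
                    const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *src = (char *) in;      /* iconv() takes char **, it does not modify the input */
    char *dst = out;
    size_t srcleft = strlen(in);
    size_t dstleft;
    int err = 0;

    if (outsize == 0)
        return EINVAL;
    dstleft = outsize - 1;        /* reserve room for the terminating NUL */

    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1) {
        err = errno;              /* conversion pair not supported */
    } else {
        if (iconv(cd, &src, &srcleft, &dst, &dstleft) == (size_t) -1)
            err = errno;          /* EILSEQ, EINVAL or E2BIG */
        else if (iconv(cd, NULL, NULL, &dst, &dstleft) == (size_t) -1)
            err = errno;          /* flushing a stateful encoding failed */
        iconv_close(cd);
    }

    if (err != 0) {
        /* Fallback: discard any partial output and keep the bytes as-is. */
        size_t n = strlen(in);
        if (n > outsize - 1)
            n = outsize - 1;
        memcpy(out, in, n);
        dst = out + n;
    }
    *dst = '\0';
    return err;
}
```

The EILSEQ case is the easy one, because at least the failure is
visible. If the bytes "caf\xc3\xa9" (UTF-8) are wrongly assumed to be
ISO-8859-1 and converted to UTF-8, iconv reports success and the result
is the mojibake "caf\xc3\x83\xc2\xa9", with no error to react to.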

Andries
