Re: [Bug-wget] bad filename

From:	Andries E. Brouwer
Subject:	Re: [Bug-wget] bad filename
Date:	Sat, 26 Apr 2014 18:27:19 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Fri, Apr 25, 2014 at 09:39:55PM +0300, Bykov Aleksey wrote:
> Greetings, Andries E. Brouwer
> >>- the patch is inside #ifdef WINDOWS ... #endif while the problem
> >> occurs on all systems, also on Unix.
> Yes, it is. 
> >> - Presently, 0-31 and 127-159 are considered "control".

> Sorry, I prefer converting. At least for uppercase/lowercase conversion
> (with towlower() and towupper()). Sometimes it is useful - when one site,
> mirrored with Wget, is moved between case-sensitive and case-insensitive
> filesystems (ext3 and NTFS).
> I reworked the patch so that it has some chance of working on non-Windows
> systems. Tested with Cyrillic names on FAT32 and NTFS on a win32 system.
> mswindow.diff contains only the Windows-related stuff.
> Best regards, Bykov Aleksey

Hi Aleksey,

I do not know Windows. Can you explain the windows part of the
patch? Why is it a good idea to redefine fopen() and mkdir() etc?
Is it not easiest to leave the filenames as UTF-8?
(And if you make them UTF-16, wouldn't it be better to do that
only depending on some flag?)

What about the setlocale() function? You put it inside #ifndef WINDOWS
... #endif, but in http.c I see it used unprotected. Does it exist
on Windows?

> if (strcasestr(setlocale(LC_CTYPE,NULL),"utf-8")) 

Both en_US.UTF8 and en_US.UTF-8 are common, so a test for "utf-8"
alone will miss the former.
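
A minimal sketch of a check that accepts both spellings. The helper
name locale_name_is_utf8() is hypothetical and not part of Wget; it
only looks at the locale name and makes no claim about what the patch
should finally do.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch): return 1 if a locale name
   such as "en_US.UTF-8" or "en_US.utf8" mentions a UTF-8 codeset,
   0 otherwise.  Matches "utf", an optional '-', then '8', ignoring case. */
static int
locale_name_is_utf8 (const char *name)
{
  if (name == NULL)
    return 0;

  for (; *name != '\0'; name++)
    {
      /* The && chain stops at the first mismatch, so we never read
         past the terminating NUL. */
      if (tolower ((unsigned char) name[0]) == 'u'
          && tolower ((unsigned char) name[1]) == 't'
          && tolower ((unsigned char) name[2]) == 'f')
        {
          const char *q = name + 3;
          if (*q == '-')
            q++;
          if (*q == '8')
            return 1;
        }
    }
  return 0;
}
```

It would be called as locale_name_is_utf8 (setlocale (LC_CTYPE, NULL)).
On POSIX systems nl_langinfo (CODESET) is another way to ask for the
codeset, but I do not know whether that is available on Windows.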

> wchar_count = mbstowcs(NULL,b,e-b);
> mbstowcs(w_string,b,e-b);

If I understand things correctly, you have a sequence of e-b bytes
that you want to test. Now the third argument of mbstowcs(dest, src, n)
is the maximum number of wide characters that will be stored at
the destination, not the number of bytes read, so that this code
might read past the end of the string that must be examined.
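
One way to bound the number of bytes read is to copy the e-b bytes
into a NUL-terminated buffer first and convert that. The following is
only a sketch under that assumption; bytes_to_wcs() is a hypothetical
name, not something in the patch.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the bytes in [b, e) to a newly allocated,
   NUL-terminated wide string using the current locale.  Returns NULL on
   allocation failure or if the bytes are not a valid multibyte string.
   An embedded NUL byte in [b, e) ends the conversion early. */
static wchar_t *
bytes_to_wcs (const char *b, const char *e)
{
  size_t nbytes = (size_t) (e - b);
  char *copy = malloc (nbytes + 1);
  wchar_t *w;
  size_t n;

  if (copy == NULL)
    return NULL;
  memcpy (copy, b, nbytes);
  copy[nbytes] = '\0';          /* mbstowcs now cannot read past e */

  /* With a NULL destination the third argument is ignored and the
     return value is the number of wide characters needed. */
  n = mbstowcs (NULL, copy, 0);
  if (n == (size_t) -1)
    {
      free (copy);
      return NULL;
    }

  w = malloc ((n + 1) * sizeof *w);
  if (w != NULL)
    mbstowcs (w, copy, n + 1);  /* n characters plus the terminator */
  free (copy);
  return w;
}
```

The caller frees the result with free().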

Often wchar_t is a 16-bit type. But UTF-8 can encode larger values.
There is no guarantee that a conversion is possible.

> wchar_t *pw;
> *pw = towlower (*pw);

The routine towlower() takes and returns a wint_t, and this type
is often wider than wchar_t (since it may want to return either
a wchar_t or an error value WEOF that lies outside the range of wchar_t).
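
For illustration, a sketch that keeps the result in a wint_t and checks
it before storing it back. The function name wcs_lower_inplace() is
hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: lowercase a NUL-terminated wide string in place.
   The result of towlower() is held in a wint_t and only stored back
   into the wchar_t array if it is not WEOF. */
static void
wcs_lower_inplace (wchar_t *s)
{
  for (; *s != L'\0'; s++)
    {
      wint_t c = towlower ((wint_t) *s);
      if (c != WEOF)
        *s = (wchar_t) c;
    }
}
```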

As I said, it is not clear to me why you do these conversions
just to test whether a byte lies in the range 0..31 or 127..159.
I think conversions are unnecessary if one only wants to determine
whether a name should be escaped. (And there are all kinds of corner
cases; e.g., there is no guarantee that mbstowcs and wcstombs are inverses.)
Lots of obscure problems are avoided by just not changing these bytes,
only examining them.
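
A sketch of what "only examining" could look like: the bytes are read,
never converted or modified. The name has_control_byte() and its flag
are hypothetical. Whether 128..159 should count as control is a policy
choice, since in UTF-8 those values also occur as continuation bytes of
ordinary characters, so the sketch leaves it to the caller.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if any byte in [b, e) is in 0..31 or
   is 127, or (only when high_is_control is nonzero) is in 128..159.
   The bytes are examined, never converted or changed. */
static int
has_control_byte (const char *b, const char *e, int high_is_control)
{
  for (; b < e; b++)
    {
      unsigned char c = (unsigned char) *b;

      if (c < 32 || c == 127)
        return 1;
      if (high_is_control && c >= 128 && c <= 159)
        return 1;
    }
  return 0;
}
```

With the flag set, the two bytes D0 96 (a Cyrillic letter in UTF-8) are
reported as containing a control byte; with the flag clear they are not.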

Of course conversion is needed for tolower() and toupper().
Maybe that is entirely separate from the question whether control
characters should be escaped.

Andries
