Re: Multibyte and unibyte file names

emacs-devel

From:	Stefan Monnier
Subject:	Re: Multibyte and unibyte file names
Date:	Sat, 26 Jan 2013 17:11:25 -0500
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (gnu/linux)

> OK, but as long as file-name primitives are required to support
> unibyte strings, you cannot be sure these situations won't pop up in
> the future.

I don't see a need to disallow unibyte strings, but I don't see the need
to be particularly careful about it either.  Basically, Elisp code which
provides unibyte file names does so at its own risk.

>> I think the right thing to do with unibyte file names is to treat them
>> as a sequence of bytes, not a sequence of encoded chars.  If the caller
>> doesn't like it, then she should pass a decoded file name instead.
> This effectively means we don't support them _as_file_names_.
> Because, e.g., testing individual bytes for equality to something like
> '\\' can trip on multibyte (DBCS) encodings if the trailing byte
> happens to be '\\'.  In general, it isn't "safe" to iterate over these
> strings one byte at a time.

But that's exactly the behavior stipulated by POSIX (tho for '/' rather
than '\\').  I.e. if you use file names on a POSIX host with
a coding-system that occasionally uses '/' within its multibyte
sequences, you'll get those surprises regardless of Emacs.  And for that
reason, Emacs would be right to cut those file names in the middle of
a multibyte sequence.

IIUC that's what makes this a "w32-only problem", because the w32
semantics for file names is based on characters, so a '\\' (or a '/')
appearing within a multibyte sequence is not considered by the OS as
a separator.

And since Emacs is largely based on "POSIX semantics for the generic
code, plus an emulation layer in w32.c", we have a problem of subtly
incompatible semantics.
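
To make the two semantics concrete, here is a minimal sketch in Python (not Emacs code).  It assumes Python's cp932 codec as a stand-in for file-name-coding-system, and the helper names split_bytes and split_chars are hypothetical.  The byte-based split mirrors the POSIX view, where every separator byte counts; the character-based split mirrors the w32 view, where a separator byte inside a multibyte sequence does not.

```python
# Illustration only: Python's cp932 codec stands in for
# file-name-coding-system; the helper names are hypothetical.

ENCODED = "表".encode("cp932")  # two bytes: 0x95 followed by 0x5C ('\\')


def split_bytes(name: bytes) -> list:
    """Byte-based view: every 0x5C byte is treated as a separator."""
    return name.split(b"\\")


def split_chars(name: bytes, coding: str = "cp932") -> list:
    """Character-based view: decode first, then look for separators."""
    return name.decode(coding).split("\\")


# An encoded name equivalent to C:\表\foo
path = b"C:\\" + ENCODED + b"\\foo"

print(split_bytes(path))  # the trail byte of 表 is taken for a separator
print(split_chars(path))  # 表 stays in one piece
```

The byte-based result contains a lone 0x95 component, which is the same kind of damage as the "c:/\225/" result quoted further down.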

>> > Are you saying that since this happens
>> > infrequently, we could process such file names in a broken way,
>> Right.
> He, I don't think this will be well accepted.

I haven't heard too many screams about this over the years.

>> > e.g. finding a directory separator where there's none, as demonstrated
>> > in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?
>> That seems like a real bug, tho:
> Of course, it's a real bug!  This is what will happen, at least on
> Windows, if we decide not to pay attention to encoded file names.
> Which is what we do now, in many places.

>>   (let ((file-name-coding-system 'cp932))
>>     (expand-file-name "表" "C:/"))
>> should not return "c:/\225/".  Why does it even pay attention to
>> file-name-coding-system?
> Because it encodes the file name it passes to dostounix_filename.

Why? [ OK, I see the answer follows.. ]

> And it does that because dostounix_filename needs optionally to
> downcase the name (when w32-downcase-file-names is set).

Hmm.. but downcasing is an operation on chars, not on bytes, so it
should be applied to decoded names, right?

> The way dostounix_filename downcases file names depends on the current
> locale, so it must get encoded file names.

Are you saying that the "downcase" function is not Emacs's own but is
a function provided by the OS, so we need to encode the name to pass it
to that function?  If so, we need to immediately decode the result.
(and of course this encode+downcase+decode is only done if
w32-downcase-file-names is set).

Alternatively, we could use Emacs's own downcasing function, which does
not depend on the locale and operates directly on decoded names.
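
A small sketch of why the downcasing belongs on decoded names, again in Python rather than Emacs code.  The assumptions: Python's cp932 codec stands in for the file-name coding system, bytes.lower() stands in for a naive byte-wise downcase (it is not what the w32 code actually calls), and str.lower() stands in for Emacs's own downcasing.  The helper names are hypothetical.

```python
# Illustration only; helper names are hypothetical.


def downcase_bytes(encoded: bytes) -> bytes:
    """Naive byte-wise downcase: maps bytes 0x41..0x5A to 0x61..0x7A,
    including trail bytes of double-byte characters."""
    return encoded.lower()


def downcase_decoded(encoded: bytes, coding: str = "cp932") -> bytes:
    """Decode, downcase the characters, then encode again."""
    return encoded.decode(coding).lower().encode(coding)


# "A" followed by katakana ア, whose cp932 encoding ends in byte 0x41 ('A').
name = "Aア".encode("cp932")

print(downcase_bytes(name))    # the trail byte of ア was changed as well
print(downcase_decoded(name))  # only the ASCII letter was changed
```

With the byte-wise variant the second character no longer decodes to ア, whereas the decode-first variant changes only the leading "A".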

> If p points to an encoded file name, we could think we found a
> backslash in p[-1], where in fact it's a trailing byte of a multibyte
> sequence.  And this is just an example; just search for
> IS_DIRECTORY_SEP and you will find quite a bit more.

As explained elsewhere, such a "spurious directory separator within
a multibyte char" has a different meaning under w32 than under POSIX.
The current Emacs code is correct in this respect under POSIX (as odd as
it may sound).

Luckily the problem should only appear if such code is run on unibyte
names and that should be rare enough (in the generic part of the C code)
that we don't need to worry about it.

But indeed for uses of IS_DIRECTORY_SEP in w32.c this is probably more
serious since those functions emulate POSIX calls, so they always receive
encoded file names.
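
For code that has to work on encoded names, one possible approach is to scan forward and skip the trail byte after each lead byte, instead of testing individual bytes (or p[-1]) in isolation.  The following Python sketch illustrates the idea for cp932 only; it is not the w32.c code, the function names are hypothetical, and the lead-byte ranges (0x81..0x9F and 0xE0..0xFC) are specific to cp932.

```python
# Illustration only: a lead-byte-aware separator scan for cp932-encoded
# names.  Function names are hypothetical.


def is_cp932_lead_byte(b: int) -> bool:
    """True if b starts a double-byte cp932 sequence."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def separator_offsets(encoded: bytes) -> list:
    """Offsets of '/' and '\\' bytes that are not trail bytes."""
    offsets = []
    i = 0
    while i < len(encoded):
        b = encoded[i]
        if is_cp932_lead_byte(b):
            i += 2  # skip the lead byte and its trail byte together
            continue
        if b in (0x2F, 0x5C):  # '/' or '\\'
            offsets.append(i)
        i += 1
    return offsets


def naive_separator_offsets(encoded: bytes) -> list:
    """Byte-at-a-time test, comparable to checking each byte in isolation."""
    return [i for i, b in enumerate(encoded) if b in (0x2F, 0x5C)]


# An encoded name equivalent to C:\表\foo
path = b"C:\\" + "表".encode("cp932") + b"\\foo"

print(naive_separator_offsets(path))  # includes the trail byte of 表
print(separator_offsets(path))        # only the real separators
```

The scan has to start from the beginning of the string (or from a known character boundary), which is why looking backwards at a single byte cannot be made reliable this way.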

> UTF-8 precludes them.  Thus my question whether we want to support
> encoded file names in these primitives as first-class citizens.

Could you specify a bit more precisely which primitives you have
in mind?


        Stefan
