Re: Multibyte and unibyte file names

emacs-devel

From:	Eli Zaretskii
Subject:	Re: Multibyte and unibyte file names
Date:	Sat, 26 Jan 2013 15:16:00 +0200

> From: Stefan Monnier <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden
> Date: Sat, 26 Jan 2013 06:34:16 -0500
> 
> >> under what circumstances could such a primitive receive an encoded
> >> file-name, if all the file names returned to Elisp (by things like
> >> directory-files) are already decoded?
> > One way is that a primitive gets called from C.
> 
> So we should fix the (C) caller.

OK, but as long as file-name primitives are required to support
unibyte strings, you cannot be sure these situations won't pop up in
the future.

> I think the right thing to do with unibyte file names is to treat them
> as a sequence of bytes, not a sequence of encoded chars.  If the caller
> doesn't like it, then she should pass a decoded file name instead.

This effectively means we don't support them _as_file_names_.
Because, e.g., testing individual bytes for equality to something like
'\\' can trip on multibyte (DBCS) encodings if the trailing byte
happens to be '\\'.  In general, it isn't "safe" to iterate over these
strings one byte at a time.
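
To make the hazard concrete, here is a minimal sketch (not Emacs code). It assumes cp932, where the character "表" is encoded as the two bytes 0x95 0x5C, so its trailing byte has the same value as '\\'. The helper names are hypothetical, and the lead-byte ranges are the usual Shift-JIS ones (0x81-0x9F and 0xE0-0xFC).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if C is a cp932 lead byte, i.e. the
   first byte of a two-byte character.  */
static int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive approach: treat the string as a plain sequence of bytes and
   count every byte equal to '\\'.  */
static size_t
count_backslashes_bytewise (const unsigned char *s, size_t len)
{
  size_t n = 0;
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    if (s[i] == '\\')
      n++;
  return n;
}

/* Encoding-aware approach: step over the trailing byte of each
   two-byte character, so it is never mistaken for a backslash.  */
static size_t
count_backslashes_cp932 (const unsigned char *s, size_t len)
{
  size_t n = 0;
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    {
      if (cp932_lead_byte_p (s[i]) && i + 1 < len)
        i++;                    /* skip the trailing byte */
      else if (s[i] == '\\')
        n++;
    }
  return n;
}
```

For the two bytes 0x95 0x5C alone, the bytewise count reports one backslash where the encoding-aware count reports none.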

> > I "worry" because they need separate code,
> 
> I think if we only support "sequences of bytes" (unibyte strings) and
> "sequences of decoded chars" (multibyte strings), there is not much need
> for separating the code since there's no risk of a special char (like
> "/", "." or ":") appearing there while it meant something else.

See above: the risk is real, at least on MS-Windows.  That's what
these bugs I've been mentioning are all about.

> > especially with multibyte encodings; writing that code for an encoding
> > not supported by the current locale is tricky at best, if not
> > downright impossible, and certainly inefficient.
> 
> Better not second guess the caller about which encoding she meant.

We invariably assume that the encoding is given by
file-name-coding-system (or by default-file-name-coding-system, if
file-name-coding-system is nil).  I don't see any reason to support
anything else.  Lisp code can always bind file-name-coding-system if
it needs a different encoding.

> > Are you saying that since this happens
> > infrequently, we could process such file names in a broken way,
> 
> Right.

Heh, I don't think this will be well accepted.

> > e.g. finding a directory separator where there's none, as demonstrated
> > in http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#5?
> 
> That seems like a real bug, tho:

Of course, it's a real bug!  This is what will happen, at least on
Windows, if we decide not to pay attention to encoded file names.
Which is what we do now, in many places.

>    (let ((file-name-coding-system 'cp932))
>      (expand-file-name "表" "C:/"))
> 
> should not return "c:/\225/".  Why does it even pay attention to
> file-name-coding-system?

Because it encodes the file name it passes to dostounix_filename.  And
it does that because dostounix_filename needs optionally to downcase
the name (when w32-downcase-file-names is set).  The way
dostounix_filename downcases file names depends on the current locale,
so it must get encoded file names.

It is easy enough to fix dostounix_filename, so that it doesn't
require encoded file names.  But while I reviewed the code that
calls dostounix_filename, I found that I couldn't figure out what were
the requirements for such code, and that's why I started this thread:
to understand the requirements.  For example, we find this in
file-name-directory:

    while (p != beg && !IS_DIRECTORY_SEP (p[-1])
  #ifdef DOS_NT
           /* only recognize drive specifier at the beginning */
           && !(p[-1] == ':'
                /* handle the "/:d:foo" and "/:foo" cases correctly  */
                && ((p == beg + 2 && !IS_DIRECTORY_SEP (*beg))
                    || (p == beg + 4 && IS_DIRECTORY_SEP (*beg))))
  #endif
           ) p--;

If p points to an encoded file name, we could think we found a
backslash in p[-1], where in fact it's a trailing byte of a multibyte
sequence.  And this is just an example; just search for
IS_DIRECTORY_SEP and you will find quite a bit more.
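
The effect on a backward scan like the one quoted above can be sketched as follows. This is a simplified illustration, not the Emacs implementation: the drive-letter handling is left out, IS_SEP and both functions are hypothetical names, and the input is assumed to be cp932-encoded, where "表" is the byte pair 0x95 0x5C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory-separator test that accepts
   both '/' and '\\'.  */
#define IS_SEP(c) ((c) == '/' || (c) == '\\')

/* Hypothetical helper: nonzero if C is a cp932 lead byte.  */
static int
cp932_lead_byte_p (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive backward scan: return the length of the directory part,
   including its final separator, or 0 if there is none.  */
static size_t
dir_len_bytewise (const unsigned char *beg, size_t len)
{
  const unsigned char *p = beg + len;
  assert (beg != NULL);
  while (p != beg && !IS_SEP (p[-1]))
    p--;
  return (size_t) (p - beg);
}

/* Encoding-aware variant: scan forward, skipping the trailing byte of
   each two-byte character, and remember the last real separator.  */
static size_t
dir_len_cp932 (const unsigned char *beg, size_t len)
{
  size_t dir_len = 0;
  assert (beg != NULL);
  for (size_t i = 0; i < len; i++)
    {
      if (cp932_lead_byte_p (beg[i]) && i + 1 < len)
        i++;                    /* skip the trailing byte */
      else if (IS_SEP (beg[i]))
        dir_len = i + 1;
    }
  return dir_len;
}
```

Given the encoded bytes of c:\dir\表, the naive scan stops at once on the trailing 0x5C and treats the whole string as the directory part, while the forward scan stops after c:\dir\. Note that the encoding-aware version has to walk from the start of the string, since a byte's role cannot be decided by looking at it in isolation.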

If file-name-directory manipulates only decoded file names in their
internal representation, then such problems will never happen, because
UTF-8 precludes them.  Thus my question whether we want to support
encoded file names in these primitives as first-class citizens.  And I
still cannot figure out the answer ;-)
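
Why UTF-8 precludes the problem can be shown with a short sketch. It assumes plain UTF-8 input, in which every byte of a multibyte sequence has its high bit set, so a byte below 0x80 is always a complete ASCII character. The function names are hypothetical; "表" (U+8868) is encoded as 0xE8 0xA1 0xA8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory-separator test that accepts
   both '/' and '\\'.  */
#define IS_SEP(c) ((c) == '/' || (c) == '\\')

/* Return nonzero if every byte in S[0..LEN) has its high bit set, as
   all bytes of a UTF-8 multibyte sequence do.  */
static int
all_bytes_non_ascii_p (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    if (s[i] < 0x80)
      return 0;
  return 1;
}

/* Plain backward byte scan for the directory part.  On UTF-8 input a
   separator byte can only be a real separator, so no knowledge of
   character boundaries is needed.  */
static size_t
dir_len_utf8 (const unsigned char *beg, size_t len)
{
  const unsigned char *p = beg + len;
  assert (beg != NULL);
  while (p != beg && !IS_SEP (p[-1]))
    p--;
  return (size_t) (p - beg);
}
```

With the UTF-8 bytes of c:/dir/表, the backward scan passes over 0xA8, 0xA1 and 0xE8 and stops at the genuine separator after "dir".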
