emacs-devel

Re: Multibyte and unibyte file names


From: Eli Zaretskii
Subject: Re: Multibyte and unibyte file names
Date: Sat, 26 Jan 2013 15:36:36 +0200

> From: "Stephen J. Turnbull" <address@hidden>
> Cc: address@hidden
> Date: Sat, 26 Jan 2013 22:03:28 +0900
> 
> Eli Zaretskii writes:
> 
>  > > "Unibyte" as implemented in Emacs is a premature optimization, and a
>  > > disaster in search of places to happen.  Remove it, and you'll never
>  > > notice it's gone.  The consequence of that removal would be to fix
>  > > this problem, permanently.
>  > 
>  > I don't think you are entirely correct.
> 
> My preferred flavor of Emacs never had unibyte.  It's got its problems
> in this area, but they're just lazy or over-ambitious programmer bugs,
> not a design flaw.

I can't reason about something I know nothing about.  So this is not a
useful argument.

>  > We still need to send encoded (unibyte) strings to the outside
>  > world.
> 
> Of course.  In fact, pretty much all interaction with the outside
> world involves byte streams.  The problem Emacs is experiencing here
> is that Lisp can see bytes when it is designed only to work with
> characters.

In GNU Emacs, Lisp can work with bytes as well.

>  > [Determining file name encoding is] a non-issue: we treat unibyte
>  > file names as encoded in file-name-coding-system.  Nothing else is
>  > supported, or needed.
> 
> It is in Japan, where it's still common to have a host whose hard
> drive uses UTF-8, mounting EUC-JP-encoded volumes over NFS, and USB
> drives with Shift-JIS file names.  I've even seen file names
> containing segments encoded variously in KOI8, Shift JIS, *and* EUC-JP
> (in Macintosh notation, no less).  Admittedly, not in a very long
> time, but it's still *possible* to do that on POSIX systems.
> 
> You just can't win in this environment; you will see mojibake, and
> sometimes undecodable names, unless you get help from the user.  Such
> names can be round-tripped using special "undecodable bytes"
> representation (UTF-8B or non-unicode code points).  But if you try to
> manipulate those names in Lisp, you will sometimes get incorrect
> results.
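
The round trip of undecodable bytes described in the quoted text can be
sketched in Python, whose "surrogateescape" error handler implements the
UTF-8B idea.  This is only an illustration under assumptions: the helper
names decode_name and encode_name are hypothetical, the sample name mixes
EUC-JP and UTF-8 on purpose, and this is not the representation Emacs
itself uses.

```python
# Hypothetical helpers illustrating UTF-8B style round-tripping.
# Python's "surrogateescape" handler maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN and maps it back on encoding.

def decode_name(raw: bytes) -> str:
    """Decode a file name as UTF-8, escaping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_name(name: str) -> bytes:
    """Encode back to bytes, restoring any escaped bytes unchanged."""
    return name.encode("utf-8", errors="surrogateescape")

# A name whose first component is EUC-JP and whose second is UTF-8.
raw = "日本".encode("euc_jp") + b"/" + "語".encode("utf-8")

name = decode_name(raw)

# The bytes survive the round trip even though part of the name
# could not be decoded.
round_tripped = encode_name(name)

# String-level manipulation sees one pseudo-character per escaped byte
# rather than the two characters the EUC-JP bytes stand for.
escaped = [ch for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]
```

The name comes back byte-for-byte, but anything that counts or indexes
"characters" in the decoded string is working on escaped bytes for the
EUC-JP part, which is where the incorrect results come from.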

That's OK.  Emacs cannot solve these situations, and I didn't try to
target them.  I will be happy enough to correctly support file names
consistently encoded in a single encoding that is the value of
file-name-coding-system.  I hope you will agree that having _that_
broken is not good.

>  > Exactly.  Moreover, what you suggest is a large project that won't
>  > happen without a motivated individual.  Given the overall "cannot
>  > happen on POSIX, so it's SEP"
> 
> It can easily happen on POSIX systems, especially with removable media
> or double-booting hosts.

If you look back at this thread, you will see that this is what I
tried to say, but was consistently told that POSIX systems have no
such problems "in practice".

> But I don't see why it should be so difficult.  You already have all
> the functions needed to decode byte streams to Lisp strings or
> buffers, and that's the normal mode of operation, no?

Decoding is not a problem, but it hampers efficiency.  There's also an
associated problem that decoding a file name can trigger GC, which is
not good for functions that get 'char *' pointers as arguments.
Therefore, decoding is best avoided (although we do use it when we
have no choice, e.g., when we need to produce a file name from a
unibyte directory and a multibyte file name).

> In fact AFAIK the set of programs that use the unibyte feature at
> all is pretty small, and most of those (like Tramp) do so only in
> self-defense.

You are thinking on the wrong level.  The problem rears its ugly head
on the C level, not on the Lisp level.  Functions in dired.c and
fileio.c manipulate file names, assuming it is safe to address
individual bytes even if the file name is in some DBCS encoding.  I
gave one example a few messages ago.
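
To make the byte-addressing hazard concrete, here is a minimal sketch in
Python.  It is not the code from dired.c or fileio.c; the function names
are hypothetical, and it assumes a cp932 (Shift-JIS) encoded name with
backslash as the directory separator, as on MS-Windows.  In that encoding
the second byte of some double-byte characters is 0x5C, the same value as
an ASCII backslash.

```python
# Hypothetical illustration of scanning an encoded file name byte by byte.
SEP = b"\\"  # 0x5C, a directory separator on MS-Windows

def split_bytewise(encoded: bytes):
    """Naive split: treats every 0x5C byte as a directory separator."""
    return encoded.split(SEP)

def split_decoded(encoded: bytes, coding: str):
    """Decode first, then split on the separator character."""
    return encoded.decode(coding).split("\\")

# U+8868 encodes as the two bytes 0x95 0x5C in cp932.
name = "dir\\表.txt"
encoded = name.encode("cp932")

naive = split_bytewise(encoded)            # cuts the character in half
correct = split_decoded(encoded, "cp932")  # keeps it intact
```

The byte-wise version finds a "separator" in the middle of the
double-byte character and yields three components instead of two.
Avoiding that requires either decoding first or stepping through the
encoded name one character at a time.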


