From: Eli Zaretskii
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 08:33:01 +0300
> From: Rob Browning <rlb@defaultvalue.org>
> Date: Sat, 06 Jul 2024 15:32:17 -0500
>
> * Problem
>
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale. For Linux, as an example, only the null character is an invalid
> user/group/filename byte, while for UTF-8, a much smaller set of bytes
> are valid[1].
>
> As an example, "µ" (Greek Mu) when encoded as Latin-1 is 0xb5, which is
> a completely invalid UTF-8 byte, but a perfectly legitimate Linux file
> name. As a result, (readdir dir) will return a corrupted value when the
> locale is set to UTF-8.
>
> You can try it yourself from bash if your current locale uses an
> LC_CTYPE that's incompatible with 0xb5:
>
> $ locale | grep LC_CTYPE
> LC_CTYPE="en_US.utf8"
> $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
> ("guile" "?")
>
> You end up with a question mark instead of the correct value. This
> makes it difficult to write programs that don't risk silent corruption
> unless all the relevant system data is known to be compatible with the
> user's current locale.
>
> It's perhaps worth noting that, while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:
> UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
> handle them as strings (maybe you want to correctly upcase/downcase
> them), you have to know (somehow) the encoding that applies to each one.
> Otherwise, in the limiting case, you can only assume "bytes".
Why not learn from GNU Emacs, which already solved this very hard
problem, and has many years of user and programming experience to
prove it, instead of inventing Guile's own solution?
Here's what we learned in Emacs since 1997 (when Emacs 20.1 was
released, the first version that tried to provide an environment
supporting multiple languages and encodings at the same time):
. Locales are not a good mechanism for this. A locale supports a
single language/encoding, and switching the locale each time you
need a different one is costly and makes many simple operations
cumbersome, and the code hard to read.
. It follows that relying on libc functions that process non-ASCII
characters is also not the best idea: those functions depend on the
locale, and thus force the programmer to use locales and switch
them as needed.
. Byte sequences that cannot be decoded for some reason are a fact of
life, and any real-life programming system must be able to deal
with them in a reasonable and efficient way.
. Therefore, Emacs has arrived at the following system, which we have
used for the last 15 years without any significant changes:
- When text is read from an external source, it is _decoded_ into
the internal representation of characters. When text is written
to an external destination, it is _encoded_ using an appropriate
codeset.
- The internal representation is a superset of UTF-8, in that it
is capable of representing characters for which there are no
Unicode codepoints (such as GB 18030, some of whose characters
don't have Unicode counterparts; and raw bytes, used to
represent byte sequences that cannot be decoded). It uses
5-byte UTF-8-like sequences for these extensions.
- The codesets used to decode and encode can be selected by simple
settings, and have defaults which are locale- and
language-aware. When the encoding of external text is not
known, Emacs uses a series of guesses, driven by the locale, the
nature of the source (e.g., file name), user preferences, etc.
Encoding generally reuses the same codeset used to decode (which
is recorded with the text), and the Lisp program can override
that.
- Separate global variables and corresponding functions are
provided for decoding/encoding stuff that comes from several
important sources and goes to the corresponding destinations.
Examples include en/decoding of file names, en/decoding of text
from files, en/decoding values of environment variables and
system messages (e.g., messages from strerror), and en/decoding
text from subordinate processes. Each of these gets the default
value based on the locale and the language detected at startup,
but a Lisp program can modify each one of them, either
temporarily or globally. There are also facilities for adapting
these to specific requirements of particular external sources
and destinations: for example, one can define special codesets
for encoding and decoding text from/to specific programs run by
Emacs, based on the program names. (E.g., Git generally wants
UTF-8 encoding regardless of the locale.) Similarly, some
specific file names are known to use certain encodings. All of
these are used to determine the proper codeset when the caller
didn't specify one.
- Emacs has its own code for code-conversion, for moving by
characters through multibyte sequences, for producing a Unicode
codepoint from a byte sequence in the super-UTF-8 representation
and back, etc., so it doesn't use libc routines for that, and
thus doesn't depend on the current locale for these operations.
- APIs are provided for "manual" encoding and decoding. A Lisp
program can read a byte stream, then decode it "manually" using
a particular codeset, as deemed appropriate. This makes it
possible to handle complex situations where a program receives
stuff whose encoding can only be determined by examining the raw
byte stream (a typical example is a multipart email message with
a MIME charset header for each part). A short Lisp sketch of
some of these facilities appears after this list.
- Emacs also has tables of Unicode attributes of characters
(produced by parsing the relevant Unicode data files at build
time), so it can up/down-case characters, determine their
category (letters, digits, punctuation, etc.) and script to
which they belong, etc. -- all with its own code, independent of
the underlying libc.
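To make the above more concrete, here is a minimal Emacs Lisp sketch
of some of these facilities. The variable and function names are real
Emacs primitives; the file name, program name, and byte values are
illustrative assumptions only:

  ;; Per-purpose defaults, normally derived from the locale at startup:
  file-name-coding-system     ; used to encode/decode file names
  locale-coding-system        ; system messages, environment variables

  ;; Overriding the default for a single operation (the file name is
  ;; hypothetical):
  (let ((coding-system-for-read 'shift_jis))
    (insert-file-contents "legacy.txt"))

  ;; Per-program override, e.g. forcing UTF-8 when talking to Git:
  (add-to-list 'process-coding-system-alist
               '("git" . (utf-8 . utf-8)))

  ;; "Manual" decoding of raw bytes obtained elsewhere (here, the
  ;; UTF-8 encoding of the micro sign):
  (decode-coding-string "\302\265" 'utf-8)   ; => "µ"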
This is no doubt a complex system that needs a lot of code. But it
does work, and works well, as proven by years of experience. Nowadays
at least some of the functionality can be found in free libraries
which Guile could perhaps use, instead of rolling its own
implementations. And the code used by Emacs is, of course, freely
available for study and reuse.
> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.
In our experience, this is a mistake. Signaling an error for each
decoding problem produces unreliable applications that punt in too
many cases. Emacs leaves the problematic bytes alone, as raw bytes
(which are representable in the internal representation, see above),
and leaves it to higher-level application code or to the user to deal
with the results. The "generation of incorrect data" alternative is
thus avoided, because Emacs does not replace undecodable bytes with
something else.
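To illustrate (a small sketch in Emacs Lisp; the 0xb5 byte is the same
one as in the example quoted above):

  ;; "\265" is the single raw byte 0xb5, which is not valid UTF-8.
  ;; Decoding neither signals an error nor substitutes "?": the byte
  ;; is kept as a raw-byte character in the result.
  (decode-coding-string "abc\265def" 'utf-8)
      ; => "abc\265def"  (a multibyte string containing one raw byte)

  ;; Re-encoding yields the original bytes, so nothing is lost:
  (encode-coding-string
   (decode-coding-string "abc\265def" 'utf-8) 'utf-8)
      ; => "abc\265def"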
> As an incremental step, and as has been discussed elsewhere a bit, we
> might add support for uselocale()[2] and then document that the current
> recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
> data unless you're certain your program doesn't need to be general
> purpose (perhaps you're sure you only care about UTF-8 systems).
A Latin-1 locale comes with its baggage of rules, for example up- and
down-casing, character classification (letters vs. punctuation, etc.),
and other stuff. Representing raw bytes as if they were Latin-1
characters is therefore problematic and will lead to programming
errors, whereby a program cannot distinguish between a raw byte and a
Latin-1 character that have the same 8-bit value.
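Here is a hedged sketch of that ambiguity, again in Emacs Lisp terms
(the byte 0xb5 is the one from the quoted example):

  ;; Decoded as Latin-1, the raw byte 0xb5 becomes the character µ,
  ;; indistinguishable from a µ the user actually typed:
  (decode-coding-string "\265" 'latin-1)              ; => "µ"
  (equal (decode-coding-string "\265" 'latin-1) "µ")  ; => t

  ;; Kept as a raw byte instead, the distinction is preserved:
  (decode-coding-string "\265" 'utf-8)                ; => "\265"
  (equal (decode-coding-string "\265" 'utf-8) "µ")    ; => nil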
Feel free to ask any questions about the details.
HTH