From: Eli Zaretskii
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 08:33:01 +0300

> From: Rob Browning <rlb@defaultvalue.org>
> Date: Sat, 06 Jul 2024 15:32:17 -0500
> 
> * Problem
> 
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale.  For Linux, as an example, only the null character is an invalid
> user/group/filename byte, while for UTF-8, a much smaller set of
> bytes is valid[1].
> 
> As an example, "µ" (Greek Mu) when encoded as Latin-1 is 0xb5, a
> byte that can never begin a valid UTF-8 sequence, but a perfectly
> legitimate Linux file name.  As a result, (readdir dir) will return
> a corrupted value when the locale is set to UTF-8.
> 
> You can try it yourself from bash if your current locale uses an
> LC_CTYPE that's incompatible with 0xb5:
> 
>     $ locale | grep LC_CTYPE
>     LC_CTYPE="en_US.utf8"
>     $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
>     ("guile" "?")
> 
> You end up with a question mark instead of the correct value.  This
> makes it difficult to write programs that don't risk silent corruption
> unless all the relevant system data is known to be compatible with the
> user's current locale.
> 
> It's perhaps worth noting that, while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:
> UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
> handle them as strings (maybe you want to correctly upcase/downcase
> them), you have to know (somehow) the encoding that applies to each one.
> Otherwise, in the limiting case, you can only assume "bytes".

Why not learn from GNU Emacs, which already solved this very hard
problem, and has many years of user and programming experience to
prove it, instead of inventing Guile's own solution?

Here's what we have learned in Emacs since 1997 (when Emacs 20.1 was
released, the first version that tried to provide an environment
supporting multiple languages and encodings at the same time):

 . Locales are not a good mechanism for this.  A locale supports a
   single language/encoding, and switching the locale each time you
   need a different one is costly and makes many simple operations
   cumbersome, and the code hard to read.
 . It follows that relying on libc functions that process non-ASCII
   characters is also not the best idea: those functions depend on the
   locale, and thus force the programmer to use locales and switch
   them as needed.
 . Byte sequences that cannot be decoded for some reason are a fact of
   life, and any real-life programming system must be able to deal
   with them in a reasonable and efficient way.
 . Therefore, Emacs has arrived at the following system, which we
   have used for the last 15 years without any significant changes
   (illustrative Emacs Lisp sketches follow the list):

    - When text is read from an external source, it is _decoded_ into
      the internal representation of characters.  When text is written
      to an external destination, it is _encoded_ using an appropriate
      codeset.
    - The internal representation is a superset of UTF-8, in that it
      is capable of representing characters for which there are no
      Unicode codepoints (such as GB 18030, some of whose characters
      don't have Unicode counterparts; and raw bytes, used to
      represent byte sequences that cannot be decoded).  It uses
      5-byte UTF-8-like sequences for these extensions.
    - The codesets used to decode and encode can be selected by simple
      settings, and have defaults which are locale- and
      language-aware.  When the encoding of external text is not
      known, Emacs uses a series of guesses, driven by the locale, the
      nature of the source (e.g., file name), user preferences, etc.
      Encoding generally reuses the same codeset used to decode (which
      is recorded with the text), and the Lisp program can override
      that.
    - Separate global variables and corresponding functions are
      provided for decoding/encoding stuff that comes from several
      important sources and goes to the corresponding destinations.
      Examples include en/decoding of file names, en/decoding of text
      from files, en/decoding values of environment variables and
      system messages (e.g., messages from strerror), and en/decoding
      text from subordinate processes.  Each of these gets the default
      value based on the locale and the language detected at startup,
      but a Lisp program can modify each one of them, either
      temporarily or globally.  There are also facilities for adapting
      these to specific requirements of particular external sources
      and destinations: for example, one can define special codesets
      for encoding and decoding text from/to specific programs run by
      Emacs, based on the program names.  (E.g., Git generally wants
      UTF-8 encoding regardless of the locale.)  Similarly, some
      specific file names are known to use certain encodings.  All of
      these are used to determine the proper codeset when the caller
      didn't specify one.
    - Emacs has its own code for code-conversion, for moving by
      characters through multibyte sequences, for producing a Unicode
      codepoint from a byte sequence in the super-UTF-8 representation
      and back, etc., so it doesn't use libc routines for that, and
      thus doesn't depend on the current locale for these operations.
    - APIs are provided for "manual" encoding and decoding.  A Lisp
      program can read a byte stream, then decode it "manually" using
      a particular codeset, as deemed appropriate.  This makes it
      possible to handle complex situations where a program receives
      data whose encoding can only be determined by examining the raw
      byte stream
      (a typical example is a multipart email message with MIME
      charset header for each part).
    - Emacs also has tables of Unicode attributes of characters
      (produced by parsing the relevant Unicode data files at build
      time), so it can up/down-case characters, determine their
      category (letters, digits, punctuation, etc.) and script to
      which they belong, etc. -- all with its own code, independent of
      the underlying libc.
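
For instance, the decode/encode step in the first item above boils
down to two primitives, `decode-coding-string' and
`encode-coding-string'; a minimal sketch (the byte values are just
illustrative):

    ;; Bytes from an external source are decoded into the internal
    ;; representation; text headed back out is encoded into bytes.
    (decode-coding-string "\346\227\245" 'utf-8)  ; => "日" (U+65E5)
    (encode-coding-string "日" 'utf-8)            ; => "\346\227\245"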
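The per-source defaults and overrides from the fourth item look like
this in practice (the git entry and the "legacy.txt" file name are
made-up examples):

    ;; Defaults are derived from the locale at startup; a Lisp
    ;; program can override them globally...
    (setq file-name-coding-system 'utf-8)
    ;; ...per external program (Git wants UTF-8 regardless of the
    ;; locale; the regexp matches the program name)...
    (add-to-list 'process-coding-system-alist
                 '("git" . (utf-8 . utf-8)))
    ;; ...or temporarily, for a single operation:
    (with-temp-buffer
      (let ((coding-system-for-read 'shift_jis))
        (insert-file-contents "legacy.txt")))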
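"Manual" decoding amounts to reading raw bytes first and deciding on
the codeset later; a sketch of the MIME-like case (the file name is
hypothetical, and the codeset would really come from parsing the
headers):

    (with-temp-buffer
      (set-buffer-multibyte nil)                      ; unibyte buffer
      (insert-file-contents-literally "message.eml")  ; raw bytes only
      ;; ...examine the MIME headers to find each part's charset...
      (decode-coding-region (point-min) (point-max) 'iso-8859-1))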
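And the Unicode attribute tables are exposed to Lisp directly, so
case conversion and classification never touch libc:

    (get-char-code-property ?µ 'general-category)  ; => Ll (lowercase letter)
    (upcase ?a)                                    ; => ?A, locale-independent
    (aref char-script-table ?日)                   ; => han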

This is no doubt a complex system that needs a lot of code.  But it
does work, and works well, as proven by years of experience.  Nowadays
at least some of the functionality can be found in free libraries
which Guile could perhaps use, instead of rolling its own
implementations.  And the code used by Emacs is, of course, freely
available for study and reuse.

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

In our experience, this is a mistake.  Signaling an error for each
decoding problem produces unreliable applications that punt in too
many cases.  Emacs leaves the problematic bytes alone, as raw bytes
(which are representable in the internal representation, see above),
and leaves it to higher-level application code or to the user to deal
with the results.  The "generation of incorrect data" alternative is
thus avoided, because Emacs does not replace undecodable bytes with
something else.
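
You can see this directly from Lisp; the 0xb5 byte is the same one as
in your example above:

    ;; An undecodable byte survives decoding as a raw byte in the
    ;; internal representation...
    (decode-coding-string "\265" 'utf-8)  ; => 1-char string, raw byte #xB5
    ;; ...and encoding writes the original byte back out unchanged.
    (encode-coding-string
     (decode-coding-string "\265" 'utf-8) 'utf-8)  ; => "\265"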

> As an incremental step, and as has been discussed elsewhere a bit, we
> might add support for uselocale()[2] and then document that the current
> recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
> data unless you're certain your program doesn't need to be general
> purpose (perhaps you're sure you only care about UTF-8 systems).

A Latin-1 locale comes with its baggage of rules, for example up- and
down-casing, character classification (letters vs punctuation etc.),
and other stuff.  Representing raw bytes as if they were Latin-1
characters is therefore problematic, and will lead to programming
errors whereby a program cannot distinguish between a raw byte and a
Latin-1 character that share the same 8-bit value.
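
To illustrate the ambiguity in Emacs Lisp terms (same 0xb5 byte as
above):

    (decode-coding-string "\265" 'latin-1)  ; => "µ" -- a real character,
                                            ;    indistinguishable from
                                            ;    genuine Latin-1 text
    (decode-coding-string "\265" 'utf-8)    ; => raw byte #xB5, still
                                            ;    distinguishable from µ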

Feel free to ask any questions about the details.

HTH


