Improving the handling of system data (env, users, paths, ...)
From: Rob Browning
Subject: Improving the handling of system data (env, users, paths, ...)
Date: Sat, 06 Jul 2024 15:32:17 -0500
* Problem
System data like environment variables, user names, group names, file
paths and extended attributes (xattr), etc. are on some systems (like
Linux) binary data, and may not be encodable as a string in the current
locale. On Linux, for example, the null byte is the only byte that is
invalid everywhere in this data (a single file name component also
cannot contain '/'), while UTF-8 permits a much smaller set of byte
sequences[1].
As an example, "µ" (the micro sign, which looks like a Greek mu), when
encoded as Latin-1, is the single byte 0xb5. On its own that byte is
not valid UTF-8, but it is a perfectly legitimate Linux file name. As a
result, (readdir dir) will return a corrupted value when the locale is
set to UTF-8.
You can try it yourself from bash if your current locale uses an
LC_CTYPE that's incompatible with 0xb5:
  $ locale | grep LC_CTYPE
  LC_CTYPE="en_US.utf8"
  $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
  ("guile" "?")
You end up with a question mark instead of the correct value. This
makes it difficult to write programs that don't risk silent corruption
unless all the relevant system data is known to be compatible with the
user's current locale.
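The same mismatch can be checked from inside Guile, without going
through the shell. This is only a minimal sketch: it assumes the
(ice-9 iconv) procedure bytevector->string with the 'error conversion
strategy, and the helper name decodes-as? is made up for illustration.

```scheme
(use-modules (rnrs bytevectors) (ice-9 iconv))

;; The single byte 0xb5: "µ" in Latin-1, and a legal Linux file name.
(define mu-bytes (u8-list->bytevector '(#xb5)))

;; Hypothetical helper: #t if BV decodes cleanly in ENCODING, else #f.
;; The 'error strategy asks Guile to raise instead of substituting.
(define (decodes-as? bv encoding)
  (catch #t
    (lambda () (bytevector->string bv encoding 'error) #t)
    (lambda _ #f)))

(define latin-1-ok? (decodes-as? mu-bytes "ISO-8859-1"))
(define utf-8-ok? (decodes-as? mu-bytes "UTF-8"))
```

One would expect latin-1-ok? to be true and utf-8-ok? to be false,
which is the error-instead-of-"?" behavior that is missing from the
program-arguments example above.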
It's perhaps worth noting that, while typically unlikely, any given
directory could contain paths in an arbitrary collection of encodings:
UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
handle them as strings (maybe you want to correctly upcase/downcase
them), you have to know (somehow) the encoding that applies to each one.
Otherwise, in the limiting case, you can only assume "bytes".
* Improvements
At a minimum, I suggest Guile should produce an error by default
(instead of generating incorrect data) when the system bytes cannot be
encoded in the current locale.
There should also be some straightforward, thread-safe way to write code
that accesses and manipulates system data efficiently and without
corruption.
As an incremental step, and as has been discussed elsewhere a bit, we
might add support for uselocale()[2] and then document that the current
recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
data unless you're certain your program doesn't need to be general
purpose (perhaps you're sure you only care about UTF-8 systems).
A program intended to work everywhere might then do something like
this:
  ...
  #:use-module ((guile locale)
                #:select (iso-8859-1 with-locale))
  ...
  (define (environment name)
    (with-locale iso-8859-1 (getenv name)))
There are disadvantages to this approach, but it's a fairly easy
improvement.
Some potential disadvantages:
- In cases where the system data was actually UTF-8, non-ASCII
characters will be displayed "completely wrong", i.e. mapped to
"random" other characters according to the Latin-1 correspondences.
- You have to pay whatever cost is involved in switching locales, and
in encoding/decoding the bytes, even if you only care about the
bytes.
- If any manipulations of the string representing the system data end
up performing Unicode canonicalizations or normalizations, the data
could still be corrupted. I don't *think* Guile itself ever does
that implicitly.
- Less importantly, if we switch the internal string representation to
UTF-8 (proposed[4]), then non-ASCII bytes in the data will require
two bytes in memory.
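To make the Latin-1 trade-off concrete, here is a small sketch using
the existing (ice-9 iconv) conversions rather than the proposed
with-locale, which does not exist yet. It illustrates why Latin-1 is
lossless for arbitrary bytes, and also the first disadvantage above.
All variable names are just for illustration.

```scheme
(use-modules (rnrs bytevectors) (ice-9 iconv))

;; Every possible byte value, 0 through 255.
(define all-bytes (u8-list->bytevector (iota 256)))

;; Decode as Latin-1: each byte becomes the character with the same
;; code point, so no byte is ever rejected or replaced.
(define as-string (bytevector->string all-bytes "ISO-8859-1"))

;; Encoding again should give back exactly the original bytes.
(define round-trips?
  (bytevector=? all-bytes (string->bytevector as-string "ISO-8859-1")))

;; First disadvantage: "µ" stored as UTF-8 is the two bytes c2 b5, and
;; reading those as Latin-1 yields two characters, U+00C2 and U+00B5.
(define utf-8-mu (u8-list->bytevector '(#xc2 #xb5)))
(define misread (bytevector->string utf-8-mu "ISO-8859-1"))
(define misread-code-points (map char->integer (string->list misread)))
```

The round trip preserves the bytes, but the string in between is only
meaningful as text if the data really was Latin-1.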
The most direct (and compact, if we do convert to UTF-8) representation
would be bytevectors, but then you would have a much more limited set of
operations available (strings have all of srfi-13, srfi-14, etc.)
unless we expanded them (likely re-using the existing code paths). Of
course you could still convert to Latin-1, perform the operation, and
convert back, but that's not ideal.
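The convert, operate, convert-back approach might look roughly like
this. It is a sketch only: call-as-latin-1 is a hypothetical helper,
not an existing Guile procedure, and it assumes the (ice-9 iconv)
conversions.

```scheme
(use-modules (rnrs bytevectors) (ice-9 iconv))

;; Hypothetical helper: apply the string procedure PROC to the bytes in
;; BV by way of a Latin-1 string, and return the result as bytes again.
(define (call-as-latin-1 proc bv)
  (string->bytevector (proc (bytevector->string bv "ISO-8859-1"))
                      "ISO-8859-1"))

;; A file name containing the byte 0xb5, i.e. "µ.txt" in Latin-1.
(define name (u8-list->bytevector '(#xb5 #x2e #x74 #x78 #x74)))

;; Swap the ".txt" suffix for ".bak" with ordinary srfi-13 procedures.
(define renamed
  (call-as-latin-1
   (lambda (s)
     (if (string-suffix? ".txt" s)
         (string-append (string-drop-right s 4) ".bak")
         s))
   name))
```

This only works while PROC returns characters that Latin-1 can
represent. A case conversion, for instance, can yield characters
outside Latin-1, in which case the conversion back to bytes would not
be faithful.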
Finally, while I'm not sure how I feel about it, one notable precedent
is Python's "surrogateescape" approach[5], which shifts any unencodable
bytes into "lone Unicode surrogates", a process which can (and of course
must) be safely reversed before handing the data back to the system. It
has its own trade-offs/(security)-concerns, as mentioned in the PEP.
[1] https://en.wikipedia.org/wiki/UTF-8#Encoding
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/uselocale.html
[3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1
[4] https://codeberg.org/rlb/guile/src/branch/utf8
[5] https://peps.python.org/pep-0383/
Thanks, and I'm happy to help with the implementation of whatever
improvements we choose, if we come to a consensus.
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4