Re: Improving the handling of system data (env, users, paths, ...)
From: Rob Browning
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 14:25:06 -0500
Jean Abou Samra <jean@abou-samra.fr> writes:
> latin1 locale is a terrible default. Virtually no Linux system these days
> has a locale encoding different than UTF-8. Except perhaps for the "C" locale,
> which people still use by habit with "LC_ALL=C" as a way to say "speak English
> please", although most Linux distros have a C.UTF-8 locale these days.
Given this thread, it might have been good if I'd included a few other
bits of context in my original post.
- Personally, as someone who spends a lot of time on a tool that's more
like tar/cp/rsync/etc. (and I suspect this sentiment applies to
anyone doing something similar), I'd be happier without "help",
i.e. at a minimum, I'd prefer solid bytevector support, and then
I'll handle any conversions when needed.
But I was trying to propose something incremental that comports with
previous (off-list) discussions, i.e. something that might be
acceptable in the near to medium term.
In truth, for system tools, I have no interest in "strings" most of
the time, and would rather not pay anything for them (imagine
regularly processing a few hundred million filesystem paths), and if
I *do* care (say for regular-expression based exclusions), then "OK,
first you have to tell us where the paths came from", i.e. we have
no way of knowing what the encodings are, other than guessing.
That said, I'd be more than happy to have *help*, e.g. bytevector
variants of various srfi-13/srfi-14 functions, and/or (as I think
suggested elsewhere in the thread) maybe even some hybrid type with
additional conveniences (if that were to make sense).
Further, you could imagine having more specific types like the
"path" type many languages have, depending on what your
cross-platform goals are, since paths aren't "just bytes"
everywhere, something which even varies in Linux per-filesystem type
-- but I didn't consider any of that "in scope" for now.
- Using Latin-1 is, of course, a hack: a pragmatic hack, but a hack
(it wasn't even my suggestion, originally). Choosing that "for now"
would just be trying to take advantage of the facts that it's likely
to pass-through without corruption, and still allows easier
manipulation via the existing string apis for some common, important
cases, i.e. where you can still get the job done while only
referring to the ascii bits (split/join on "/", for example), but
no, it's not ideal.
It also intends to avoid having to decide, and to do, anything
further (in the short term) regarding all the existing *many*
relevant system calls. You can just call them as-is with a
temporarily adjusted locale.
- I have no idea where Guile might eventually end up, but given
current resources, it seemed likely that what's potentially in scope
for now is "incremental".
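The Latin-1 pass-through property described above can be illustrated with a small sketch. It is written in Python rather than Guile, purely as a stand-in for the general byte/encoding behavior; the sample path and the helper name `split_via_latin1` are made up for illustration and are not part of any proposal in this thread.

```python
# Hypothetical path containing bytes that are not valid UTF-8.
raw_path = b"/srv/data/caf\xe9/\xff\xfe.bin"


def split_via_latin1(path: bytes) -> list[bytes]:
    """Decode as Latin-1, split with the string API, re-encode each part."""
    # Latin-1 maps every byte 0x00-0xFF to exactly one code point,
    # so decoding never fails and never loses information.
    text = path.decode("latin-1")
    return [part.encode("latin-1") for part in text.split("/")]


# The round trip is lossless for any byte sequence.
assert raw_path.decode("latin-1").encode("latin-1") == raw_path

# Splitting and joining only touches the ASCII "/" separator,
# so the non-ASCII bytes pass through untouched.
parts = split_via_latin1(raw_path)
assert b"/".join(parts) == raw_path

# By contrast, strict UTF-8 decoding rejects the same input.
try:
    raw_path.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The point is only that operations which refer to ASCII bytes alone (such as split/join on "/") keep working, while the remaining bytes are carried along unchanged.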
I'll also say that the broader discussion is interesting, and I do like
to better understand how other systems work.
> Le samedi 06 juillet 2024 à 15:32 -0500, Rob Browning a écrit :
>
>> The most direct (and compact, if we do convert to UTF-8) representation
>> would be bytevectors, but then you would have a much more limited set of
>> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>> unless we expanded them (likely re-using the existing code paths). Of
>> course you could still convert to Latin-1, perform the operation, and
>> convert back, but that's not ideal.
> Why is that "not ideal"? The (ice-9 iconv) API is convenient,
> locale-independent and thread-safe.
I meant that round-tripping through Latin-1 every time you want to call
say string-split on "/" isn't ideal as compared to a bytevector friendly
splitter. And if we do switch to UTF-8 internally, it'll also require
copying/converting the bytes since non-ascii bytes become multibyte.
(Given the UTF-8 work, I've also speculated about the fact that we could
probably re-use many, if not all of the optimized "ascii paths" that
I've included in the various functions there (srfi-13, srfi-14, etc.),
to implement bytevector friendly variants without much additional
work.)
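As a rough illustration of the two costs mentioned above (the extra round trip, and the growth when non-ASCII bytes become multibyte under a UTF-8 internal representation), here is a sketch in Python rather than Guile. The function names `split_bytes` and `split_round_trip` and the sample path are hypothetical; Python's `bytes.split` merely stands in for a bytevector-friendly splitter.

```python
def split_bytes(path: bytes, sep: bytes = b"/") -> list[bytes]:
    """Split directly on the separator byte; no decoding step at all."""
    return path.split(sep)


def split_round_trip(path: bytes, sep: str = "/") -> list[bytes]:
    """Decode to a Latin-1 string, split, then encode every part back."""
    return [part.encode("latin-1") for part in path.decode("latin-1").split(sep)]


# Hypothetical path with one non-ASCII byte (0xE9).
raw_path = b"/home/r\xe9my/notes.txt"

# Both approaches yield the same parts; the second one just does more work.
assert split_bytes(raw_path) == split_round_trip(raw_path)

# If the Latin-1-decoded string is stored as UTF-8 internally, each byte
# >= 0x80 turns into two bytes, so the data must be converted, not just copied.
utf8_internal = raw_path.decode("latin-1").encode("utf-8")
growth = len(utf8_internal) - len(raw_path)
```

Here `growth` equals the number of bytes at or above 0x80 in the input, since each of those becomes a two-byte UTF-8 sequence.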
Thanks
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4