guile-devel
Re: Improving the handling of system data (env, users, paths, ...)


From: Rob Browning
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 14:25:06 -0500

Jean Abou Samra <jean@abou-samra.fr> writes:

> latin1 locale is a terrible default. Virtually no Linux system these days
> has a locale encoding different than UTF-8. Except perhaps for the "C" locale,
> which people still use by habit with "LC_ALL=C" as a way to say "speak English
> please", although most Linux distros have a C.UTF-8 locale these days.

Given this thread, it might have been good if I'd included a few other
bits of context in my original post.

  - Personally, as someone who spends a lot of time on a tool that's more
    like tar/cp/rsync/etc. (and I suspect this sentiment applies for
    anyone doing something similar), I'd be happier without "help",
    i.e. at a minimum, I'd prefer solid bytevector support, and then
    I'll handle any conversions when needed.

    But I was trying to propose something incremental that comports with
    previous (off-list) discussions, i.e. something that might be
    acceptable in the near to medium term.

    In truth, for system tools, I have no interest in "strings" most of
    the time, and would rather not pay anything for them (imagine
    regularly processing a few hundred million filesystem paths), and if
    I *do* care (say for regular-expression based exclusions), then "OK,
    first you have to tell us where the paths came from", i.e. we have
    no way of knowing what the encodings are, other than guessing.

    That said, I'd be more than happy to have *help*, e.g. bytevector
    variants of various srfi-13/srfi-14 functions, and/or (as I think
    suggested elsewhere in the thread) maybe even some hybrid type with
    additional conveniences (if that were to make sense).

    Further, you could imagine having more specific types like the
    "path" type many languages have, depending on what your
    cross-platform goals are, since paths aren't "just bytes"
    everywhere, something which even varies in Linux per-filesystem type
    -- but I didn't consider any of that "in scope" for now.

  - Using Latin-1 is, of course, a hack: a pragmatic hack, but a hack
    (it wasn't even my suggestion, originally).  Choosing that "for now"
    would just be trying to take advantage of the facts that it's likely
    to pass through without corruption, and that it still allows easier
    manipulation via the existing string APIs for some common, important
    cases, i.e. where you can still get the job done while only
    referring to the ASCII bits (split/join on "/", for example), but
    no, it's not ideal.

    It also intends to avoid having to decide, and to do, anything
    further (in the short term) regarding all the existing *many*
    relevant system calls.  You can just call them as-is with a
    temporarily adjusted locale.

  - I have no idea where Guile might eventually end up, but given
    current resources, it seemed likely that what's potentially in scope
    for now is "incremental".
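
To make the Latin-1 pass-through point concrete, here is a minimal
sketch using the existing (ice-9 iconv) and (rnrs bytevectors) modules.
The names raw-path, path-string, parts and rejoined are made up for
illustration, and the example path is arbitrary; this only shows the
idea, not how Guile would do it internally.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; An arbitrary path as raw bytes: "/tmp/", then the bytes 0xE9 0xFF,
;; then "/x".  The byte 0xFF never occurs in valid UTF-8, so this path
;; cannot be decoded as UTF-8.
(define raw-path
  (u8-list->bytevector
   (append (map char->integer (string->list "/tmp/"))
           '(#xe9 #xff)
           (map char->integer (string->list "/x")))))

;; Latin-1 (ISO-8859-1) maps every byte to exactly one character, so
;; decoding any byte sequence succeeds.
(define path-string (bytevector->string raw-path "ISO-8859-1"))

;; Existing string operations work as long as we only look at the
;; ASCII parts, e.g. the "/" separator.
(define parts (string-split path-string #\/))

;; Joining and encoding back to Latin-1 restores the original bytes.
(define rejoined
  (string->bytevector (string-join parts "/") "ISO-8859-1"))

(display (bytevector=? raw-path rejoined))  ; prints #t
(newline)
```

The non-ASCII bytes are carried along as opaque characters; nothing
here claims to know what they mean.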

I'll also say that the broader discussion is interesting, and I do like
to better understand how other systems work.

> Le samedi 06 juillet 2024 à 15:32 -0500, Rob Browning a écrit :
>
>> The most direct (and compact, if we do convert to UTF-8) representation
>> would be bytevectors, but then you would have a much more limited set of
>> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
>> unless we expanded them (likely re-using the existing code paths).  Of
>> course you could still convert to Latin-1, perform the operation, and
>> convert back, but that's not ideal.

> Why is that "not ideal"? The (ice-9 iconv) API is convenient,
> locale-independent and thread-safe.

I meant that round-tripping through Latin-1 every time you want to call,
say, string-split on "/" isn't ideal as compared to a bytevector-friendly
splitter.  And if we do switch to UTF-8 internally, it'll also require
copying/converting the bytes since non-ASCII bytes become multibyte.
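
As a rough illustration of what a bytevector-friendly splitter could
look like, here is a minimal sketch.  bytevector-slice* and
bytevector-split are hypothetical helpers written for this example, not
existing Guile procedures, and this is a naive version rather than an
optimized one.  The last expression shows the size change when a
non-ASCII Latin-1 byte is re-encoded as UTF-8.

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 iconv))

;; Hypothetical helper: copy bytes START (inclusive) to END (exclusive)
;; of BV into a fresh bytevector.
(define (bytevector-slice* bv start end)
  (let ((out (make-bytevector (- end start))))
    (bytevector-copy! bv start out 0 (- end start))
    out))

;; Hypothetical helper: split BV at every occurrence of the byte DELIM,
;; returning a list of fresh bytevectors.  Like string-split, it keeps
;; empty components.  It scans from the end so the result list is built
;; in order without a final reverse.
(define (bytevector-split bv delim)
  (let ((len (bytevector-length bv)))
    (let loop ((i (- len 1)) (end len) (acc '()))
      (cond
       ((< i 0)
        (cons (bytevector-slice* bv 0 end) acc))
       ((= (bytevector-u8-ref bv i) delim)
        (loop (- i 1) i (cons (bytevector-slice* bv (+ i 1) end) acc)))
       (else
        (loop (- i 1) end acc))))))

;; "/a" followed by the byte 0xFF, then "/b", split on "/" (byte 47):
;; no decoding step is involved at any point.
(define pieces
  (bytevector-split (u8-list->bytevector '(47 97 #xff 47 98)) 47))

;; One Latin-1 byte (0xE9) becomes two bytes once encoded as UTF-8.
(define utf8-length
  (bytevector-length
   (string->bytevector
    (bytevector->string (u8-list->bytevector '(#xe9)) "ISO-8859-1")
    "UTF-8")))
```

Here pieces is a list of three bytevectors (empty, then 97 255, then
98), and utf8-length is 2.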

(Given the UTF-8 work, I've also speculated about the fact that we could
 probably re-use many, if not all, of the optimized "ASCII paths" that
 I've included in the various functions there (srfi-13, srfi-14, etc.),
 to implement bytevector-friendly variants without much additional
 work.)

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
