guile-devel

Re: Improving the handling of system data (env, users, paths, ...)


From: Eli Zaretskii
Subject: Re: Improving the handling of system data (env, users, paths, ...)
Date: Sun, 07 Jul 2024 14:04:37 +0300

> From: Jean Abou Samra <jean@abou-samra.fr>
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
> 
> On Sunday, 07 July 2024 at 08:33 +0300, Eli Zaretskii wrote:
> > 
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
> 
> 
> Guile is a Scheme implementation, bound by Scheme standards and compatibility
> with other Scheme implementations (and backwards compatibility too).

Yes, I understand that.

> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
> which quite logically is outside the Unicode code point range 0 - 0x110000.

That's not how you get a raw byte from a multibyte string in Emacs.
IOW, your code is wrong, if what you wanted was to get the 0xb5 byte.
I guess you assumed something about 'aref' in Emacs that is not true
of multibyte strings that include raw bytes.  So what you got
instead is the internal Emacs "codepoint" for that raw byte; these
"codepoints" are in the 0x3fff00..0x3fffff range.
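
To make the arithmetic concrete, here is a minimal sketch in Python
(not Emacs code).  It assumes only the numbers quoted in this message:
the 0x3fff00..0x3fffff range and the 0xb5 -> 0x3fffb5 example.  The
function and constant names are made up for illustration.

```python
# Sketch of the raw-byte <-> internal "codepoint" arithmetic discussed above.
# Assumption: the offset and range are the ones quoted in this message;
# all names here are hypothetical and are not Emacs APIs.
RAW_BYTE_BASE = 0x3FFF00
RAW_BYTE_MAX = 0x3FFFFF
UNICODE_MAX = 0x10FFFF  # largest Unicode code point


def raw_byte_to_internal(byte: int) -> int:
    """Map a byte value to its internal raw-byte "codepoint"."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    return RAW_BYTE_BASE + byte


def is_raw_byte_code(code: int) -> bool:
    """True if CODE lies in the raw-byte range quoted above."""
    return RAW_BYTE_BASE <= code <= RAW_BYTE_MAX


def internal_to_raw_byte(code: int) -> int:
    """Recover the original byte from a raw-byte "codepoint"."""
    if not is_raw_byte_code(code):
        raise ValueError("not a raw-byte codepoint")
    return code - RAW_BYTE_BASE
```

With these definitions, raw_byte_to_internal(0xb5) gives 4194229
(0x3fffb5), which is above UNICODE_MAX, and
internal_to_raw_byte(4194229) gives back 0xb5 (octal 265).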

Note that (cadr command-line-args), for example, yields "\265", as
expected.  That is, in situations where the caller's intent is clear,
Emacs converts back to a single byte automatically.  That's part of
the heuristics that took us some releases to get right.

> This doesn't work for Guile, since a character is a Unicode code point
> in the Scheme semantics.

See above: the problem doesn't exist if one uses the correct APIs.

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
> 
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.

That's okay, but what about other APIs, like conversion between
characters and their multibyte representations, returning the length
of a string in characters, etc.?  AFAIK, libiconv doesn't provide
these facilities.
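
To illustrate the kind of facility I mean: counting the characters in
a UTF-8 byte sequence, or producing the bytes for one character,
requires knowing the structure of the encoding, not just having a
converter.  A minimal Python sketch follows, assuming valid UTF-8
input; the function names are hypothetical, and this is not how Emacs
or Guile implement it.

```python
# Sketch only: character-level operations on a UTF-8 byte sequence.
# Assumes the input is valid UTF-8; function names are hypothetical.

def utf8_char_count(data: bytes) -> int:
    """Count characters by counting bytes that are not continuation bytes.

    In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so
    each byte that does not match it starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_encode_char(codepoint: int) -> bytes:
    """Return the multibyte UTF-8 representation of one code point."""
    return chr(codepoint).encode("utf-8")
```

For example, the micro sign U+00B5 takes two bytes in UTF-8, so the
length of a string in bytes and its length in characters differ as
soon as a non-ASCII character is present.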

> >     - Emacs also has tables of Unicode attributes of characters
> >       (produced by parsing the relevant Unicode data files at build
> >       time), so it can up/down-case characters, determine their
> >       category (letters, digits, punctuation, etc.) and script to
> >       which they belong, etc. -- all with its own code, independent of
> >       the underlying libc.
> 
> Also exists, and AFAICT uses GNU libunistring. See string-upcase,
> char-general-category, etc.

Fine, then it should be easier for Guile than I maybe thought to adopt
the same scheme.


