guile-user

Re: guile can't find a chinese named file


From: David Kastrup
Subject: Re: guile can't find a chinese named file
Date: Mon, 27 Feb 2017 10:10:55 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Andy Wingo <address@hidden> writes:

> Hello,
>
> I feel the need to correct points in this mail for the benefit of
> guile-user.  No reply is needed.
>
> On Wed 15 Feb 2017 00:58, David Kastrup <address@hidden> writes:
>
>> Mike Gran <address@hidden> writes:
>>
>>> But, for what it is worth, the Latin-1/UCS-32 design decision came
>>> from a couple of conflicting requirements.  The switch happened in the
>>> 1.9.x series.
>>>
>>> There were several examples of legacy C code using Guile as an
>>> extension language that accessed the bytes of a string directly, using
>>> SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy
>>> code, we needed to retain the (then already deprecated) capability
>>> for C programs to access 8-bit-locale string internals directly.
>>
>> But if you don't know whether the strings are Latin-1 or UCS-32, that's
>> sort of academical.
>
> Not at all.  Legacy programs don't use codepoints >255.

Sort of a moot point when Guile makes the decision to interpret external
files with codepoints >255.  Not all data processed by a "legacy
program" originates inside the program.

>> The problem is that Guile is _constantly_ required to recode strings
>> it is processing.  And to add insult to injury, it cannot do this
>> without data loss when its string encoding assumptions are wrong.
>
> In Scheme, strings are sequences of characters.  Encoding and decoding
> is only needed when going to and from bytes.

A string port is strictly passing characters to characters completely
inside of Guile and its data structures and yet it needs to encode and
decode from Latin-1/UCS-32 to UTF-8.  A string port is _explicitly_ not
a binary stream (there are special binary ports for that) but a
character sequence and yet Guile is encoding and decoding for working
with its own internal data.

And the string API contains only scm_from_utf8_string (which always
requires reencoding) for accessing the whole character set.  It isn't
named scm_decode_utf8_bytestream: its target conceptually is a _string_,
yet it is expensive to pass into Guile and back out and there is no
cheaper or more transparent mechanism available.

>> PostScript files are usually encoded in Latin-1 with occasional UCS-16
>> passages.  Reading and writing and copying such files byte-correctly
>> while trying to actually parse their contents is not feasible with
>> Guile.
>
> Works perfectly well.  The web server for example reads the request as
> Latin-1 and the body as something else.  Just re-set the port encoding
> and there you go.

Reading and writing and copying cannot always afford to _parse_ and
switch encodings based on the content.  It needs to work even when you
don't do that.
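
To make the copying concern concrete, here is a minimal sketch in Python
rather than Guile.  The sample data and the helper name copy_via_text
are made up for illustration, and the "UCS-16" passage is assumed to be
UTF-16 big-endian.  It passes the same bytes through a text layer once
as Latin-1 and once as UTF-8 without inspecting the content.

```python
def copy_via_text(data: bytes, encoding: str, errors: str = "strict") -> bytes:
    """Decode bytes to text and encode them again, as a textual copy would."""
    return data.decode(encoding, errors).encode(encoding, errors)


# Hypothetical PostScript-like sample: a Latin-1 passage followed by one
# UTF-16 (big-endian) code unit.  Not taken from any real file.
sample = b"(caf\xe9) show\n" + "\u4e2d".encode("utf-16-be")

# Latin-1 maps every byte to exactly one character, so the copy is exact.
latin1_copy = copy_via_text(sample, "latin-1")

# As UTF-8, the lone 0xE9 byte is malformed.  With errors="replace" it
# becomes U+FFFD, so the copy no longer matches the input.
utf8_copy = copy_via_text(sample, "utf-8", errors="replace")

print("latin-1 copy identical:", latin1_copy == sample)
print("utf-8 copy identical:  ", utf8_copy == sample)
```

With errors="strict" the UTF-8 variant raises an exception instead of
altering the data, so neither mode gives a byte-correct copy here.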

>> As I said: the problem is not the chosen internal representation.
>> The problem is that there is no API to access it, and it does not
>> even map to string ports.
>
> String ports have nothing to do with the discussion AFAIU.  (Ports in
> Guile are sequences of bytes also.

Which is exactly the problem.

> They may be accessed using textual interfaces as well.

They can _only_ be accessed using textual interfaces.  They are
character-in/character-out.

> Therefore a string port must have an associated encoding, to
> read/write the bytes.

Why does a pure character-in/character-out structure need an associated
encoding?  The rough equivalent in Emacs is the buffer (which has a
manipulation point where you can write/read but is also random-access,
so it's sort of a superset).  Buffers have an _internal_ encoding but it
isn't exposed and it is identical to strings' internal encoding.

In contrast, the internal encoding of Guile string ports _is_ exposed,
since their positioning uses byte offsets rather than character offsets
and thus is not compatible with string addressing.
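
The mismatch between the two kinds of offset can be sketched in Python
(not Guile; the helper names byte_offset_of_char and can_resume_at are
hypothetical).  The sketch assumes a UTF-8 backing store, as in the
string port case described above.

```python
def byte_offset_of_char(text: str, char_index: int, encoding: str = "utf-8") -> int:
    """Byte offset at which the character at char_index starts."""
    return len(text[:char_index].encode(encoding))


def can_resume_at(encoded: bytes, offset: int) -> bool:
    """True if decoding as UTF-8 can start cleanly at this byte offset."""
    try:
        encoded[offset:].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "na\u00efve caf\u00e9"      # 10 characters, two of them non-ASCII
encoded = text.encode("utf-8")      # 12 bytes

char_index = text.index("c")                          # position in the string
byte_index = byte_offset_of_char(text, char_index)    # position in the bytes

print("character index:", char_index)
print("byte offset:    ", byte_index)
# Using a character index as a byte offset can land inside a character.
print("offset 3 is a character boundary:", can_resume_at(encoded, 3))
```

The two offsets agree only as long as everything before the position is
ASCII, which is why a byte-based position cannot be used directly for
string addressing.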

Emacs got rid of this catastrophic user interface mistake (responsible
for the last major wave of migration to its competitor XEmacs) in Emacs
20.3 or 20.4.  Buffers are only ever addressed using character positions
from Emacs Lisp.

It's just painful to see Guile go through all of the expensive mistakes
Emacs made 15 or 20 years ago, just at a tenth of the speed since
getting encodings wrong was seen as more of a deal-breaker with Emacs.

> But no error is possible for textual I/O with the default UTF-8
> encoding as all characters are representable.

But all bytes aren't.
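
A short Python check of both directions (a sketch, not Guile code; the
helper name decodes_as_utf8 is made up): every character can be written
out as UTF-8, but only half of the possible single-byte values can be
read back as UTF-8 on their own.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Characters to bytes: a character beyond Latin-1 encodes without error.
encoded_char = "\u4e2d".encode("utf-8")

# Bytes to characters: test each of the 256 byte values in isolation.
valid_single_bytes = [b for b in range(256) if decodes_as_utf8(bytes([b]))]

print("U+4E2D as UTF-8:", encoded_char.hex())
print("single bytes that are valid UTF-8:", len(valid_single_bytes), "of 256")
```

Only the ASCII range survives, so data containing bytes 0x80 to 0xFF
that is not actually UTF-8 cannot be read under that default without an
error or a substitution.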

> Encoding to UTF-8 is fast and space-efficient.)

There is a reason that LilyPond on Guile-2.0 runs slower by a factor
of 5 than on Guile-1.8, and the large costs associated with constant
string reencoding are definitely contributing.

-- 
David Kastrup



