bug#18520: string ports should not have an encoding

From: David Kastrup
Subject: bug#18520: string ports should not have an encoding
Date: Mon, 22 Sep 2014 15:34:51 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

address@hidden (Ludovic Courtès) writes:

> David Kastrup <address@hidden> skribis:
>> Guile-2.2 does not consult %default-port-encoding but uses UTF-8
>> consistently (I guess, overriding set-port-encoding! will again change
>> that).
>> That still is not satisfactory.  For example, using ftell on the input
>> port will not report the string index of the string connected to the
>> string port but rather a byte index into a UTF-8 encoded version of the
>> string.  This is a number that has nothing to do with the original
>> string and cannot be used for correlating string and port.
> Right.
>> Ports fundamentally deliver characters, and so reading and writing from
>> a string source/sink should not involve _any_ coding system.
>> Files fundamentally deliver bytes, a conversion is required.  The same
>> would be the case when opening a port on a _bytevector_.  Here an
>> encoding would equally make sense, and ftell/fseek offsets would
>> naturally be in bytes.  But a port on a string delivers and consumes
>> characters.  Any conversion, even a fixed UTF-8 conversion, will destroy
>> the predictable nature of with-output-to-string and
>> with-input-from-string and the respective uses of string ports.
> Guile ports can be mixed textual/binary (unlike R6 ports, which are
> either textual or binary.)  Thus, they fundamentally deliver bytes,
> possibly with a textual conversion.

I think that is a mischaracterization.  GUILE ports at the current point
of time can _only_ be binary, to the degree that strings/texts first
have to be encoded into a binary stream before they can be passed
through a port.  Which is what this issue is about.

> Although the manual isn’t clear about it, ‘ftell’, when available,
> returns a position in bytes.

Which is not helpful if the input does not consist of bytes.

> The situation for string ports here is comparable to that of other
> ports used for textual I/O.

No.  The situation for file ports is that ftell refers to identifiable
and reproducible byte offsets of the input, the input being a file
consisting of bytes and indexed using bytes.

The situation for string ports is that ftell refers to unidentifiable
and incidental byte offsets of a temporary inaccessible ad-hoc encoding
of the input, the input being a string consisting of characters and
indexed using characters.
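
To illustrate the mismatch described above, here is a minimal sketch in Python (not GUILE) of how a character index and a byte offset into a UTF-8 encoding of the same string drift apart once non-ASCII characters appear. The helper name `utf8_byte_offset` and the sample string are hypothetical and chosen only for illustration; the sketch assumes the port's internal encoding is UTF-8, as reported for Guile-2.2 in the quoted text.

```python
def utf8_byte_offset(text: str, char_index: int) -> int:
    """Return the byte offset, within the UTF-8 encoding of `text`,
    that corresponds to the character position `char_index`.

    Hypothetical helper: it models what a byte-based ftell would report
    after reading `char_index` characters from a UTF-8 backed string port.
    """
    return len(text[:char_index].encode("utf-8"))


sample = "naïve café"  # two characters here need two bytes each in UTF-8

# Print character index next to the corresponding UTF-8 byte offset.
for index in range(len(sample) + 1):
    print(index, utf8_byte_offset(sample, index))

# After the three characters "naï" the byte offset is already 4, not 3,
# so the byte offset cannot be used to index the original string.
```

Recovering the character index from such a byte offset requires access to the encoded bytes, which is exactly what a caller of a string port does not have.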

> Do you have a situation where you were relying on 1.8’s behavior in
> that regard?  Could we see whether this can be solved differently?

I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
own UTF-8 verification, error flagging, processing and indexing.  I have
more than enough crashes and obscure errors to contend with as it
stands, so the first port will use LC_CTYPE=C (LC_CTYPE=ISO-8859-1 does
not work since then GUILE/iconv considers itself entitled to complain
about improper Latin-1) and will keep GUILE 2.0 from thinking about
UTF-8 at all.  Moving string processing to UTF-8 will be a gradual
process, and a separate project involving programmer choices about what
to represent where, and how: much of LilyPond is written in C++ and so
UTF-8 encoded strings (rather than GUILE's strings, stored internally as
either Latin-1 or UCS-4) are ubiquitous, with most of LilyPond's core
literals fitting in the common ASCII subset.

Whenever GUILE chooses to take decisions away from the user and
programmer, problems are likely to result, and workarounds will abound.
For efficiency reasons, it is not realistic to demand that any string
data passed between GUILE and LilyPond be encoded and reencoded at every
call gate: there are a great many of them.

David Kastrup
