bug#18520: string ports should not have an encoding

From: David Kastrup
Subject: bug#18520: string ports should not have an encoding
Date: Tue, 23 Sep 2014 13:54:15 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

address@hidden (Ludovic Courtès) writes:

> David Kastrup <address@hidden> skribis:
>>> Line/column info remains identical regardless of the encoding, so I tend
>>> to think it’s more robust to use that.
>> Column info remains identical regardless of the encoding?  Since when?
> The character on line L and column M is always there, regardless of
> whether the file is encoded in UTF-8, Latin-1, etc.
> Would that work for LilyPond?

Last time I looked, in the following line x was in column 3 in latin-1
encoding and in column 2 in utf-8 encoding:
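
A minimal sketch of the effect, assuming a hypothetical line that consists
of "ä" followed by "x" and using Guile's (ice-9 iconv) module to decode the
same bytes in both encodings (columns counted from 1):

```scheme
(use-modules (ice-9 iconv))   ; string->bytevector, bytevector->string

;; Hypothetical sample line: "ä" (U+00E4) followed by "x".
(define line (string (integer->char #xe4) #\x))

;; The bytes as they would sit in a UTF-8 encoded file: C3 A4 78.
(define bytes (string->bytevector line "UTF-8"))

;; 1-based column of "x" when the same bytes are decoded with ENCODING.
(define (column-of-x encoding)
  (+ 1 (string-index (bytevector->string bytes encoding) #\x)))

(column-of-x "UTF-8")        ; ⇒ 2, "ä" is a single character
(column-of-x "ISO-8859-1")   ; ⇒ 3, the two bytes of "ä" count as two characters
```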


At any rate, we are missing the point of the issue.  The issue is not
whether a workaround may be designed for every way in which GUILE tries
tripping up its users.  The question is how GUILE may provide the least
amount of surprise to its users without sacrificing functionality.

GUILE's current implementation uses two character set conversions for
string ports.  For input string ports, the first is a batch encoding
when the string port is opened (using %default-port-encoding in
GUILE-2.0 and "UTF-8" in GUILE-2.2).  This encoding is set as the
port's encoding (I hope).  The second conversion happens on every read
operation, which decodes using whatever encoding is current for the
port at that time (the initial one unless it has been changed).
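
One way to observe which encoding a string port ends up with is to bind
%default-port-encoding around the open and ask the port afterwards.  This is
only a sketch: string-port-encoding-under is a helper name made up here, and
the values returned depend on the GUILE version, so none are shown.

```scheme
;; Open an input string port while %default-port-encoding is bound to
;; DEFAULT, then return the encoding the port reports for itself.
(define (string-port-encoding-under default)
  (with-fluids ((%default-port-encoding default))
    (call-with-input-string "x" port-encoding)))

(string-port-encoding-under "ISO-8859-1")
(string-port-encoding-under "UTF-8")
```

If both calls return the same value, the fluid is being ignored for string
ports in that version; if they differ, the batch encoding follows it.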

Accompanying the opening of a string with an encoding operation (whether
using a forced encoding or %default-port-encoding) is expensive (not
least of all because everything needs to be decoded again), leads to
arbitrary semantics for port positioning, and is asymmetric since the
port encoding is only used for reading on an input string and for
writing on an output string.

Oh, and for writing on an input string using unread-string, of course.
No kidding.  There is also a conversion in there.

Would it be worth ditching this sort of unnecessary conversion?  Well,
just look at:

    commit be7ecef05c1eea66f30360f658c610710c5cb22e
    Author: Andy Wingo <address@hidden>
    Date:   Sat Aug 31 10:44:07 2013 +0200

        unread-char: inline conversion from codepoint to bytes

        * libguile/ports.c (scm_ungetc_unlocked): Inline the conversion from
          codepoint to bytes for UTF-8 and latin-1 ports.  Speeds up a
          numbers-reading test case by 100% (!).

That sounds like quite some gain just for _simplifying_ the
back-and-forth conversion, and we could be just forgoing it instead
(yes, peek-char as getc+ungetc presents a challenge in connection with
encoding switches: I think that declaring the first impression of
peek-char as sticky would be reasonable).
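
To make the getc+ungetc shape concrete, here is a sketch at the Scheme
level.  peek-via-unread is a hypothetical helper, not GUILE's actual
peek-char: it decodes one character and pushes it back, so the pushed-back
character has to be converted to bytes again using the port's encoding at
that moment.

```scheme
;; Hypothetical helper: peek by reading a character and unreading it.
(define (peek-via-unread port)
  (let ((c (read-char port)))
    (unless (eof-object? c)
      (unread-char c port))   ; converts the character back for the port
    c))

(call-with-input-string "ab"
  (lambda (port)
    (let* ((peeked (peek-via-unread port))
           (next   (read-char port)))
      (list peeked next))))   ; ⇒ (#\a #\a)
```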

At any rate, the above commit looks like it would make a hash out of

(with-input-from-string "Huh\""
  (lambda ()
    (unread-string "\"ä" (current-input-port))))

because of a broken character range check (I cannot currently check with
a compilation of master since that takes about a day on my computer, but
I would be surprised if the above worked fine).  So yes, the required
complexity to deal with GUILE's current behavior can introduce problems.

David Kastrup
