Re: Emacs Lisp's future

From: David Kastrup
Subject: Re: Emacs Lisp's future
Date: Mon, 13 Oct 2014 09:41:55 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Mark H Weaver <address@hidden> writes:

> David Kastrup <address@hidden> writes:
>> The conceptual lack of separation between internal and external utf-8
>> encoding leads to strangenesses like
>> scheme@(guile-user)> (with-input-from-string "\ufeff!" read-char)
>> $8 = #\!
>> Yes, this is a string->string operation losing a byte order mark in
>> spite of no indication that I would like to get encodings involved in
>> any manner.
> Byte Order Marks are an ugly corner of Unicode, and I spent a lot of
> effort to try to do the right thing here.  What we do in Guile is
> described here:
>   https://www.gnu.org/software/guile/manual/html_node/BOM-Handling.html
> I agree that we should inhibit BOM handling for string ports.
>> And when I can say "let's see where this kind of thinking will lead" and
>> find a hole to poke within a minute,
> BTW, your claim that you found this hole "within a minute" is a
> bald-faced lie and you know it.

> In <http://bugs.gnu.org/18520>, I stated my belief that our internal
> use of UTF-8 in string ports was not visible to the application as
> long as you didn't manually change the encoding for the string port or
> use seek/ftell.  That was on Sept 24th.

Uh, my claim was not that I found this problem a minute after first
thinking about GUILE's string handling.  It was more about how long it
took me after deciding to look for an example for _this_ discussion.
Now my above description may not be accurate since "let's see where this
kind of thinking will lead" is obviously not something that occurred to
me just these days, or even these years.  So it applied to the more
concrete case of reading in the GUILE manual about its BOM handling,
making the connection to string ports, thinking "now that's likely to be
another half-baked bean", and finding that issue by experiment.
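
The experiment is small enough to repeat. What follows is a minimal
sketch, assuming a Guile 2.0.x REPL like the one quoted above.
`first-char-via-port` is a hypothetical helper name, not part of Guile,
and the result of the port read may differ in versions that change BOM
handling for string ports.

```scheme
;; Hypothetical helper: read the first character of S through a string port.
(define (first-char-via-port s)
  (with-input-from-string s read-char))

;; Direct indexing never touches a port or an encoding, so this is the BOM.
(string-ref "\ufeff!" 0)

;; Going through a string port is where the session quoted above got #\!
;; instead of the BOM.
(first-char-via-port "\ufeff!")

;; #t only when the port hands back the same first character as the string.
(char=? (first-char-via-port "\ufeff!") (string-ref "\ufeff!" 0))
```

The last expression makes a convenient regression check: it is true
exactly when the string port is transparent for this input.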

To the best of my memory, this _was_ the first time I read about BOM
handling in GUILE.  That does not mean that I can vouch for this page
never having been on-screen before, or even me having skimmed through
it.  But it definitely is the first time I remember having read it now.

> You spent a *lot* of time arguing with us in that bug report, and this
> is exactly the observation you could have used to bolster your
> argument, but you never found it until now.

Because I did not look for it before.  At any rate, in relation to that
bug report I had a different actual example exposed in
<URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18520#41> (for which I
provided a patch in
<URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18536>).  That bug sat in
an attempt to create an open-coded fast path to speed up a few gratuitous
conversions when reading numbers from a string port (encode to UTF-8
because string ports are implemented as byte streams, decode when
reading, reencode when ungetting the non-digit read after the last
digit, redecode when reading it again...).  I think it was more or less
sorted into the "one bug does not demonstrate a problem" category.

That bug jumped out at me not when I was searching for a redecoding
problem but rather when I looked at the code in ports.c (which that
issue was about) after musing "how are they going to unread in a string
port?".  And the open-coded conversion was there to avoid calling the
apparently slow libunistring (yes, libunistring) function

Bugs happen.  But code that is not called in the first place can cause
no bug.

At any rate, when looking for a snappy "this might not work well with
reencoding" example on the Emacs Lisp, I first looked at surrogate

Well, (integer->char #xd800) throws an out-of-range error.  So one is
not even allowed to talk about surrogate words at the character/word
level, look for them with regular expressions and so on.
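
A minimal sketch of that refusal, assuming the Guile behavior described
above. `try-integer->char` is a hypothetical helper name; it returns the
character on success and the key of the thrown error otherwise.

```scheme
;; Hypothetical helper: convert code point N to a character, or return
;; the error key if Guile refuses the conversion.
(define (try-integer->char n)
  (catch #t
    (lambda () (integer->char n))
    (lambda (key . args) key)))

(try-integer->char #x41)    ; an ordinary code point, yields a character
(try-integer->char #xd800)  ; a high surrogate, refused as described above
```

Since no character object exists for a surrogate, nothing downstream
(string search, regular expressions, comparison) ever gets to see one.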

I have some choice words for that as well, but it's not a bug.  It's
pretty much a necessary consequence of the design that does not give
representation to input outside of the proper UTF-8 range.  Since "not
practical" was already cried down as a consideration in this discussion,
I wanted an actual bug rather than just a refusal to work with things
defined as invalid.

So I looked in the GUILE manual to see whether I could find something
about surrogate words and instead chanced upon "BOM" which apparently
_was_ allowed into strings, so I just thought "oh, that could be an
equally bad can of worms".  And admittedly, my first try was using the
string port in the other direction, namely with-output-to-string.  From
the description I'd have expected _that_ to blow up rather than the
other way round.

And the time from "oh, this one could be bad as well" to finding the
problem (I am not even sure it is a bug rather than a particularly
jarring but logical consequence of the way string ports are defined in
GUILE as a byte stream with encoding) was not more than a few minutes at
most.  A fix will likely be equally fast to do, and there is a school
that every sufficiently patched-up software is indistinguishable from

So that's the history of this bald-faced lie of mine.  I am sure that
I offer better opportunities for ad hominem attacks than that.

David Kastrup
