Re: Emacs rewrite in a maintainable language
From: David Kastrup
Subject: Re: Emacs rewrite in a maintainable language
Date: Sun, 18 Oct 2015 18:56:57 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)
"John Wiegley" <address@hidden> writes:
>>>>>> Eli Zaretskii <address@hidden> writes:
>
>> One of the major lessons Emacs development learned since Emacs 20.1
>> is that raw bytes happen as part of text (a.k.a. "strings"), and
>> therefore there's a need to support a mixture of these two in the
>> same buffer/string. I think that's something Guile should support as
>> well, as that will make it a more powerful and flexible extension
>> language, able to deal with a wider range of real-life situations.
>
> I'd like to second Eli's recommendation. In real life, encoding and
> decoding of bytes to and from characters (codepoints) is never a
> simple problem. We do need good flexibility here.
Personally I have no problem with an implementation insisting on certain
properties for its internal encoding. But that implies that "internal
encoding" and "external UTF-8" may diverge when "external UTF-8" does
not exclusively contain valid UTF-8.
Maintaining that distinction for GUILE should not be hard, as its
internal encoding is currently either Latin-1 or UTF-32 (UCS-4), so it
is not as if it currently _has_ an internal UTF-8 representation for
strings, even though it has a number of functions taking UTF-8 input.
However, if "internal encoding" is not the same as "valid UTF-8"
throughout, it means that code called with it has to be able to deal
with the representations for invalid UTF-8.
Currently Emacs uses code points above the Unicode range for
representing non-Unicode characters from different encodings, and it
uses the 2-byte overlong byte sequences for 0-127 (lead bytes C0 and
C1) to represent raw bytes 128-255. That's not cast in stone but pretty
efficient (I think Python uses 3-byte surrogate sequences for raw
bytes, which is somewhat worse) and straightforward, as it keeps the
basic UTF-8 coding scheme invariants intact.
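To make the overlong scheme concrete, here is a minimal Python sketch of
the mapping as described above. It is not Emacs's actual C code: the
function names encode_raw_byte and decode_raw_byte are made up for
illustration, and the last function only shows how Python's
"surrogateescape" error handler ends up with three bytes per raw byte
once the escaped character is written out in UTF-8 form.

```python
def encode_raw_byte(b: int) -> bytes:
    """Map a raw byte 128-255 to the 2-byte overlong form of (b - 128).

    Hypothetical helper: the result uses lead byte 0xC0 or 0xC1, which
    valid UTF-8 never produces, so it cannot collide with real text.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 128-255 need a raw-byte escape")
    low7 = b & 0x7F  # the 0-127 value whose overlong encoding we borrow
    return bytes([0xC0 | (low7 >> 6), 0x80 | (low7 & 0x3F)])


def decode_raw_byte(seq: bytes) -> int:
    """Inverse of encode_raw_byte: recover the raw byte 128-255."""
    if len(seq) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = seq
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not an overlong raw-byte sequence")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)


def python_surrogate_form(b: int) -> bytes:
    """UTF-8-style bytes of the lone surrogate Python uses for raw byte b."""
    escaped = bytes([b]).decode("utf-8", errors="surrogateescape")
    return escaped.encode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    for b in (0x80, 0xE9, 0xFF):
        print(hex(b), encode_raw_byte(b).hex(), python_surrogate_form(b).hex())
```

The lead and trail bytes still follow the usual UTF-8 bit layout, so
code that merely steps over sequences by looking at the lead byte keeps
working; only code that interprets the resulting character has to know
about the escape.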
Of course, all of this can be done more simply using a UTF-32
representation, but the basic tradeoffs leading to Emacs using a
variable-size multibyte representation are still valid in my opinion.
--
David Kastrup