help-gnu-emacs

Re: `write-region' writes different bytes than passed to it?


From: Philipp Stephani
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 10 Feb 2019 20:15:57 +0100

On Mon, 24 Dec 2018 at 05:28, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
>
> Not sure what you mean by "unicode scalar values"

What the Unicode standard says :)

> but a multibyte
> string is a sequence of chars, i.e. a sequence of char codes (integers)
> And utf-8 is a way to encode a sequence of integer char codes into
> a sequence of bytes.

"Character" is an underspecified term, therefore I generally try to avoid it.
To recap: An Emacs Lisp multibyte string is a sequence of integers of
a certain range. The range is a superset of the set of Unicode scalar
values.
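
To make the distinction concrete, here is a minimal sketch. The helper
name `my-unicode-scalar-value-p' is made up for illustration; it only
restates the Unicode definition (any code point except the surrogates)
and compares it with what Emacs accepts as a character.

```elisp
;; Sketch only: `my-unicode-scalar-value-p' is a hypothetical helper,
;; not an existing Emacs function.
(defun my-unicode-scalar-value-p (n)
  "Return non-nil if integer N is a Unicode scalar value.
That is, N is a code point in [0, #x10FFFF] outside the surrogate
range [#xD800, #xDFFF]."
  (and (integerp n)
       (<= 0 n #x10FFFF)
       (not (<= #xD800 n #xDFFF))))

(my-unicode-scalar-value-p ?a)          ; non-nil
(my-unicode-scalar-value-p #xD800)      ; nil: surrogate code point
(my-unicode-scalar-value-p (max-char))  ; nil: beyond #x10FFFF

;; Emacs itself is more permissive: both integers are valid characters.
(characterp #xD800)                     ; t
(characterp (max-char))                 ; t
```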

>
> So your sample code will pretty much always write the utf-8
> representation of the multibyte string.
>
> [ The only exception is when the multibyte string contains chars in the
>   eight-bit charset, because those are supposed to stand for raw bytes.
>   This exception is used to make sure that if you read a file using
>   the utf-8 coding-system and the file's content is not valid utf-8,
>   writing the buffer will still generate the exact same byte sequence.  ]
>
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
>
> I don't think this case is clearly documented, indeed.
>
> I believe what happens currently is that Emacs looks at the byte
> sequence in the unibyte string as if it was the internal representation
> of a multibyte string.  Changing behavior (e.g. by simply outputting the
> bytes unchanged like I suggested) will likely affect some code out there
> somewhere.  I think it'd be a good change, tho, because I think that any
> code thus affected is likely buggy and needs to be fixed anyway (and
> actually that change might be the fix the code needs).
>
> What makes this question a bit more tricky is that when a string is all
> ASCII, Emacs tends to choose rather arbitrarily between unibyte
> and multibyte.  But if we decide that coding-system doesn't affect
> unibyte strings, then we get into trouble with
>
>     (let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))
>
> since for a purely ASCII string, we still need to do a conversion,
> so we'd need to be more careful about the distinction between unibyte and
> multibyte ASCII strings.
>
> Maybe we should just drop support for coding systems that aren't
> supersets of ASCII and be done with it, but I'm not sure we're ready to
> do that.
>


That might be one option. Others might be:
1. Signal an error whenever Emacs attempts to encode a unibyte string
and the encoding isn't "raw-text" or "no-conversion"
2. Like (1), but only signal an error if the encoding isn't ASCII-compatible
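
A rough sketch of how such a check could look, written as a standalone
helper rather than as a change to the encoder itself. The name
`my-check-unibyte-encoding' and its LENIENT argument are hypothetical,
and treating option 2 as "accept any coding system whose
`:ascii-compatible-p' property is non-nil" is just one possible reading.

```elisp
;; Sketch only: `my-check-unibyte-encoding' is a hypothetical helper,
;; not an existing Emacs function.
(defun my-check-unibyte-encoding (string coding-system &optional lenient)
  "Signal an error if unibyte STRING should not be encoded with CODING-SYSTEM.
By default, follow option 1: allow only `raw-text' and `no-conversion'.
If LENIENT is non-nil, follow option 2: allow any ASCII-compatible
coding system.  Multibyte strings always pass.  Return STRING."
  (unless (multibyte-string-p string)
    ;; Strip aliases and EOL variants such as `raw-text-unix'.
    (let ((base (coding-system-base coding-system)))
      (unless (if lenient
                  (coding-system-get base :ascii-compatible-p)
                (memq base '(raw-text no-conversion)))
        (error "Refusing to encode unibyte string with `%s'"
               coding-system))))
  string)

;; "caf\351" is a unibyte string containing one non-ASCII byte.
(my-check-unibyte-encoding "caf\351" 'no-conversion) ; returns the string
(my-check-unibyte-encoding "caf\351" 'utf-8 t)       ; returns the string
;; These two would signal an error:
;; (my-check-unibyte-encoding "caf\351" 'utf-8)
;; (my-check-unibyte-encoding "caf\351" 'utf-16 t)
```

A caller would run the check before binding `coding-system-for-write'
and calling `write-region'.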


