
Re: `write-region' writes different bytes than passed to it?


From: Philipp Stephani
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 10 Feb 2019 20:06:57 +0100

On Sun, Dec 23, 2018 at 16:21, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 22 Dec 2018 23:58:07 +0100
> > Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2.  Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> > >
> >
> > Is that documented somewhere?
>
> Which part(s)?

All of it? ;)
Basically, "what is the behavior of write-region".
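For concreteness, the surprising case looks roughly like this (a sketch; the
temp-file handling is mine, and the single-byte result is the behavior Eli
describes above):

```elisp
;; "\xC1\xB2" happens to be the internal multibyte representation of
;; the raw byte #xF2, so on output Emacs emits the single byte #xF2
;; rather than the two bytes that were passed in.
(let ((coding-system-for-write 'utf-8-unix)
      (file (make-temp-file "write-region-test")))
  (write-region (unibyte-string #xC1 #xB2) nil file)
  ;; Read the file back literally to inspect the bytes on disk:
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (buffer-string)))   ; a one-byte string containing #xF2
```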

>
> > Or, in other words, what are the semantics of
> >
> > (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
> >
> > ?
> >
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
> > 4. STRING is a multibyte string with at least one element that is not
> > a Unicode scalar value
>
> You are actually asking what code conversion does in these cases, so
> let's limit the discussion to that part.  write-region is not really
> relevant here.
>
> One technicality before I answer the question: there are no "Unicode
> scalar values" in Emacs strings and buffers.  The internal
> representation is a multibyte one, so any non-ASCII character, be it a
> valid Unicode character or a raw byte, is always stored as a multibyte
> sequence.  So let's please use a less confusing wording, like
> "strictly valid UTF-8 sequence" or something to that effect.

I don't think we should change the terminology. Emacs multibyte
strings are sequences of integers (in most cases, scalar values), not
UTF-8 strings. They are internally represented as byte arrays, but
that's a different story.
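To illustrate the point: at the Lisp level a multibyte string exposes integer
scalar values, not its internal UTF-8-like bytes; encoding is a separate,
explicit step:

```elisp
;; A multibyte string behaves as a sequence of integers;
;; its internal byte representation is not visible:
(multibyte-string-p "é")     ; => t
(string-to-list "é")         ; => (233), the scalar value U+00E9
(length "é")                 ; => 1, one character, not two bytes
;; Encoding is what produces the UTF-8 bytes:
(encode-coding-string "é" 'utf-8)  ; => the unibyte string "\303\251"
```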

>
> > My example is an instance of (3). I admit I haven't read the entire
> > Emacs Lisp reference manual, but quite some parts of it, and I
> > couldn't find a description of the cases (3) and (4). Naively there
> > are a couple options:
> > - Signal an error. That would seem appropriate as such strings can't
> > be encoded as UTF-8. However, evidently Emacs doesn't do this.
> > - For case 3, write the bytes in STRING, ignoring the coding system. I
> > had expected this to happen, but apparently it isn't the case either.
>
> IMO, doing encoding on unibyte strings invokes undefined behavior,
> since encoding is only defined for multibyte strings.

That is very unfortunate. Is there any hope we can get out of that situation?

> Admittedly, we
> don't say that explicitly (we could if that's deemed important), but
> the entire description in "Coding System Basics" makes no sense
> without this assumption, and even hints at that indirectly:
>
>      The coding system ‘raw-text’ is special in that it prevents character
>   code conversion, and causes the buffer visited with this coding system
>   to be a unibyte buffer.  For historical reasons, you can save both
>   unibyte and multibyte text with this coding system.
>
> The last sentence implicitly tells you that coding systems other than
> raw-text (with the exception of no-conversion, described in the very
> next paragraph) can only be meaningfully used when writing multibyte
> text.

That's true, but very subtle. You have to read the description of one
particular coding system to infer how all the other coding systems behave.

>
> Since this is undefined behavior, Emacs can do anything that best
> suits the relevant use cases.  What it actually does is convert raw
> bytes from their internal two-byte representation to a single byte.
> Emacs jumps through many hoops to avoid exposing the internal
> multibyte representation of raw bytes outside of buffers and strings,
> and this is one of those hoops.  That's because exposing that internal
> representation is considered to be corruption of the original byte
> stream, and is not generally useful.

But in this case there is no internal representation involved at all, just
a byte array that happens to coincide with the internal representation of
something else.

>
> Signaling an error in this situation is also not useful, because it
> turns out many Lisp programs did this kind of thing in the past (Gnus
> is a notable example), and undoubtedly quite a few still do.

Well, if the behavior is unspecified, then signaling an error would
absolutely be a legal (and even expected) behavior.

>
> Emacs handles this case like it does because many years of bitter
> experience have taught us that this suits best the use cases we want
> to support.  In particular, signaling errors when encountering invalid
> UTF-8 sequences is a bad idea in a text-editing application, where
> users expect an arbitrary byte stream to pass unscathed from input to
> output.  This is why Emacs is decades ahead of other similar systems,
> such as Guile, which still throw exceptions in such cases (and claim
> that they are "correct").

I'm not saying that Emacs should necessarily start signaling errors when
visiting files with invalid UTF-8 sequences. That it degrades
gracefully in this case is very welcome and user-friendly.
But visiting a file can't result in a call to write-region with a
unibyte string, right?

>
> > > I'm not sure that single use case is important enough to change
> > > something that was working like that since Emacs 23.  Who knows how
> > > many more important use cases this will break?
> >
> > It's important for correctness and for actually describing what "encoding" 
> > does.
>
> So does labeling this as undefined behavior, which is what it is.  We
> don't really need to describe undefined behavior in detail, because
> Lisp programs shouldn't do that.

Rather than describing it in detail, the undefined behavior should be
removed. Undefined behavior makes a programming system hard to use and
reason about.

>
> > Do we expect users to explicitly put the byte sequences for the
> > (undocumented) internal encoding into unibyte strings? Shouldn't we
> > rather expect that users want to write unibyte strings as is, in all
> > cases?
>
> To avoid the undefined behavior, a Lisp program should never try to
> encode a unibyte string with anything other than no-conversion or
> raw-text (the latter also allows the application to convert EOL
> format, if that is desired).  IOW, you should have used either
> raw-text-unix or no-conversion in your example, not utf-8.

If Lisp code shouldn't try that, then the encoding functions should
signal an error in such cases.
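Until then, a caller can enforce that check itself. A minimal sketch
(`my-encode-strictly' is a hypothetical name, not an existing function):

```elisp
(defun my-encode-strictly (string coding-system)
  "Encode multibyte STRING with CODING-SYSTEM; reject unibyte input.
Unibyte strings are only allowed with `raw-text' or `no-conversion',
per the rule Eli states above."
  (when (and (not (multibyte-string-p string))
             (not (memq (coding-system-base coding-system)
                        '(raw-text no-conversion))))
    (error "Refusing to encode a unibyte string with %s" coding-system))
  (encode-coding-string string coding-system))
```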

>
> > > Oh, indeed, especially since it sounds to me like the problem is in the
> > > original code (if you don't want to change bytes, the use a `binary`
> > > encoding rather than utf-8).
> >
> > That wouldn't work with multibyte strings, right? Because they need to
> > be encoded.
>
> You can detect when a string is a unibyte string with
> multibyte-string-p, if your application needs to handle both unibyte
> and multibyte strings.  For unibyte strings, use only raw-text or
> no-conversion.

I get that, but this is too subtle and nontrivial.

>
> > > Exactly: I think we should try and get rid of those heuristics
> > > (progressively).  Actually, it's already what we've been doing since
> > > Emacs-20, tho "lately" the progression in this respect has slowed
> > > down I think.
> >
> > I'd definitely welcome any simplification in this area. There seems to
> > be a lot of incidental complexity and undocumented corner cases here.
>
> AFAIK, all of those heuristics are in the undefined-behavior
> department.  Lisp programs are well advised to stay away from that.
> If Lisp programs do stay away, they will never need to deal with the
> complexity and the undocumented corner cases.

You can't tell programmers to stay away from something. Either it
should work as documented or signal an error. Silently doing the wrong
thing is the worst choice.

>
> We keep the current behavior for backward compatibility, and for this
> reason I would suggest to avoid changes in this area unless we have a
> very good reason for a change.  What was the reason you needed to
> write something like the original snippet?
>

I'm writing a function to write an arbitrary string to a file. This
should be trivial, but as you can see, it isn't.
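Following the advice above, the least surprising version I can come up with
dispatches on the string type (a sketch; `my-write-string-to-file' is my own
name for it):

```elisp
(defun my-write-string-to-file (string file)
  "Write STRING to FILE: UTF-8 for multibyte text, raw bytes otherwise."
  (let ((coding-system-for-write
         (if (multibyte-string-p string) 'utf-8-unix 'raw-text-unix)))
    (write-region string nil file)))
```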


