Re: Unibyte characters, strings, and buffers

From: Stephen J. Turnbull
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 18:23:17 +0900

Eli Zaretskii writes:

 > This thread is about different issues.

*sigh*  No, it's about unibyte being a premature pessimization.

 > >  > Likewise examples from XEmacs, since the differences in this area
 > >  > between Emacs and XEmacs are substantial, and that precludes useful
 > >  > comparison.
 > > 
 > > "It works fine" isn't useful information?
 > No, because it describes a very different implementation.

Not at all.  The implementation of multibyte buffers is very similar.
What's different is that Emacs complifusticates matters by also having
a separate implementation of unibyte buffers, and then basically
making a union out of the two structures called "buffer".  XEmacs
simply implements binary as a particular coding system in and out of
multibyte buffers.

 > Then I guess you will have to suggest how to implement this without
 > unibyte buffers.

No, I don't.  I already told you how to do it: nuke unibyte buffers
and use iso-8859-1-unix as the binary codec.  Then you're done, except
for those applications that actually make the mistake of using unibyte
text explicitly.  If there are cases where unibyte happens implicitly,
and this transformation causes a bug, I think you'll discover unibyte
itself was problematic.
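The claim above — that a Latin-1 "binary" codec loses nothing — can be checked directly. The sketch below uses Python (which this thread later compares against) rather than Emacs Lisp, purely as a neutral illustration of the codec property being relied on:

```python
# Treating binary data as Latin-1 "text" is lossless, because Latin-1
# maps each of the 256 byte values to exactly one character.  This is
# the property an iso-8859-1-unix "binary" codec would rely on.
data = bytes(range(256))                # every possible byte value
text = data.decode('latin-1')           # bytes -> characters, one-to-one
assert len(text) == 256                 # no bytes merged or dropped
assert text.encode('latin-1') == data   # exact round-trip
```

Because the mapping is a bijection, no separate "unibyte" representation is needed to carry arbitrary bytes through a character buffer.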

 > >  > In such unibyte buffers, we need a way to represent raw bytes, which
 > >  > are parts of as yet un-decoded byte sequences that represent encoded
 > >  > characters.
 > > 
 > > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
 > Then what do you call a buffer whose "text" is encoded?


 > > XEmacs proves it -- we use (essentially) the same code in many
 > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
 > I asked you not to bring XEmacs into the discussion, because I cannot
 > talk intelligently about its implementation.  If you insist on doing
 > that, this discussion is futile from my POV.

The whole point here is that the details of the XEmacs implementation
are *irrelevant*.  The point is that we implement the same API as GNU
Emacs without unibyte buffers or the annoyances and incoherence that
come with them.

 > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as
 > > no-ops forever
 > I wasn't talking about those functions.  I was talking about the need
 > to have unibyte buffers and strings.

There is no "need for unibyte."  You're simply afraid to throw it away.

 > How is it different?  What would be the encoding of a buffer that
 > contains raw bytes?

Depends.  If it's uninterpreted bytes, "binary."  If those are
undecodable bytes, they'll be the representation of raw bytes that
occurred in an otherwise sane encoded stream, and the buffer's
encoding will be the nominal encoding of that stream.  If you want to
ensure sanity of output, then you will use an output encoding that
errors on rawbytes, and a program that cleans up those rawbytes in a
way appropriate for the application.  If you expect the next program
in the pipeline to handle them, then you use a variant encoding that
just encodes them back to the original undecodable rawbytes.
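Both behaviors described above — strict output that errors on rawbytes, and a variant codec that smuggles them through unchanged — exist concretely in Python's `surrogateescape` error handler (PEP 383, which this message cites below):

```python
# PEP 383's surrogateescape handler is one concrete form of the
# "variant encoding" described above: undecodable bytes are carried
# through as lone surrogate code points, then restored on output.
raw = b'ok \xff\xfe end'                 # \xff\xfe is invalid UTF-8
s = raw.decode('utf-8', errors='surrogateescape')
assert s[3] == '\udcff'                  # bad byte 0xff -> U+DCFF

# Strict output "errors on rawbytes", as described above:
try:
    s.encode('utf-8')
    raise AssertionError('expected an encoding error')
except UnicodeEncodeError:
    pass

# The variant output reproduces the original byte stream exactly:
assert s.encode('utf-8', errors='surrogateescape') == raw
```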

 > But that's ridiculous: a raw byte is just a single byte, so
 > string-bytes should return a meaningful value for a string of such
 > bytes.

`string-bytes' should not exist.  As I wrote earlier:

 > > You don't need `string-bytes' unless you've exposed internal
 > > representation to Lisp, then you desperately need it to write correct
 > > code (which some users won't be able to do anyway without help, cf. 
 > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
 > > *don't expose internal representation* (and the hammer marks on users'
 > > foreheads will disappear in due time, and the headaches even faster!)
 > How else would you know how many bytes will a string take on disk?

How does `string-bytes' help?  You don't know what encoding will be
used to write them, and in general it won't be the same number that
they take up in the string.

If you use iso-8859-1-unix as the coding system, then "bytes on the
wire" == "characters in the string".  No problema, señor.
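The two claims above — that on-disk size depends on the output codec, not on any internal byte count, and that Latin-1 collapses the distinction — are easy to demonstrate (Python illustration; none of these names are Emacs APIs):

```python
# How many bytes a string takes on the wire depends entirely on the
# codec, which is why probing the internal representation (as
# `string-bytes' does) answers the wrong question.
s = '\u00e9\u00e8'                      # two characters, "éè"
assert len(s.encode('utf-8')) == 4      # 2 bytes per char in UTF-8
assert len(s.encode('latin-1')) == 2    # with Latin-1, bytes == chars
```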

 > >  > So here you have already at least 2 valid reasons
 > > 
 > > No, *you* have them.  XEmacs works perfectly well without them, using
 > > code written for Emacs.
 > XEmacs also works "perfectly well" without bidi and other stuff.  That
 > doesn't help at all in this discussion.

You're right: because XEmacs doesn't handle bidi, it's irrelevant to
this discussion.  Why did *you* bring it up?

What is relevant is how to represent byte streams in Emacs.  The
obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
characters.  It is *extremely* convenient if the first 128 of those
bytes correspond to the ASCII coded character set, because so many
wire protocols use ASCII "words" syntactically.  The other 128 don't
matter much, so why not just use the extremely convenient Latin-1 set
for them?
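The convenience claimed here can be stated precisely (again a Python sketch, not an Emacs API): under the byte-to-Latin-1 mapping, the first 128 byte values decode to exactly the ASCII characters, so protocol keywords compare as ordinary strings, and every byte maps to the Unicode code point of the same number:

```python
# ASCII wire-protocol "words" survive the byte -> character mapping:
assert b'HELO'.decode('latin-1') == 'HELO'

# And the mapping is a strict one-to-one correspondence on 0..255:
assert all(ord(bytes([b]).decode('latin-1')) == b for b in range(256))
```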

 > >  > If we want to get rid of unibyte, Someone(TM) should present a
 > >  > complete practical solution to those two problems (and a few
 > >  > others), otherwise, this whole discussion leads nowhere.
 > > 
 > > Complete practical solution: "They are non-problems, forget about
 > > them, and rewrite any code that implies you need to remember them."
 > That's a slogan, not a solution.

No, it is a precise high-level design for a solution.  The same design
that XEmacs uses, and which would be quite straightforward for Emacs
to adopt since it already has multibyte buffers of the same power as
XEmacs's, though with (currently) a different internal encoding.

 > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
 > > and therefore RMS won't let me write this code for Emacs. :-)
 > Then perhaps you shouldn't be part of this discussion.

Since I've been invited to leave, I will.  My point is sufficiently
well-made for open minds to deal with the details.  I'll finish this
post on the off chance that somewhere in it will be the key that will
unlock yours.

 > > Which is precisely why we're having this thread.  If there were *no*
 > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
 > And if I had $5M in my bank account, I'd probably be elsewhere
 > enjoying myself.  IOW, how are "if there were no..." arguments useful?

Because they point out that this thread wouldn't have happened with a
different design.  I consider that design better, after experience
with two separate implementations of multibyte only (NEmacs,
XEmacs/MULE), an implementation with strict separation of bytes from
characters (Python 2 with PEP 383), an implementation with strict
separation of bytes from characters and space-efficient character
representation (Python 3 with PEPs 383 and 393), and one implementation
with unibyte (Emacs).

The first four work fine dealing with bytes and characters, and there
is no confusion.  Both Pythons can handle undecodable bytes in encoded
streams (ie, roundtrip).  Only GNU Emacs has issues about dealing with
unibyte vs. multibyte.

 > This is not a discussion about whose model is better, Emacs or XEmacs.
 > This is a discussion of whether and how we can remove unibyte buffers,
 > strings, and characters from Emacs.  You must start by understanding
 > how they are used in Emacs 24, and then suggest practical ways to
 > change that.

Well, I would have said "tell me about it", but you've asked me to
leave, so I won't.  I will say that nothing you've said so far even
hints at issues with simply removing the whole concept of unibyte.

 > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
 > and characters.  If you use it, you get what it does.

And I'm telling you those subtleties are a *problem*, and that they
solve nothing an Emacs without a unibyte concept can't handle fine.

 > If the buffer is not marked specially, how will I know to avoid
 > [inserting non-Latin-1 characters in a "binary" buffer]?

All experience with XEmacs says *you* (the human programmer) *won't*
have any problem avoiding that.  As a programmer, if you're working
with a binary protocol, you will be using binary buffers and strings,
and byte-sized integers.  If you accidentally mix things up, you'll
quickly get an encoding error on output (since the binary codec can't
output non-Latin-1 Unicode characters).

It's just not a problem in practice, and that's not why unibyte was
introduced in Emacs anyway.  Unibyte was introduced because some folks
thought working with variable-width-encoded buffers was too
inefficient so they wanted access to a flat buffer of bytes.  That's
why buffer-as-{uni,multi}byte type punning was included.

 > > But surely you have a function like `char-int-p'[1] [...]
 > There's char-valid-p, but I don't see how that is relevant to the
 > current discussion.

Only insofar as you thought char-int confusion might be an issue.

 > And I still don't see how this is relevant.  You are describing a
 > marginally valid use case, while I'm talking about use cases we meet
 > every day, and which must be supported, e.g. when some Lisp wants to
 > decode or encode text by hand.

You use `encode-coding-region' and `decode-coding-region', same as you
do now.  Do you seriously think that XEmacs doesn't support those use
cases?
