Re: creating unibyte strings

From: Stefan Monnier
Subject: Re: creating unibyte strings
Date: Fri, 22 Mar 2019 10:23:20 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)

>> >> Which reminds me: could someone add to the module API a primitive to
>> >> build a *unibyte* string?
>> > I don't like adding such a primitive.  We don't want to proliferate
>> > unibyte strings in Emacs through that back door, because manipulating
>> > unibyte strings involves subtle issues many Lisp programmers are not
>> > aware of.
>> I don't see what's subtle about "unibyte" strings, as long as you
>> understand that these are strings of *bytes* instead of strings
>> of *characters* (i.e. they're `int8[]` rather than `wchar_t[]`).
> That's the subtlety, right there.  Handling such "strings" in Emacs
> Lisp can produce strange and unexpected results for someone who is not
> aware of the difference and its implications.

But this has nothing to do with the module API: it's no more tricky
than when doing it purely in Elisp.  Are you seriously suggesting we
deprecate unibyte strings altogether?

>> "Multibyte" strings are just as subtle (maybe more so even), yet we
>> rightly don't hesitate to offer a primitive way to construct them.
> Because we succeed in hiding the subtleties in that case,
> so the multibyte nature is not really visible on the Lisp level,
> unless you try very hard to make it so.

Then I don't know what subtleties you're talking about.
Can you give some examples of the kinds of things you're thinking of?

>> > Instead, how about doing that via vectors of byte values?
>> What's the advantage?  That seems even more convoluted: create a Lisp
>> vector of the right size (i.e. 8x the size of your string on a 64-bit
>> system), loop over your string turning each byte into a Lisp integer
>> (with the reverted API, this involves allocation of an `emacs_value`
>> box), then pass that to `concat`?
> That's one way, but I'm sure I can come up with a simpler one. ;-)

I'm all ears.
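
For what it's worth, the size estimate quoted above can be checked with a
small sketch.  It assumes one vector slot is one machine word
(`sizeof (intptr_t)`), which is how a typical build behaves but is an
assumption here; `vector_payload_bytes` and `string_payload_bytes` are
hypothetical names, not part of the module API, and object headers are
ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: each slot of a Lisp vector holds one machine word,
   so a vector of N byte values needs N words of payload.  */
static size_t vector_payload_bytes(size_t nbytes)
{
    return nbytes * sizeof(intptr_t);
}

/* A unibyte string stores exactly one byte per element.  */
static size_t string_payload_bytes(size_t nbytes)
{
    return nbytes;
}

/* Ratio between the two representations for the same data.  */
static size_t overhead_factor(size_t nbytes)
{
    return vector_payload_bytes(nbytes) / string_payload_bytes(nbytes);
}
```

On a system where `intptr_t` is 8 bytes, `overhead_factor` yields 8 for any
non-zero length, before counting the per-element conversion work.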

>> It's probably going to be even less efficient than going through utf-8
>> and back.
> I doubt that.  It's just an assignment.  And it's a rare situation
> anyway.

Why do you think it's rare?
It's pretty common to receive non-utf-8 byte streams from the external world.
And when you do receive them, they can arrive at a very fast pace and become
temporarily anything but rare.

>> Think about cases where the module receives byte strings from the disk
>> or the network and needs to pass that to `decode-coding-string`.
>> And consider that we might be talking about megabytes of strings.
> They don't need to decode, they just need to arrange for it to be
> UTF-8.

Three possibilities:
1- the C side string contains utf-8 text.
   The module API provides just the right operation, we're good to go.
2- the C side string contains text in latin-1, big5, you name it.
   The module API provides nothing convenient.  Should we force our
   module to link to C-side coding-system libraries to convert to utf-8
   before passing it on to Elisp, even though Emacs already has all
   the needed facilities?  Really?
3- The C side string contains binary data (say PNG images).
   What does "arrange for it to be UTF-8" even mean?

-- Stefan

PS: The PNG case is not hypothetical at all, it's what prompted my request.
