bug#31679: 26.1; detect-coding-string does not detect UTF-16

bug-gnu-emacs

From:	Lars Ingebrigtsen
Subject:	bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date:	Thu, 12 Aug 2021 15:51:28 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:

>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with/guess all these oddities (as you say).
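
To illustrate the kind of code I mean for the spurious nul, here is a
minimal sketch in C.  The function name and the `unit_size' parameter
are made up for this example and are not anything in Emacs; it assumes
the caller already knows the code-unit width (1 for UTF-8-like data,
2 for UTF-16) and that the data is aligned to that width.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Emacs: return the length of BUF
   with one trailing NUL code unit removed, if there is one.
   UNIT_SIZE is the code-unit width in bytes (1, or 2 for UTF-16).  */
size_t
strip_trailing_nul_unit (const unsigned char *buf, size_t len,
                         size_t unit_size)
{
  if (unit_size == 0 || len < unit_size)
    return len;

  /* Every byte of the last code unit must be zero.  */
  for (size_t i = len - unit_size; i < len; i++)
    if (buf[i] != 0)
      return len;

  return len - unit_size;
}
```

It only ever drops one code unit, so selections that don't add the nul
come back unchanged.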

>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy.  I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so.  Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)
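
A rough sketch of that heuristic in C, to make the idea concrete.  This
is not code from coding.c: the function, the enum, and the `max_prefix'
parameter (the "X bytes" above) are all hypothetical names, and the
NUL/non-NUL case for big-endian data is my addition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type, not part of Emacs.  */
enum utf16_guess { UTF16_GUESS_NONE, UTF16_GUESS_LE, UTF16_GUESS_BE };

/* Hypothetical helper: look at no more than MAX_PREFIX bytes of BUF
   and guess UTF-16 if more than 90% of the byte pairs consist of one
   NUL byte and one non-NUL byte in a consistent order.  */
enum utf16_guess
guess_utf16 (const unsigned char *buf, size_t len, size_t max_prefix)
{
  assert (buf != NULL || len == 0);

  if (len > max_prefix)
    len = max_prefix;

  size_t pairs = len / 2;
  if (pairs == 0)
    return UTF16_GUESS_NONE;

  size_t le = 0, be = 0;
  for (size_t i = 0; i < pairs; i++)
    {
      unsigned char b0 = buf[2 * i];
      unsigned char b1 = buf[2 * i + 1];

      if (b0 != 0 && b1 == 0)
        le++;                   /* non-NUL/NUL: little-endian order */
      else if (b0 == 0 && b1 != 0)
        be++;                   /* NUL/non-NUL: big-endian order */
    }

  /* "More than 90%", kept in integer arithmetic.  */
  if (le * 10 > pairs * 9)
    return UTF16_GUESS_LE;
  if (be * 10 > pairs * 9)
    return UTF16_GUESS_BE;
  return UTF16_GUESS_NONE;
}
```

Note that this only catches text that is mostly in the Latin-1 range,
where one byte of each pair is zero; UTF-16 text that is mostly, say,
CJK has few nul bytes and would not be guessed this way.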

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no