bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date:	Sat, 05 Oct 2019 21:56:36 +0300

> From: ynyaaa@gmail.com
> Cc: 37580@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
> 
> Sometimes I find broken utf-8 texts on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> character is encoded as if it is a normal BMP character.
> 
> utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters.
> And an encode-decode round trip with utf-16be outputs the intended characters.
> 
> Suppose the character is #x10000;
> the corresponding pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
>   (encode-coding-string "\xD800\xDC00" 'utf-8)
>   => "\355\240\200\355\260\200"
> 
> It is not decoded with utf-8.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                         'utf-8)
>   => "\355\240\200\355\260\200"
> 
> When the multibyte-ness is changed, the sequence is converted into
> surrogate characters.
>   (with-temp-buffer
>     (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>     (set-buffer-multibyte nil)
>     (set-buffer-multibyte t)
>     (buffer-string))
>   => "\xD800\xDC00"
> 
> The surrogate pair can be converted into the original character.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                         'utf-16be)
>   => "\x10000"

So where's the problem in all this?  AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.

I think the problem is that you enable undo.  So in that case, just
don't do that.
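
For reference, the recovery steps quoted above can be collected into one
function. This is only a sketch: the name my-recover-surrogate-utf8 and the
packaging as a single helper are assumptions made for illustration, not
something proposed in the report. It relies on the behavior of
set-buffer-multibyte exactly as the report describes it, and it works in a
temporary buffer, where undo information is not recorded.

```elisp
;; Hypothetical helper: the name and the single-function packaging are
;; assumptions; the individual steps are the ones quoted in the report.
(defun my-recover-surrogate-utf8 (bytes)
  "Return BYTES with UTF-8-encoded surrogate pairs turned into characters.
BYTES is a unibyte string such as \"\\355\\240\\200\\355\\260\\200\"."
  (with-temp-buffer
    ;; Temporary buffers do not record undo information.
    (insert bytes)
    ;; Toggling multibyte-ness reinterprets the raw bytes as characters,
    ;; which the report says yields the surrogate code points.
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    ;; A round trip through utf-16be combines each surrogate pair.
    (decode-coding-string
     (encode-coding-string (buffer-string) 'utf-16be)
     'utf-16be)))

;; Usage, with the same input as in the report:
(my-recover-surrogate-utf8
 (encode-coding-string "\xD800\xDC00" 'utf-8))
```

According to the report, performing these steps on that input ends with
"\x10000".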

Current Thread

bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents, ynyaaa, 2019/10/02
- bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents, Eli Zaretskii, 2019/10/02
  - bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents, ynyaaa, 2019/10/05
    - bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents, Eli Zaretskii <=
    - bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents, Stefan Kangas, 2019/10/28
