[Nano-devel] [patch] properly show invalid byte sequences in UTF-8

From: Benno Schulenberg
Subject: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8
Date: Mon, 13 Apr 2015 21:49:33 +0200

Hi all,

When doing for example:

    echo "0000000: 20c2 bb6f 6f6f 20c2 7878 78" | xxd -r >botched

and then opening the file 'botched' in nano (in a UTF-8 locale),
it will show:

 »ooo »xxx

But the second guillemet isn't really there (if you search for it, the
first one will be the only occurrence); it is just a ghost.  If you type
other multibyte characters before it, its appearance will change.
That's because the second (and third and fourth) bytes of the preceding
multibyte character are still present in the corresponding memory
locations of the variable used.  The patch below blots the second byte to
zero, so that only a single and thus invalid byte will be seen, which will
then be represented as "�", the Unicode replacement character.

So with the patch below, the above line will be shown as:

 »ooo �xxx
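
To illustrate the mechanism outside of nano, here is a minimal standalone
sketch in C.  It is not nano's code: the scratch buffer, MB_BUF_SIZE and
all the function names are hypothetical, and the validity check covers
only two-byte sequences.  It just shows how a reused buffer can make a
lone 0xC2 byte look like a complete character, and how writing a
terminating zero after the copied bytes prevents that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MB_BUF_SIZE 8   /* hypothetical scratch-buffer size */

/* Return 1 if s starts with a well-formed two-byte UTF-8 sequence:
 * a lead byte 0xC2..0xDF followed by a continuation byte 0x80..0xBF. */
static int starts_two_byte_seq(const char *s)
{
    unsigned char lead = (unsigned char)s[0];
    unsigned char next = (unsigned char)s[1];

    return lead >= 0xC2 && lead <= 0xDF && (next & 0xC0) == 0x80;
}

/* Copy len bytes of src into the reused scratch buffer.  When
 * terminate is nonzero, also write a zero byte right after the copied
 * bytes, so that nothing from an earlier character can follow them. */
static void load_char(char *scratch, const char *src, size_t len,
                      int terminate)
{
    assert(len < MB_BUF_SIZE);
    memcpy(scratch, src, len);
    if (terminate)
        scratch[len] = '\0';
}

/* Load a guillemet (C2 BB) and then a lone C2 byte into the same
 * scratch buffer.  Return 1 if the buffer afterwards looks like a
 * valid two-byte character, that is, if the "ghost" appears. */
static int ghost_appears(int terminate)
{
    char scratch[MB_BUF_SIZE] = {0};

    load_char(scratch, "\xC2\xBB", 2, terminate);  /* the real guillemet */
    load_char(scratch, "\xC2", 1, terminate);      /* the invalid lone byte */

    return starts_two_byte_seq(scratch);
}
```

In this sketch, ghost_appears(0) returns 1, because the stale 0xBB still
sits behind the lone 0xC2, while ghost_appears(1) returns 0, because the
zero byte leaves only a single, invalid byte to be seen.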

Index: src/winio.c
--- src/winio.c (revision 5195)
+++ src/winio.c (working copy)
@@ -2043,6 +2043,10 @@
            char *nctrl_buf_mb = charalloc(mb_cur_max());
            int nctrl_buf_mb_len, i;
+           /* Make sure an invalid sequence starter is chopped off
+            * after the first byte. */
+           null_at(&buf_mb, buf_mb_len);
            nctrl_buf_mb = mbrep(buf_mb, nctrl_buf_mb,

