"builtin printf '\uFF8E'" generates broken surrogate pairs in Cygwin
From: Koichi MURASE
Subject: "builtin printf '\uFF8E'" generates broken surrogate pairs in Cygwin
Date: Sun, 6 Nov 2016 13:38:09 +0900
Hello, I would like to submit the following bashbug report.
Configuration Information [Automatically generated, do not change]:
Machine: i686
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash.exe' -DCONF_HOSTTYPE='i686'
-DCONF_OSTYPE='cygwin' -DCONF_MACHTYPE='i686-pc-cygwin'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash'
-DSHELL -DHAVE_CONFIG_H -DRECYCLES_PIDS -I.
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/include
-I/usr/src/bash-4.3.46-7.i686/src/bash-4.3/lib -DWORDEXP_OPTION -ggdb
-O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/build=/usr/src/debug/bash-4.3.46-7
-fdebug-prefix-map=/usr/src/bash-4.3.46-7.i686/src/bash-4.3=/usr/src/debug/bash-4.3.46-7
uname output: CYGWIN_NT-10.0-WOW magnate2016 2.6.0(0.304/5/3)
2016-08-31 14:27 i686 Cygwin
Machine Type: i686-pc-cygwin
Bash Version: 4.3
Patch Level: 46
Release Status: release
Description:
I noticed that the printf builtin, given an escape such as '\uFF8E',
generates broken surrogate pairs in Cygwin.
Repeat-By:
$ echo $MACHTYPE
i686-pc-cygwin
$ echo $LANG
ja_JP.UTF-8
$ printf '\uFF8E\n' # <-- U+FF8E is HALFWIDTH KATAKANA LETTER HO, a Japanese character.
?? # <-- Unexpected, unreadable characters are output.
$ /bin/printf '\uFF8E\n'
ﾎ # <-- OK with /bin/printf
$ printf '\uFF8E' | od -t x1 -A n
ed 9f bf ed be 8e # <-- This is the UTF-8 encoding of <U+D7FF U+DF8E>.
The sequence <U+D7FF U+DF8E> is not a valid surrogate pair. The first
element of a surrogate pair must be in the range U+D800 to U+DBFF, and
the second in the range U+DC00 to U+DFFF, so U+D7FF is not a high
surrogate at all. In any case, U+FF8E lies in the Basic Multilingual
Plane and should be a single UTF-16 code unit; it has no surrogate-pair
representation.
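The od output can be checked by hand with a minimal sketch like the one
below. The helper names (decode_utf8_3, is_high_surrogate,
is_low_surrogate) are hypothetical and not part of bash; the decoder
deliberately skips validation so that surrogate values can be inspected.

```c
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a
   code point.  No validation is done on purpose: a strict decoder would
   reject surrogate values, but here we want to see them. */
uint32_t
decode_utf8_3 (uint8_t b0, uint8_t b1, uint8_t b2)
{
  return ((uint32_t) (b0 & 0x0F) << 12)
         | ((uint32_t) (b1 & 0x3F) << 6)
         | (uint32_t) (b2 & 0x3F);
}

/* First element of a surrogate pair: U+D800..U+DBFF. */
int
is_high_surrogate (uint32_t c)
{
  return c >= 0xD800 && c <= 0xDBFF;
}

/* Second element of a surrogate pair: U+DC00..U+DFFF. */
int
is_low_surrogate (uint32_t c)
{
  return c >= 0xDC00 && c <= 0xDFFF;
}
```

Feeding the bytes "ed 9f bf" and "ed be 8e" to decode_utf8_3 gives 0xD7FF
and 0xDF8E. The second value passes is_low_surrogate, but the first fails
is_high_surrogate, which is why the pair is broken.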
Fix:
I think the function "u32toutf16 (c, s)" in lib/sh/unicode.c is
broken. This function is only used on systems where "sizeof
(wchar_t) == 2", and Cygwin is one of them. I also checked the source
code of the latest version, bash-4.4 (patch level 0), and the function
is not fixed there either. Characters in the range U+E000 to U+FFFF
must not be encoded as surrogate pairs; only code points from U+10000
to U+10FFFF have surrogate-pair representations.
diff --git a/lib/sh/unicode.c b/lib/sh/unicode.c
index b58eaef..29acac6 100644
--- a/lib/sh/unicode.c
+++ b/lib/sh/unicode.c
@@ -219,12 +219,12 @@ u32toutf16 (c, s)
int l;
l = 0;
- if (c < 0x0d800)
+ if (c < 0x0d800 || (c >= 0x0e000 && c <= 0x0ffff))
{
s[0] = (unsigned short) (c & 0xFFFF);
l = 1;
}
- else if (c >= 0x0e000 && c <= 0x010ffff)
+ else if (c >= 0x10000 && c <= 0x010ffff)
{
c -= 0x010000;
s[0] = (unsigned short)((c >> 10) + 0xd800);
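To see how the branch conditions lead to <U+D7FF U+DF8E>, here is a
standalone sketch of the logic before and after the patch. It is not the
bash source: the names u32toutf16_old and u32toutf16_new are
hypothetical, the fixed-width types are an assumption, and the
low-surrogate line is not visible in the diff, so it is filled in with
the standard UTF-16 formula.

```c
#include <stdint.h>

/* Branch logic modeled on the "-" lines of the diff. */
int
u32toutf16_old (uint32_t c, uint16_t *s)
{
  int l = 0;

  if (c < 0xd800)
    {
      s[0] = (uint16_t) (c & 0xFFFF);
      l = 1;
    }
  else if (c >= 0xe000 && c <= 0x10ffff)
    {
      c -= 0x10000;             /* wraps around when c < 0x10000 */
      s[0] = (uint16_t) ((c >> 10) + 0xd800);
      s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00); /* assumed, not in the diff */
      l = 2;
    }
  return l;
}

/* Branch logic modeled on the "+" lines of the diff. */
int
u32toutf16_new (uint32_t c, uint16_t *s)
{
  int l = 0;

  if (c < 0xd800 || (c >= 0xe000 && c <= 0xffff))
    {
      s[0] = (uint16_t) (c & 0xFFFF);   /* BMP: one code unit */
      l = 1;
    }
  else if (c >= 0x10000 && c <= 0x10ffff)
    {
      c -= 0x10000;
      s[0] = (uint16_t) ((c >> 10) + 0xd800);
      s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00); /* assumed, not in the diff */
      l = 2;
    }
  return l;
}
```

With c = 0xFF8E, the old logic takes the second branch: the subtraction
wraps to 0xFFFFFF8E, and truncating to 16 bits yields 0xD7FF and 0xDF8E,
matching the od output above. The patched logic stores the single unit
0xFF8E, while a supplementary character such as U+1F600 still becomes
<0xD83D 0xDE00> in both versions.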
Regards,
Koichi