bug with case conversion of UTF-8 characters

From: Stephane Chazelas
Subject: bug with case conversion of UTF-8 characters
Date: Thu, 22 Jan 2015 14:43:00 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' 
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' 
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL 
-DHAVE_CONFIG_H   -I.  -I../. -I.././include -I.././lib  -D_FORTIFY_SOURCE=2 -g 
-O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall
uname output: Linux host 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 
(2014-12-08) x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.3
Patch Level: 30
Release Status: release

(Debian unstable amd64)

$ LC_ALL=tr_TR.UTF-8 bash -c 'typeset -l a; a=İ; echo $a' | hd
00000000  69 b0 0a                                          |i..|
$ a=İ LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a,,}' | hd
00000000  69 b0 0a                                          |i..|

In Turkish locales, on a GNU system at least, the uppercase of i is İ,
not I, and the lowercase of I is ı, not i.

İ was properly translated to i, but there's a spurious 0xb0 byte,
which probably comes from the original İ:

$ echo İ | hd
00000000  c4 b0 0a                                          |...|
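
The two encodings can be compared directly. This is only a minimal sketch: it
doesn't depend on the locale or on bash's case conversion. `hexbytes` is a
hypothetical helper, and the octal escapes spell out the UTF-8 bytes of
İ (U+0130).

```shell
# hexbytes (hypothetical helper): print the bytes of $1 as contiguous hex.
hexbytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

upper=$(printf '\304\260')   # İ, U+0130, UTF-8 bytes: c4 b0
lower=i                      # i, U+0069, UTF-8 byte:  69

hexbytes "$upper"; echo      # prints c4b0
hexbytes "$lower"; echo      # prints 69
```

The second byte of İ is 0xb0, the same value as the stray byte that follows
the 0x69 in bash's output.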

The reverse problem:

$ a=i LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a^^}'
$ a=I LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a,,}'
$ LC_ALL=tr_TR.UTF-8 bash -c 'typeset -u a; a=ia;echo $a' | hd
00000000  69 41 0a                                          |iA.|

This affects other characters whose lower/upper case counterparts
don't have the same number of bytes in their UTF-8 encoding. Here,
in an en_US.UTF-8 locale:

$ a=$'\u027D' bash -c 'echo $a ${a^^}' | hd
00000000  c9 bd 20 e2 bd a4 03 0a                           |.. .....|
$ a=$'\u027D' zsh -c 'echo $a ${(U)a}' | hd
00000000  c9 bd 20 e2 b1 a4 0a                              |.. ....|

(This time the translated character is *larger*; there's still
a spurious 0x03 byte, which this time doesn't come from the
original character, but possibly from the stack.)
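
To make the length mismatch concrete, here is a small sketch that counts the
encoded bytes of the case pairs mentioned in this report. `nbytes` and the
variable names are hypothetical; the octal escapes are the UTF-8 encodings of
the characters, so the counts don't depend on the current locale.

```shell
# nbytes (hypothetical helper): number of bytes in $1.
nbytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

dotted_I=$(printf '\304\260')     # U+0130 (İ)
dotless_i=$(printf '\304\261')    # U+0131 (ı)
r_tail=$(printf '\311\275')       # U+027D (ɽ)
R_tail=$(printf '\342\261\244')   # U+2C64 (Ɽ)

echo "U+0069 -> U+0130: $(nbytes i) -> $(nbytes "$dotted_I") bytes"
echo "U+0049 -> U+0131: $(nbytes I) -> $(nbytes "$dotless_i") bytes"
echo "U+027D -> U+2C64: $(nbytes "$r_tail") -> $(nbytes "$R_tail") bytes"
```

The counts are 1 and 2, 1 and 2, and 2 and 3 respectively, so in each pair
the converted character has a different length from the original. Comparing
the two dumps above, bash's result also differs from zsh's in the middle byte
(bd instead of b1), not only in the trailing 03.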


