Re: Locale not Obeyed by Parameter Expansion with Pattern Substitution
From: Stephane Chazelas
Subject: Re: Locale not Obeyed by Parameter Expansion with Pattern Substitution
Date: Tue, 19 Nov 2019 07:56:57 +0000
User-agent: NeoMutt/20171215
2019-11-18 20:46:26 +0000, Stephane Chazelas:
[...]
> > printf -v B '\u204B'
> > set -- ${B//?()/ }
> > echo "${@@Q}" #-> $'\342' $'\201' $'\213'
[...]
> It seems to me that zsh's approach is best:
>
> $ A=$'\u2048\201\u2048' zsh -c "printf '%q\n' \"\${A//$'\201'/:}\""
> ⁈:⁈
>
> That is replace that \201 byte, except when it's part of a
> properly encoded character.
[...]
Actually, zsh also breaks a character if the byte to be
replaced is the first byte of that character:
$ A=$'\u2048\342\u2048' zsh -c "printf '%q\n' \"\${A//$'\342'/:}\""
:$'\201'$'\210'::$'\201'$'\210'
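To make the difference between the two cases easier to see, here is a rough sketch that labels each byte of the UTF-8 encoding of U+2048 (\342 \201 \210) by its position in the character. classify_bytes is a made-up helper name, and the sketch only looks at the high bits of each byte; it does not validate the sequence (it would, for instance, label the invalid bytes 0xC0 or 0xFF as "lead").

```shell
# classify_bytes: hypothetical helper, not part of any shell.
# Prints each byte of its argument in octal, followed by its role in a
# UTF-8 sequence, judged only from the byte value (no validation).
classify_bytes() {
  for b in $(printf '%s' "$1" | od -An -v -tu1); do
    if [ "$b" -lt 128 ]; then
      kind=ascii            # 0xxxxxxx: a complete one-byte character
    elif [ "$b" -lt 192 ]; then
      kind=continuation     # 10xxxxxx: can only follow a lead byte
    else
      kind=lead             # 11xxxxxx: starts a multi-byte character
    fi
    printf '%o %s\n' "$b" "$kind"
  done
}

# The UTF-8 encoding of U+2048, spelled with octal escapes.
classify_bytes "$(printf '\342\201\210')"
```

Going by the two zsh outputs above, \201 (a continuation byte) was left alone inside the character, while \342 (the lead byte) was replaced.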
Note that in charsets like BIG5/GB18030... which have characters
whose encoding contains the encoding of other characters, bash
seems to behave better than in UTF-8.
For instance, the encoding of é in BIG5-HKSCS is 0x88 0x6d, where
0x6d is also the encoding of "m", as in ASCII.
$ printf é | iconv -t big5-hkscs | od -tc -tx1
0000000 210   m
         88  6d
0000002
$ LC_ALL=zh_HK.big5hkscs luit
$ U=Stéphane bash -c 'printf "%s\n" "${U//m}"'
Stéphane
$ U=Stéphane ksh93 -c 'printf "%s\n" "${U//m}"'
Stéphane
$ U=Stéphane zsh -c 'printf "%s\n" "${U//m}"'
Stéphane
All 3 shells OK, but:
$ U=Stéphane bash -c 'printf "%s\n" "${U//$'\''\210'\''}"'
Stmphane
$ U=Stéphane ksh -c 'printf "%s\n" "${U//$'\''\210'\''}"'
Stmphane
$ U=Stéphane zsh -c 'printf "%s\n" "${U//$'\''\210'\''}"'
Stmphane
All 3 shells "break" that é character there.
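For comparison, here is a small sketch of what a purely byte-oriented deletion does to the same string. It spells the BIG5-HKSCS bytes of "Stéphane" with octal escapes, so it needs neither that locale nor iconv. name_bytes and delete_byte are made-up helper names, and tr under LC_ALL=C is only standing in for "a tool with no notion of characters"; it is not meant to describe how any of the shells is implemented.

```shell
# Bytes of "Stéphane" in BIG5-HKSCS: é is the pair \210 m (0x88 0x6d).
name_bytes() { printf 'St\210mphane'; }

# delete_byte: hypothetical helper that removes every occurrence of one
# byte (given as a tr escape or literal), ignoring character boundaries.
delete_byte() { LC_ALL=C tr -d "$1"; }

# Removing the lead byte leaves the trail byte "m" behind.
name_bytes | delete_byte '\210'; echo

# Removing "m" byte-wise would also take the trail byte of é;
# dump the result in hex to see the orphaned 0x88.
name_bytes | delete_byte 'm' | od -An -tx1
```

The first command gives the same Stmphane as the three shells above. The second one shows the breakage that the shells avoided in the ${U//m} case.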
--
Stephane