tr(1) with multibyte character support
From: Assaf Gordon
Subject: tr(1) with multibyte character support
Date: Fri, 15 Sep 2017 01:15:57 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
Hello,
I'm looking into adding multibyte support to tr(1), and am interested
in some feedback.
1. "-C" vs "-c"
---------------
The POSIX tr(1) page says:
"-c Complement the set of values specified by string1.
-C Complement the set of characters specified by string1."
( http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html )
This I take to mean:
"-c" is single-bytes (=values) regardless of locale,
"-C" is multibyte characters, depending on locale.
First, is the above correct?
Second, assuming it is correct, is the following expected output correct?
The UTF-8 sequence '\316\243' is U+03A3 GREEK CAPITAL LETTER SIGMA 'Σ'.
The UTF-8 sequence '\316\250' is U+03A8 GREEK CAPITAL LETTER PSI 'Ψ'.
POSIX unibyte locale and lower-case "-c":
printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250'
=> '\316\316\250'
UTF-8 locale but lower-case "-c", input set should be treated
as two separate single-byte octets:
printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250'
=> '\316\316\250'
POSIX unibyte locale and upper-case "-C", input set should be treated
as two separate single-byte octets:
printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250'
=> '\316\316\250'
UTF-8 locale with upper-case "-C", input set is a single multibyte character:
printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dC '\316\250'
=> '\316\250'
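The four cases above can be modelled outside tr(1). The following Python
sketch only illustrates the interpretation described above; it is not how
tr is (or would be) implemented. The function names are hypothetical, and
UTF-8 decoding stands in for "the locale's multibyte encoding". It assumes
valid input; invalid sequences are a separate question.

```python
def delete_complement_octets(data: bytes, set1: bytes) -> bytes:
    """Model of 'tr -dc SET1': delete every octet that is NOT in SET1."""
    keep = set(set1)                      # set of integer octet values
    return bytes(b for b in data if b in keep)


def delete_complement_chars(data: bytes, set1: bytes,
                            encoding: str = "utf-8") -> bytes:
    """Model of 'tr -dC SET1' in a multibyte locale: delete every
    character that is NOT in SET1. Assumes data and set1 are valid
    in the given encoding."""
    keep = set(set1.decode(encoding))     # set of characters
    text = data.decode(encoding)
    return "".join(ch for ch in text if ch in keep).encode(encoding)


data = b"\316\243\316\250"   # U+03A3 followed by U+03A8, as in the printf above
set1 = b"\316\250"           # U+03A8

# "-c" in any locale, and "-C" in the POSIX locale: octet-wise
print(delete_complement_octets(data, set1))   # b'\xce\xce\xa8'

# "-C" in a UTF-8 locale: character-wise
print(delete_complement_chars(data, set1))    # b'\xce\xa8'
```

In the octet-wise model the lead byte '\316' of the sigma survives because
it is also a member of the set, which is what produces '\316\316\250'.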
2. Invalid multibyte sequences in SET1/SET2 parameters
------------------------------------------------------
I assume that invalid multibyte sequences in the *input* file
must be output as-is (in accordance with other coreutils programs).
However, what about invalid sequences in SET1/SET2 parameters?
Can we reject them (and fail/refuse to run)?
That is, in POSIX locale, both of these are valid and mean the same
thing (delete two octet values):
LC_ALL=C tr -d '\316\250'
LC_ALL=C tr -d '\250\316'
But in a UTF-8 locale, should we accept the invalid sequence:
LC_ALL=en_US.UTF8 tr -d '\250\316'
and treat it (silently) as two separate octets, or should we exit with
an error message (e.g. "SET1 is not valid in this locale")?
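To make the two options concrete, here is a small Python sketch of both
policies. It is only an illustration under assumptions: parse_set and its
reject_invalid parameter are hypothetical names, and UTF-8 decoding stands
in for validation against the locale's encoding.

```python
from typing import List


def parse_set(raw: bytes, encoding: str = "utf-8",
              reject_invalid: bool = True) -> List[bytes]:
    """Split a SET argument into its members, one byte string per member.

    Valid input yields one member per character. Invalid input is either
    rejected (reject_invalid=True) or silently split into single octets.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        if reject_invalid:
            raise ValueError("SET1 is not valid in this locale")
        return [bytes([b]) for b in raw]      # fallback: separate octets
    return [ch.encode(encoding) for ch in text]


print(parse_set(b"\316\250"))                         # [b'\xce\xa8']
print(parse_set(b"\250\316", reject_invalid=False))   # [b'\xa8', b'\xce']

try:
    parse_set(b"\250\316")                            # strict policy
except ValueError as err:
    print(err)                                        # SET1 is not valid in this locale
```

The fallback branch keeps '\250\316' working as it does today, at the cost
of one argument being read in two different ways depending on its validity.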
3. Backward incompatibility
---------------------------
Also related to the previous item: I think tr(1) may be a case where
adding multibyte support breaks existing scripts and is seen as a
regression by users.
If someone has used commands like
tr -d '\200-\377'
tr -d '\316\250'
and these have worked for many years regardless of locale, adding
multibyte support might disrupt them.
What do you think? Perhaps this usage is not so common, and it won't be
too big of a disruption?
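As an illustration of why the range form is a concern, the Python sketch
below models the octet-wise behaviour such scripts rely on, and shows that
neither endpoint of '\200-\377' is a character on its own in UTF-8. The
helper names are hypothetical and this is a model, not tr's code.

```python
def delete_octet_range(data: bytes, lo: int, hi: int) -> bytes:
    """Model of octet-wise 'tr -d LO-HI': drop every octet in [lo, hi]."""
    return bytes(b for b in data if not (lo <= b <= hi))


def is_utf8_character(raw: bytes) -> bool:
    """True if raw decodes as exactly one UTF-8 character."""
    try:
        return len(raw.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


sample = "aΣbΨc".encode("utf-8")

# Octet-wise, '\200-\377' strips every non-ASCII byte from the input.
print(delete_octet_range(sample, 0o200, 0o377))                 # b'abc'

# Character-wise, the range endpoints are not valid characters at all.
print(is_utf8_character(b"\200"), is_utf8_character(b"\377"))   # False False
```

So a multibyte-aware tr that rejects invalid sequences in SET1 would also
reject this command, even though its octet-wise meaning is unambiguous.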
Thanks for reading,
- assaf