Re: tr(1) with multibyte character support
From: Pádraig Brady
Subject: Re: tr(1) with multibyte character support
Date: Fri, 15 Sep 2017 21:31:56 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 15/09/17 00:15, Assaf Gordon wrote:
> Hello,
>
> I'm looking into adding multibyte support to tr(1), and interested in
> some feedback.
>
>
> 1. "-C" vs "-c"
> ---------------
>
> The POSIX tr(1) page says:
> "-c Complement the set of values specified by string1.
> -C Complement the set of characters specified by string1."
> ( http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html )
>
> This I take to mean:
> "-c" is single-bytes (=values) regardless of locale,
> "-C" is multibyte characters, depending on locale.
>
> First,
> Is the above correct?
The standard is a bit confusing here, but I think the above is correct.
I find it strange that the byte/char distinction is only made for --complement.
Also, which one does --complement imply? It should probably be -C, since I'm
guessing -c is an older option left specifying bytes for backwards compat
reasons.
> Second,
> Assuming it is correct, is the following expected output correct?
>
> The UTF-8 sequence '\316\243' is U+03A3 GREEK CAPITAL LETTER SIGMA 'Σ'.
> The UTF-8 sequence '\316\250' is U+03A8 GREEK CAPITAL LETTER PSI 'Ψ'.
>
> POSIX unibyte locale and lower-case "-c":
>
> printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250'
> => '\316\316\250'
>
ack
>
> UTF-8 locale but lower-case "-c", input set should be treated
> as two separate single-byte octets:
>
> printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250'
> => '\316\316\250'
>
ack
> POSIX unibyte locale and upper-case "-C", input set should be treated
> as two separate single-byte octets:
>
> printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250'
> => '\316\316\250'
Right, if hard_locale() == false,
which might not be the case on some setups that assume UTF-8.
> UTF-8 locale with upper-case "-C", input is a one multibyte character:
>
> printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dC '\316\250'
> => '\316\250'
ack
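The four cases above can be modelled in a few lines of Python. This is only a
sketch of the intended semantics, not how tr is or would be implemented: the
function names are made up for illustration, and decoding as UTF-8 stands in
for running under an en_US.UTF-8 locale. In the C locale both -c and -C would
behave like the byte-oriented function.

```python
# Hypothetical model of "tr -d" with -c (byte values) versus -C (characters).
# Names are illustrative; UTF-8 decoding stands in for a UTF-8 locale.


def delete_complement_bytes(data: bytes, set1: bytes) -> bytes:
    """Model of -dc: keep only the byte values listed in set1."""
    keep = set(set1)
    return bytes(b for b in data if b in keep)


def delete_complement_chars(data: bytes, set1: bytes,
                            encoding: str = "utf-8") -> bytes:
    """Model of -dC in a multibyte locale: keep only the characters in set1."""
    keep = set(set1.decode(encoding))
    text = data.decode(encoding)
    return "".join(ch for ch in text if ch in keep).encode(encoding)


data = b"\316\243\316\250"   # U+03A3 SIGMA followed by U+03A8 PSI
set1 = b"\316\250"           # PSI as one character, or as two byte values

by_bytes = delete_complement_bytes(data, set1)   # lead byte of SIGMA survives
by_chars = delete_complement_chars(data, set1)   # only PSI survives
print(by_bytes)
print(by_chars)
```

The byte model keeps the stray '\316' from Σ because that value is also in the
set, which is exactly the difference between the -c and -C results quoted above.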
> 2. Invalid multibyte sequences in SET1/SET2 parameters
> ------------------------------------------------------
>
> I assume that invalid multibyte sequences in the *input* file
> must be output as-is (in accordance with other coreutils programs).
Right. Well we talked about that previously
(and the separate program for preprocessing data)
> However, what about invalid sequences in SET1/SET2 parameters?
> Can we reject them (and fail/refuse to run) ?
>
> That is, in POSIX locale, both of these are valid and mean the same
> thing (delete two octet values):
>
> LC_ALL=C tr -d '\316\250'
> LC_ALL=C tr -d '\250\316'
ack
> But in UTF8 locale, should we accept the invalid sequence:
>
> LC_ALL=en_US.UTF8 tr -d '\250\316'
>
> and treat it (silently) as two separate octets, or should we exit with
> an error message (e.g. "SET1 is not valid in this locale") ?
It would be nice to error to provide feedback for invalid chars, but...
> 3. backward incompatibility
> ---------------------------
>
> Also related to the previous item,
> I think tr(1) might be a case where adding multibyte support might break
> existing scripts, and be seen as a regression by users.
> If someone used commands like
> tr -d '\200-\377'
> tr -d '\316\250'
> And these have worked for many years regardless of locale, adding
> multibyte support might disrupt this.
>
> What do you think ? perhaps this usage is not so common, and it won't be
> too big of a disruption ?
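To make the compatibility concern concrete, here is a rough byte-oriented model
of what tr -d '\200-\377' does today regardless of locale. The function name
and the sample string are made up for illustration.

```python
# Hypothetical byte-oriented model of: tr -d '\200-\377'
# Every byte with the high bit set is dropped, whatever the locale.


def delete_high_bytes(data: bytes) -> bytes:
    """Drop all byte values in the range 0o200..0o377."""
    return bytes(b for b in data if b < 0o200)


# "naive" with a diaeresis on the i, a space, GREEK CAPITAL LETTER SIGMA.
sample = "na\u00efve \u03a3\n".encode("utf-8")
stripped = delete_high_bytes(sample)
print(stripped)
```

Scripts that rely on this to strip all non-ASCII bytes from UTF-8 input are the
ones that could change behavior if such a range were reinterpreted as
characters.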
Well, it's not silent corruption, which is better.
This gets back to my question as to why -C was introduced to
seemingly cater for this ambiguity, while the non complemented case
is left with backwards compat issues like this.
I guess the question boils down to:
is it better to provide backwards compat by falling back to byte mode for
invalid chars, or is it better to provide feedback for invalid chars
specified in the SET?
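The two options can be sketched as a single parsing step with a policy switch.
This is a hypothetical model only: parse_set1, the policy names and the use of
UTF-8 decoding in place of a real locale are all assumptions made for the sake
of the example. The error text is the one suggested earlier in the thread.

```python
# Hypothetical sketch of the two policies for an invalid SET1 in a
# multibyte locale. UTF-8 decoding stands in for the locale's encoding.


def parse_set1(raw: bytes, policy: str, encoding: str = "utf-8"):
    """Return ("chars", set_of_str) or ("bytes", set_of_int).

    policy == "fallback": invalid sequences silently switch to byte mode.
    policy == "strict":   invalid sequences are rejected with an error.
    """
    try:
        return ("chars", set(raw.decode(encoding)))
    except UnicodeDecodeError:
        if policy == "fallback":
            return ("bytes", set(raw))
        raise ValueError("SET1 is not valid in this locale")


valid = parse_set1(b"\316\250", "strict")        # one character, PSI
fallback = parse_set1(b"\250\316", "fallback")   # two separate byte values
print(valid)
print(fallback)
```

With the "strict" policy the second call would raise ValueError instead, which
is the feedback option; "fallback" keeps old scripts working at the cost of
hiding the mistake.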
Let's look at FreeBSD for comparison:
$ export LC_ALL=en_US.UTF-8
$ printf '\316\243\316\250\n' | tr -d '\316\250'
ΣΨ
$ printf '\316\243\316\250\n' | tr -d '\250\316'
ΣΨ
$ printf '\316\243\316\250\n' | tr -d $'\316\250'
Σ
$ printf '\316\243\316\250\n' | tr -d $'\250\316'
tr: Illegal byte sequence
So you can see that tr there does not concatenate the octal escapes into
multi-byte chars. It also doesn't warn about these ineffective specifications,
which can thus never match characters in the input. That's also in opposition
to the POSIX standard you linked, which states:
"\octal
...
Multi-byte characters require multiple, concatenated escape sequences of this
type, including the leading <backslash> for each byte."
Maybe FreeBSD just ignored this part of the standard due to the backwards
incompat issue and the ease with which one can specify multi-byte chars
directly, i.e. never treat \octal escapes as part of a multi-byte char, only
treat them as values?
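A small model that reproduces the four FreeBSD results shown above, assuming
that reading is right: each \octal escape in SET1 is a single value, while bytes
that arrive already expanded by the shell are decoded as characters. This is a
guess at the observable behavior, not FreeBSD's actual code; the names are
invented, and only \ooo escapes are handled (no ranges, classes or other
backslash escapes).

```python
import re

# Hypothetical model: \ooo escapes are single values, literal bytes in the
# argument are decoded as characters of the locale (UTF-8 assumed here).
_TOKEN = re.compile(r"\\([0-7]{1,3})|(.)", re.DOTALL)


def parse_set(arg: bytes, encoding: str = "utf-8"):
    """Split a SET argument into single-byte values and characters."""
    text = arg.decode(encoding)   # invalid literal bytes raise UnicodeDecodeError
    values, chars = set(), set()
    for match in _TOKEN.finditer(text):
        if match.group(1) is not None:
            values.add(int(match.group(1), 8))   # escape: one value
        else:
            chars.add(match.group(2))            # literal: one character
    return values, chars


def model_delete(data: bytes, arg: bytes, encoding: str = "utf-8") -> bytes:
    """Delete characters of data that match the parsed SET."""
    values, chars = parse_set(arg, encoding)
    out = []
    for ch in data.decode(encoding):
        encoded = ch.encode(encoding)
        # A single value can only ever match a single-byte character.
        is_value = len(encoded) == 1 and encoded[0] in values
        if ch not in chars and not is_value:
            out.append(ch)
    return "".join(out).encode(encoding)


data = b"\316\243\316\250\n"                  # SIGMA, PSI, newline
print(model_delete(data, rb"\316\250"))       # escapes reach tr: nothing deleted
print(model_delete(data, b"\316\250"))        # shell-expanded PSI: PSI deleted
```

Passing the shell-expanded invalid sequence b"\250\316" makes parse_set raise
UnicodeDecodeError, which corresponds to the "Illegal byte sequence" case.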
Here's another related part of the standard:
"The earlier version also said that octal sequences referred to collating
elements and could be placed adjacent to each other to specify multi-byte
characters. However, it was noted that this caused ambiguities because tr would
not be able to tell whether adjacent octal sequences were intending to specify
multi-byte characters or multiple single byte characters. POSIX.1-2008
specifies that octal sequences always refer to single byte binary values when
used to specify an endpoint of a range of collating elements."
Right, so I'm leaning towards the FreeBSD behavior and having octal sequences
always refer to single byte characters.
cheers,
Pádraig