Re: [coreutils] tr: case mapping anomaly
From: Eric Blake
Subject: Re: [coreutils] tr: case mapping anomaly
Date: Fri, 24 Sep 2010 17:22:34 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/24/2010 04:47 PM, Pádraig Brady wrote:
> I was just looking at a bug reported to Fedora, where this abort()s:
> $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

Behavior is already unspecified by POSIX when string1 is longer than
string2. But given what POSIX does say:
"When both the -d and -s options are specified, any of the character
class names shall be accepted in string2. Otherwise, only character
class names lower or upper are valid in string2 and then only if the
corresponding character class ( upper and lower, respectively) is
specified in the same relative position in string1. Such a specification
shall be interpreted as a request for case conversion. When [:lower:]
appears in string1 and [:upper:] appears in string2, the arrays shall
contain the characters from the toupper mapping in the LC_CTYPE category
of the current locale. When [:upper:] appears in string1 and [:lower:]
appears in string2, the arrays shall contain the characters from the
tolower mapping in the LC_CTYPE category of the current locale. The
first character from each mapping pair shall be in the array for string1
and the second character from each mapping pair shall be in the array
for string2 in the same relative position.
Except for case conversion, the characters specified by a character
class expression shall be placed in the array in an unspecified order.
...
However, in a case conversion, as described previously, such as:
tr -s '[:upper:]' '[:lower:]'
the last operand's array shall contain only those characters defined as
the second characters in each of the toupper or tolower character pairs,
as appropriate."
I interpret this to mean that even though there are 59 lower and 56
upper in en_US, there are a fixed number of toupper case-mapping pairs,
and there are likewise a fixed number of tolower case-mapping pairs.
Therefore, [:upper:] and [:lower:] expand to the same number of array
entries (whether that is 59 pairs or 56 pairs is irrelevant), and
mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert
space to underscore and also guarantee that no lower-case letter becomes
an underscore.
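
To make the aligned case concrete, here is a minimal sketch of that kind of mapping. Pinning LC_ALL=C and wrapping the command in a function named aligned_demo are choices made for this illustration only; they are not from the bug report.

```shell
# Sketch: case conversion plus one extra pair (space -> underscore).
# LC_ALL=C is an assumption made here so that the result does not depend
# on how many upper/lower characters the locale defines.
aligned_demo() {
  printf 'hello world\n' | LC_ALL=C tr '[:lower:] ' '[:upper:]_'
}

aligned_demo   # prints HELLO_WORLD
```

Because both strings expand to the same number of entries, the trailing space and underscore line up, and no letter can land on the underscore.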
Your question is basically what we should do about the unspecified
behavior of '[:lower:] ' '[:upper:]', where string1 is longer than
string2, since that falls outside the bounds of POSIX.

> i.e. 0xDE (the last upper char) is output from:
> $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'

That matches the behavior we choose in all other instances where string1
is longer than string2, where GNU tr follows BSD behavior of padding the
last character of string2 to meet the length of string1.
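
The padding rule is easiest to see with plain characters, where the order of string2 is not in doubt. This is a sketch assuming GNU tr; -t (--truncate-set1) is a GNU option, and the function names pad_demo and truncate_demo are made up for this example.

```shell
# Default: string2 "xy" is extended with its last character to "xyyy".
pad_demo() {
  printf 'abcd\n' | tr 'abcd' 'xy'
}

# With -t (GNU tr), string1 is instead truncated to "ab"; c and d pass through.
truncate_demo() {
  printf 'abcd\n' | tr -t 'abcd' 'xy'
}

pad_demo        # prints xyyy
truncate_demo   # prints xycd (GNU tr)
```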
But since POSIX is clear that the order of [:upper:] mappings is
unspecified, I agree that the user gets no useful guarantee about which
byte is duplicated to fill out the conversion, and we are better off
rejecting that attempted usage.

> That seems quite inconsistent given that other classes
> are not allowed in string 2 when translating:
> $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
> tr: when translating, the only character classes that may appear in
> string2 are `upper' and `lower'
> For consistency I think it better to keep the classes
> in string 2 just for case mapping, and do something like:
> $ tr '[:upper:] ' '[:lower:]'
> tr: when not truncating set1, a character class can't be
> the last entity in string2

I'd rather see it phrased:
When string2 is shorter than string1, a character class can't be the
last entity in string2.
> Note BSD allows extending the above, but that's at least
> consistent with any class being allowed in string2.
> i.e. this is disallowed by coreutils but OK on BSD:
> $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
> 1A2

The BSD behavior violates an explicit POSIX wording; we can't do an
extension like that without either turning on a POSIXLY_CORRECT check or
adding a command line option, neither of which I think is necessary. So
I see no reason to copy the BSD behavior of allowing any character class.
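
For anyone who wants to check which side of this difference a given tr falls on, a small probe along these lines might do. It is only a sketch: tr_accepts is a hypothetical helper name, and it relies on tr exiting non-zero when it rejects its operands, which is what GNU tr does for the error message quoted earlier.

```shell
# Sketch: report whether the local tr accepts a given string1/string2 pair.
# $1 = string1, $2 = string2. Input is empty, so only operand validation matters.
tr_accepts() {
  printf '' | LC_ALL=C tr "$1" "$2" >/dev/null 2>&1
}

if tr_accepts ' ' '[:alpha:]'; then
  echo "class in string2 accepted (BSD-style extension)"
else
  echo "class in string2 rejected (as coreutils does)"
fi
```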

> Is it OK to change tr like this?
> I can't see anything depending on that.

Seems reasonable to me, once we decide on the error message wording.
--
Eric Blake address@hidden +1-801-349-2682
Libvirt virtualization library http://libvirt.org