Re: [coreutils] tr: case mapping anomaly


From: Jim Meyering
Subject: Re: [coreutils] tr: case mapping anomaly
Date: Sat, 25 Sep 2010 08:53:45 +0200

Eric Blake wrote:
> On 09/24/2010 04:47 PM, Pádraig Brady wrote:
>> I was just looking at a bug reported to Fedora, where this abort()s:
>>
>>   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

Ouch!  Thanks for reporting it here.
How many more bugs lurk in tr...
Consolation: this one is a failure to diagnose invalid inputs.

...
> I interpret this to mean that even though there are 59 lower and 56
> upper in en_US, there are a fixed number of toupper case-mapping
> pairs, and there are likewise a fixed number of tolower case-mapping
> pairs. Therefore, [:upper:] and [:lower:] expand to the same number of
> array entries (whether that is 59 pairs or 56 pairs is irrelevant),
> and mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously
> convert space to underscore and also guarantee that no lower-case
> letter becomes an underscore.
>
> Your question is basically what should we do on the unspecified
> behavior of '[:lower:] ' '[:upper:]', where string1 is longer than
> string2, since that falls outside the bounds of POSIX.

Right.
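
To make the distinction concrete, here is a minimal sketch of the two
situations. It assumes the C locale (chosen here only to keep the output
predictable; the thread itself uses en_US) and a tr that pads string2 the
way GNU and BSD do:

```shell
# Specified by POSIX: [:lower:] pairs up with [:upper:], so the
# trailing space can only map to the trailing underscore.
mapped=$(printf 'a b\n' | LC_ALL=C tr '[:lower:] ' '[:upper:]_')
echo "$mapped"    # A_B

# Unspecified by POSIX: string1 is longer than string2.
# GNU and BSD tr pad string2 with its last character ('y' here).
padded=$(printf 'abcd\n' | LC_ALL=C tr 'abcd' 'xy')
echo "$padded"    # xyyy with GNU or BSD tr
```

With plain characters the padding byte is obvious from the command line.
With a class as the last entity of string2, it depends on the unspecified
expansion order of that class.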

>> I.e., 0xDE (the last upper-case character) is output from:
>>
>>   $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
>
> That matches the behavior we choose in all other instances where
> string1 is longer than string2, where GNU tr follows BSD behavior of
> padding the last character of string2 to meet the length of string1.
>
> But, since POSIX is clear that the order of [:upper:] mappings is
> unspecified, I agree that it is not a good guarantee to the user of
> which byte gets duplicated to fill out the conversion, and we are
> better off rejecting that attempted usage.
>
>>
>> That seems quite inconsistent given that other classes
>> are not allowed in string 2 when translating:
>>
>>   $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>>   tr: when translating, the only character classes that may appear in
>>   string2 are `upper' and `lower'
>>
>> For consistency I think it better to keep the classes
>> in string 2 just for case mapping, and do something like:
>>
>>   $ tr '[:upper:] ' '[:lower:]'
>>   tr: when not truncating set1, a character class can't be
>>   the last entity in string2
>
> I'd rather see it phrased:
>
> When string2 is shorter than string1, a character class can't be the
> last entity in string2.

Thanks, I find it easier to read when string1 and string2 are
listed in order -- and this applies only when translating.
How about this?

    When translating with string1 longer than string2,
    the latter string must not end with a character class.
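
As a rough way to check for the proposed behavior, the sketch below runs
the problematic invocation and records whether tr rejects it. It assumes
a tr that already includes the check discussed above; the C locale and
the variable names are illustrative choices, not part of the proposal:

```shell
# string1 ('[:upper:] ') is one entity longer than string2 ('[:lower:]'),
# and string2 ends with a character class, so a tr with the proposed
# check should print a diagnostic and exit nonzero.
out=$(printf 'A B\n' | LC_ALL=C tr '[:upper:] ' '[:lower:]' 2>&1)
status=$?

if [ "$status" -ne 0 ]; then
    echo "rejected: $out"
else
    echo "accepted (no such check in this tr): $out"
fi
```

An older tr without the check may instead pad string2 or abort, which is
why the sketch reports both outcomes rather than assuming one.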

>> Note BSD allows extending the above, but that's at least
>> consistent with any class being allowed in string2.
>> I.e., this is disallowed by coreutils but OK on BSD:
>>
>>   $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>>   1A2
>
> The BSD behavior violates an explicit POSIX wording; we can't do an
> extension like that without either turning on a POSIXLY_CORRECT check
> or adding a command line option, neither of which I think is
> necessary.  So I see no reason to copy the BSD behavior of allowing
> any character class.

Yes.  I deliberately opted not to provide the BSD behavior,
because it cannot be portable.

>> Is it OK to change tr like this?
>> I can't see anything depending on that.
>
> Seems reasonable to me, once we decide on the error message wording.

Yes.  Thanks for bringing this up and dealing with it.


