Re: [coreutils] tr: case mapping anomaly
From: Pádraig Brady
Subject: Re: [coreutils] tr: case mapping anomaly
Date: Sat, 25 Sep 2010 07:52:39 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 25/09/10 00:22, Eric Blake wrote:
> On 09/24/2010 04:47 PM, Pádraig Brady wrote:
>> I was just looking at a bug reported to Fedora, where this abort()s:
>>
>> $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'
>
> Behavior is already unspecified by POSIX when string1 is longer than
> string2. But given what POSIX does say:
>
> "When both the -d and -s options are specified, any of the character
> class names shall be accepted in string2. Otherwise, only character
> class names lower or upper are valid in string2 and then only if the
> corresponding character class (upper and lower, respectively) is
> specified in the same relative position in string1. Such a specification
> shall be interpreted as a request for case conversion. When [:lower:]
> appears in string1 and [:upper:] appears in string2, the arrays shall
> contain the characters from the toupper mapping in the LC_CTYPE category
> of the current locale. When [:upper:] appears in string1 and [:lower:]
> appears in string2, the arrays shall contain the characters from the
> tolower mapping in the LC_CTYPE category of the current locale. The
> first character from each mapping pair shall be in the array for string1
> and the second character from each mapping pair shall be in the array
> for string2 in the same relative position.
>
> Except for case conversion, the characters specified by a character
> class expression shall be placed in the array in an unspecified order.
> ...
>
> However, in a case conversion, as described previously, such as:
>
> tr -s '[:upper:]' '[:lower:]'
>
> the last operand's array shall contain only those characters defined as
> the second characters in each of the toupper or tolower character pairs,
> as appropriate."
>
>
>
> I interpret this to mean that even though there are 59 lower and 56
> upper in en_US, there are a fixed number of toupper case-mapping pairs,
> and there are likewise a fixed number of tolower case-mapping pairs.
> Therefore, [:upper:] and [:lower:] expand to the same number of array
> entries (whether that is 59 pairs or 56 pairs is irrelevant), and
> mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert
> space to underscore and also guarantee that no lower-case letter becomes
> an underscore.
Thanks for digging up the relevant POSIX bits.
Yes, I agree that '[:lower:]' '[:upper:]' should
be treated as a unit and not leak into adjacent elements.
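
As a concrete instance of the well-defined case above, here is a minimal
sketch. It assumes LC_ALL=C, so that both classes cover only the 26 ASCII
letters, and the function name is made up for illustration.

```shell
#!/bin/sh
# Case-conversion class pair followed by one extra mapping.
# LC_ALL=C pins the locale (an assumption of this sketch) so that
# [:lower:] and [:upper:] each expand to exactly the ASCII letters.
lower_to_upper_space_to_underscore() {
    LC_ALL=C tr '[:lower:] ' '[:upper:]_'
}

# The class pair is consumed as a unit, so the trailing space in
# string1 lines up with the underscore in string2.
printf 'ab cd\n' | lower_to_upper_space_to_underscore
```

With the pair treated as a unit this should print AB_CD: every lower-case
letter is upper-cased, and only the space becomes an underscore.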
>
> Your question is basically what should we do on the unspecified behavior
> of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since
> that falls outside the bounds of POSIX.
>
>> I.E. 0xDE (the last upper char) is output from:
>>
>> $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
>
> That matches the behavior we choose in all other instances where string1
> is longer than string2, where GNU tr follows BSD behavior of padding the
> last character of string2 to meet the length of string1.
>
> But, since POSIX is clear that the order of [:upper:] mappings is
> unspecified, I agree that it is not a good guarantee to the user of
> which byte gets duplicated to fill out the conversion, and we are better
> off rejecting that attempted usage.
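
For comparison, the padding described above is easy to see with explicit
characters, where the last character of string2 is known in advance. A
small sketch follows; the function name is hypothetical, and since POSIX
leaves this case unspecified, an implementation that truncates string1
instead would print xbc.

```shell
#!/bin/sh
# string1 has three characters, string2 only one.
# GNU and BSD tr pad string2 by repeating its last character ('x'),
# so a, b and c are all expected to map to x. POSIX does not specify
# this behaviour.
pad_demo() {
    printf 'abc\n' | LC_ALL=C tr 'abc' 'x'
}

pad_demo
```

With a class as the last entity of string2 there is no such known last
character, which is the problem being discussed.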
>
>>
>> That seems quite inconsistent given that other classes
>> are not allowed in string 2 when translating:
>>
>> $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>> tr: when translating, the only character classes that may appear in
>> string2 are `upper' and `lower'
>>
>> For consistency I think it better to keep the classes
>> in string 2 just for case mapping, and do something like:
>>
>> $ tr '[:upper:] ' '[:lower:]'
>> tr: when not truncating set1, a character class can't be
>> the last entity in string2
>
> I'd rather see it phrased:
>
> When string2 is shorter than string1, a character class can't be the
> last entity in string2.
OK. That is a bit clearer.
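
To check how a given tr treats this usage, something like the following
probe could be used. It is only a sketch: the function name is made up,
LC_ALL=C is an assumption to keep the classes to ASCII, and it looks at
the exit status only, since the exact message wording may differ.

```shell
#!/bin/sh
# Probe whether the installed tr accepts a character class as the last
# entity of string2 when string2 is shorter than string1.
# Prints "rejected" if tr exits non-zero, "accepted" otherwise.
probe_short_string2_class() {
    if printf '_ _\n' | LC_ALL=C tr '[:lower:] ' '[:upper:]' >/dev/null 2>&1
    then
        echo accepted
    else
        echo rejected
    fi
}

probe_short_string2_class
```

A tr with the proposed change would be expected to report "rejected";
one that pads string2 would report "accepted".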
>> Note BSD allows extending the above, but that's at least
>> consistent with any class being allowed in string2.
>> I.e., this is disallowed by coreutils but OK on BSD:
>>
>> $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>> 1A2
>
> The BSD behavior violates an explicit POSIX wording; we can't do an
> extension like that without either turning on a POSIXLY_CORRECT check or
> adding a command line option, neither of which I think is necessary. So
> I see no reason to copy the BSD behavior of allowing any character class.
Yes, I agree. I was just pointing out what BSD does here.
>> Is it OK to change tr like this?
>> I can't see anything depending on that.
>
> Seems reasonable to me, once we decide on the error message wording.
Great, I'll change it as above.
cheers,
Pádraig.