Re: [coreutils] tr: case mapping anomaly

From:	Eric Blake
Subject:	Re: [coreutils] tr: case mapping anomaly
Date:	Fri, 24 Sep 2010 17:22:34 -0600
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/24/2010 04:47 PM, Pádraig Brady wrote:

I was just looking at a bug reported to fedora there where this abort()s

  $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

Behavior is already unspecified by POSIX when string1 is longer than string2. But given what POSIX does say:

"When both the -d and -s options are specified, any of the character class names shall be accepted in string2. Otherwise, only character class names lower or upper are valid in string2 and then only if the corresponding character class (upper and lower, respectively) is specified in the same relative position in string1. Such a specification shall be interpreted as a request for case conversion. When [:lower:] appears in string1 and [:upper:] appears in string2, the arrays shall contain the characters from the toupper mapping in the LC_CTYPE category of the current locale. When [:upper:] appears in string1 and [:lower:] appears in string2, the arrays shall contain the characters from the tolower mapping in the LC_CTYPE category of the current locale. The first character from each mapping pair shall be in the array for string1 and the second character from each mapping pair shall be in the array for string2 in the same relative position.

Except for case conversion, the characters specified by a character class expression shall be placed in the array in an unspecified order.

...

However, in a case conversion, as described previously, such as:

tr -s '[:upper:]' '[:lower:]'

the last operand's array shall contain only those characters defined as the second characters in each of the toupper or tolower character pairs, as appropriate."

I interpret this to mean that even though there are 59 lower and 56 upper in en_US, there are a fixed number of toupper case-mapping pairs, and there are likewise a fixed number of tolower case-mapping pairs. Therefore, [:upper:] and [:lower:] expand to the same number of array entries (whether that is 59 pairs or 56 pairs is irrelevant), and mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert space to underscore and also guarantee that no lower-case letter becomes an underscore.
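
As a minimal sketch of that requirement, assuming a POSIX shell and the C locale (chosen here only so the result does not depend on the en_US tables; the variable name is illustrative):

```shell
# Assumption: LC_ALL=C, where [:lower:] and [:upper:] are exactly a-z and A-Z.
# The case classes sit in the same relative position in both strings, so the
# trailing ' ' in string1 lines up with the trailing '_' in string2.
upper=$(printf 'hello world\n' | LC_ALL=C tr '[:lower:] ' '[:upper:]_')
printf '%s\n' "$upper"
```

Under those assumptions every lower-case letter is upcased and only the space becomes an underscore, which is the unambiguous pairing the POSIX text asks for.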

Your question is basically what should we do on the unspecified behavior of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since that falls outside the bounds of POSIX.

I.E. 0xDE (the last upper char) is output from:

  $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'

That matches the behavior we choose in all other instances where string1 is longer than string2, where GNU tr follows BSD behavior of padding the last character of string2 to meet the length of string1.

But, since POSIX is clear that the order of [:upper:] mappings is unspecified, I agree that it is not a good guarantee to the user of which byte gets duplicated to fill out the conversion, and we are better off rejecting that attempted usage.
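
The padding rule is easiest to see with plain characters, where the last element of string2 is well defined. A minimal sketch, assuming GNU tr (or another tr that pads the way BSD does) and the C locale; the variable name is illustrative:

```shell
# Assumption: a tr that pads string2 with its last character (GNU/BSD
# behavior; POSIX leaves this case unspecified).
# string1 has four characters and string2 has two, so 'y' is repeated
# to cover 'c' and 'd'.
padded=$(printf 'abcd\n' | LC_ALL=C tr 'abcd' 'xy')
printf '%s\n' "$padded"
```

With literal characters the duplicated byte is predictable. When string2 ends in a character class whose order is unspecified, the byte that gets duplicated is not, which is the case against accepting that usage.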


That seems quite inconsistent given that other classes
are not allowed in string 2 when translating:

  $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
  tr: when translating, the only character classes that may appear in
  string2 are `upper' and `lower'

For consistency I think it better to keep the classes
in string 2 just for case mapping, and do something like:

  $ tr '[:upper:] ' '[:lower:]'
  tr: when not truncating set1, a character class can't be
  the last entity in string2


I'd rather see it phrased:

When string2 is shorter than string1, a character class can't be the last entity in string2.


Note BSD allows extending the above, but that's at least
consistent with any class being allowed in string2.
I.E. this is disallowed by coreutils but Ok on BSD:

  $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
  1A2

The BSD behavior violates an explicit POSIX wording; we can't do an extension like that without either turning on a POSIXLY_CORRECT check or adding a command line option, neither of which I think is necessary. So I see no reason to copy the BSD behavior of allowing any character class.


Is it OK to change tr like this?
I can't see anything depending on that.


Seems reasonable to me, once we decide on the error message wording.

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
