Re: documentation bug re character range expressions

From: Chet Ramey
Subject: Re: documentation bug re character range expressions
Date: Thu, 09 Jun 2011 14:31:10 -0400

On 6/8/11 5:45 PM, Marcel (Felix) Giannelia wrote:
> On 07/06/11 13:45, Chet Ramey wrote:
>> [...]
>> I'm not going to add much to this discussion except to note that I believe
>> `sorts' is correct.  Consider the following script:
>> export LC_COLLATE=de_DE.UTF-8
>> printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
>> echo
> That's really interesting -- and not just your intended point, but what
> happens with those ranges if you take 'sort' out of the pipe. The curly
> brace {A..Z} syntax doesn't obey the locale! Observe:

No, it doesn't.  It's not part of any standard, and it's not part of
pattern matching, so I implemented it with the traditional C semantics
because that seemed the most straightforward.

I'd also argue that it's not really feasible to implement it any other
way, since there's no standard way to enumerate a collating sequence
from C using POSIX interfaces.
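
The point above can be checked with a small sketch. It assumes bash is
on the PATH; the helper name expand_range is hypothetical, and
de_DE.UTF-8 is only an example locale that may not be installed (the
setlocale warning is discarded in that case). The expansion should come
out in C character order either way:

```shell
# Hypothetical helper: run a brace expansion in a child bash started
# with the given locale, so the caller's locale is left untouched.
expand_range() {
    # 2>/dev/null hides the setlocale warning if the locale is missing.
    LC_ALL="$1" bash -c 'echo {a..e} {X..Z}' 2>/dev/null
}

echo "C:     $(expand_range C)"
echo "de_DE: $(expand_range de_DE.UTF-8)"
```

Both lines should show the same sequence, since brace expansion never
consults the collating sequence.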

> $ printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
> a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S
> t T u U v V w W x X y Y z Z
> (as you expect, but...)
> $ printf "%s\n" {A..Z} {a..z} | tr $'\n' ' '
> A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l
> m n o p q r s t u v w x y z
> So, if I want C-like behaviour out of "[a-z]*", I can write it as
> "{a..z}*"? Is that a bug or a feature?

Neither.  They are two different features.

>> [...]
>> That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
>> and en_GB.UTF-8.
> Not in a case like that, with single-character strings. But my point was
> that it's possible for 'C' to sort between 'a' and 'c' in longer strings.

True.  However, a bracket expression matches only a single character.

> I realize it's pedantic, but documentation should be pedantically accurate
> :)  I would be OK with changing the man page so it says, "sorts between
> those two characters in a list of single-character strings", as that would
> also describe the current behaviour.

But it only matches a single character, by definition.  It should not be
necessary to specify the list of single-character strings part.
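
A minimal sketch of the single-character point, using a case pattern
(the function name matches_a_to_c is made up for illustration). The
inputs are chosen so that the result does not depend on the locale:

```shell
# Hypothetical helper: report whether the whole argument matches the
# pattern [a-c], which can only ever match exactly one character.
matches_a_to_c() {
    case "$1" in
        [a-c]) echo yes ;;
        *)     echo no ;;
    esac
}

matches_a_to_c b     # one character inside the range
matches_a_to_c ab    # two characters: the pattern cannot match
```

How a longer string such as "aC" sorts relative to other strings never
comes into play, because the pattern is only ever compared against one
character.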

> "Within a bracket expression, a range expression consists of two characters
> separated by a hyphen. It matches any single character that sorts between
> the two characters, inclusive, using the locale's collating sequence and
> character set. For example, in the default C locale, [a-d] is equivalent to
> [abcd]. Many locales sort characters in dictionary order, and in these
> locales [a-d] is typically not equivalent to [abcd]; it might be equivalent
> to [aBbCcDd], for example. To obtain the traditional interpretation of
> bracket expressions, you can use the C locale by setting the LC_ALL
> environment variable to the value C."

The bash texinfo documentation says just about the same thing.
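
As a sketch of the last sentence of that quoted text, the following
forces the C locale for a single match. It assumes bash is on the PATH,
and the function name matches_a_to_d_in_c is hypothetical:

```shell
# Hypothetical helper: test one argument against [a-d] in a child bash
# started with LC_ALL=C, where the range is equivalent to [abcd].
matches_a_to_d_in_c() {
    LC_ALL=C bash -c '
        case "$1" in
            [a-d]) echo match ;;
            *)     echo "no match" ;;
        esac
    ' _ "$1"
}

matches_a_to_d_in_c b    # lowercase b lies between a and d
matches_a_to_d_in_c B    # uppercase B lies outside a-d in the C locale
```

In a locale that collates in dictionary order, the second call might
match instead, which is the behaviour the quoted text describes.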

``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    address@hidden    http://cnswww.cns.cwru.edu/~chet/
