Re: documentation bug re character range expressions

From: Greg Wooledge
Subject: Re: documentation bug re character range expressions
Date: Fri, 3 Jun 2011 13:03:32 -0400
User-agent: Mutt/

On Fri, Jun 03, 2011 at 09:12:07AM -0700, Marcel (Felix) Giannelia wrote:
> And yours looks broken -- how does
> echo Hello World | tr A-Z a-z
> result in a bunch of non-ASCII characters?

I explain it briefly at http://mywiki.wooledge.org/locale

In a bit more depth: in ASCII, the characters A-Z are

 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

while a-z are

 a b c d e f g h i j k l m n o p q r s t u v w x y z

So when you map from one to the other, you get what you expect.
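If you want to check the ASCII layout for yourself, here is a small sketch. It assumes an ASCII-compatible codeset and a POSIX printf; code_of is just a helper name I made up for the illustration.

```shell
# code_of is a hypothetical helper: print the numeric code of one character.
# POSIX printf treats an argument starting with a quote as "give me the
# code of the next character".
code_of() { printf '%d\n' "'$1"; }

upper=$(code_of A)
lower=$(code_of a)

# In ASCII the two alphabets are separate blocks a fixed distance apart.
echo "A=$upper a=$lower offset=$((lower - upper))"
# prints: A=65 a=97 offset=32
```

Every uppercase letter sits at the same distance from its lowercase partner, which is why a positional mapping between the two ranges lines up.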

In HP-UX's en_US.iso88591 locale, the characters are in a COMPLETELY
different order.  You can't easily figure out what that order is, because
it's not documented anywhere, but by using tricks you can beat it into
submission.  Instead of having two separate ranges from a to z, and from
A to Z, there's just one big range from A to z (actually þ) which looks
something like:

 A a Á á À à Â â Ä ä Å å Ã ã Æ æ B b C c Ç ç D d Ð ð E e É .... Z z Þ þ

So when you write A-Z you mean A a Á ... Z.  And a-z means a Á ... Z z.
In other words, when you tell tr to map from A-Z to a-z all you're
actually doing is shifting the map one position to the right.  So H
becomes h, e becomes É, l becomes M, and so on.  Whereas in ASCII,
mapping from A-Z to a-z shifts everything 32 to the right (the
difference between 'A' and 'a'), so H becomes h and so on.
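
The one-position shift can be imitated on any system with a rough sketch like the following. The collation order used here is an assumption: it is a simplified, ASCII-only interleaving (A a B b ... Z z) with the accented letters left out, not the real HP-UX order. The two sets are spelled out explicitly and run under LC_ALL=C so that tr does no range expansion of its own.

```shell
# Hypothetical, simplified collation order (no accented letters).
order='AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

set1=${order%z}   # what "A-Z" expands to in this order: A a B b ... Y y Z
set2=${order#A}   # what "a-z" expands to in this order: a B b ... Z z

# Map the i-th member of set1 to the i-th member of set2, as tr does.
shifted_tr() { LC_ALL=C tr "$set1" "$set2"; }

echo Hello World | shifted_tr
# prints: hFMMP wPSME
```

Each letter moves to its right-hand neighbour in the collation order: H becomes h, but e becomes F here (it would be É with the accented letters present), and l becomes M.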

The GNU people apparently didn't like this, so they've done odd things
that I do not fully understand -- witness your results on GNU/Linux
systems with GNU grep.  At first glance, those results seem good -- I
mean, it's *more intuitive* that you should be able to write A-C to
mean ABC, right?  Unfortunately, like many extensions that GNU puts in
their software, the result is a bunch of scripts that only work on
GNU systems.  Portability has been destroyed.  People don't even know
that an issue of portability *exists*, because they just assume that
all the other operating systems work the same way.

Just for kicks, we also have an IRIX box:

# uname -sr
IRIX64 6.5
# export LANG=en_US.ISO8859-15
# echo Hello World | tr A-Z a-z
hello world

So there's at least some precedent for the GNU implementation.  Though
I must admit I don't know IRIX all that well.

Finally, http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html
says of the range notation:

  In locales other than the POSIX locale, this construct has unspecified
  behavior.

So, you really should not be using it at all, unless you set LANG=C.
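
A minimal sketch of the two portable ways out, assuming nothing beyond a POSIX tr. The function names are made up for the example. I use LC_ALL rather than LANG in the first one because LC_ALL overrides LANG and any LC_COLLATE that may already be set in the environment.

```shell
# Option 1: force the POSIX locale for this one command, so that A-Z and
# a-z are the plain ASCII ranges.
to_lower_c() { LC_ALL=C tr A-Z a-z; }

# Option 2: avoid ranges entirely and use the character classes, which
# are defined by the locale's case mapping rather than by collation order.
to_lower_class() { tr '[:upper:]' '[:lower:]'; }

echo Hello World | to_lower_c
echo Hello World | to_lower_class
# both print: hello world
```

The first form only touches the ASCII letters. The second follows whatever the current locale considers upper and lower case, which is usually what you want for text that is not plain ASCII.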
