Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?

From: Paolo Bonzini
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Fri, 28 Jun 2013 15:05:29 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6

Il 28/06/2013 14:49, Eli Zaretskii ha scritto:
> > > When being consistent means being buggy, I don't want the consistency.
> > > I want the bug solved in all the programs I use, but if it takes time
> > > to do that, I will be glad in the meantime to use some programs that
> > > don't have that bug, i.e. are "inconsistent".
> > 
> > I will be less glad to move a regex or piece of code from one to
> > another, and find inconsistency.
> You should report a bug in that case.

In the case of sed, I'll gladly direct the reporter to the "Non-bugs"
section of the manual.  Which also explains why you should anyway use

`[a-z]' is case insensitive
`s/.*//' does not clear pattern space

  You are encountering problems with locales.  POSIX mandates that `[a-z]'
  uses the current locale's collation order -- in C parlance, that means
  strcoll(3) instead of strcmp(3).  Some locales have a case insensitive
  strcoll, others don't.

  Another problem is that [a-z] tries to use collation symbols.  This
  only happens if you are on the GNU system, using GNU libc's regular
  expression matcher instead of compiling the one supplied with GNU sed.
  In a Danish locale, for example, the regular expression `^[a-z]$'
  matches the string `aa', because `aa' is a single collating symbol that
  comes after `a' and before `b'; `ll' behaves similarly in Spanish
  locales, or `ij' in Dutch locales.

  Another common localization-related problem happens if your input stream
  includes invalid multibyte sequences.  POSIX mandates that such
  sequences are _not_ matched by `.', so that `s/.*//' will not clear
  pattern space as you would expect.  In fact, there is no way to clear
  sed's buffers in the middle of the script in most multibyte locales
  (including UTF-8 locales).  For this reason, GNU sed provides a `z'
  command (for `zap') as an extension.
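
As a small illustration of the `z' command described above (a sketch only, assuming a GNU sed recent enough to provide `z'; the sample input is made up for the example), the line below contains the byte 0xFF, which is not a valid sequence in UTF-8:

```shell
# Input line contains byte 0xFF (octal 377), which is not valid UTF-8.
# `z' empties the pattern space without going through the regex matcher,
# so the result should not depend on the current locale.
zapped=$(printf 'a\377b\n' | sed 'z')

# Show what is left of the line between brackets.
printf '[%s]\n' "$zapped"
```

With `s/.*//' in place of `z', whether the line is emptied depends on the locale in effect, which is the problem the excerpt describes.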

  However, to work around both of these problems, which may cause bugs
  in shell scripts, you can set the LC_ALL environment variable to `C',
  or set the locale on a more fine-grained basis with the other LC_*
  environment variables.
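
The LC_ALL workaround can be sketched as follows. This is only an illustration with made-up input; it sets LC_ALL for the single sed invocation rather than for the whole script, which is one possible choice among several:

```shell
# Force the C locale for this sed invocation only.  In the C locale,
# ranges follow byte order, so `[a-z]' covers just the lowercase
# ASCII letters and the uppercase lines are not printed.
lower_only=$(printf 'a\nB\nz\nZ\n' | LC_ALL=C sed -n '/^[a-z]$/p')
printf '%s\n' "$lower_only"

# In the C locale every byte is a character, so `.' also matches the
# byte 0xFF (octal 377), and `s/.*//' empties the line.
cleared=$(printf 'a\377b\n' | LC_ALL=C sed 's/.*//')
printf '[%s]\n' "$cleared"
```

Putting `LC_ALL=C; export LC_ALL' at the top of a script has the same effect for every command in it, at the cost of also changing message language and sorting for the other tools the script runs.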

