bug#31526: Range [a-z] does not follow collate order from locale.


From: Assaf Gordon
Subject: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Sat, 19 May 2018 20:13:00 -0600
User-agent: NeoMutt/20170113 (1.7.2)

tag 31526 notabug
close 31526
thanks

Hello,

On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> With a locale set to en_US.utf8 it is expected that the collating order is
> this:
> 
>     $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
>     `^~<=>| _-,;:!?/.'"()address@hidden&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

While in practice this is correct on all GNU/Linux systems that
use glibc, there is no officially documented collation order for
punctuation marks, so it might differ on other systems. Please see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

> It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> upper letters.
> But it isn't:

It should not be "expected". I don't think it is documented to be
so anywhere in GNU programs. Both sed's and grep's manuals contain
the following text:

    In other locales, the sorting sequence is not specified, and ‘[a-d]’
    might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
    match any character, or the set of characters that it matches might
    even be erratic.

https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
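
Since the range order is unspecified outside the C locale, one way to
avoid depending on it is to use a POSIX character class instead of a
range. The following is only a minimal sketch, assuming ASCII input;
the helper name only_lower is made up for illustration:

```shell
# Hypothetical helper: keep only lowercase letters from the argument.
# [[:lower:]] is a character class, so it does not depend on how the
# locale orders the characters between 'a' and 'z'.
only_lower() {
    printf '%s\n' "$1" | sed 's/[^[:lower:]]//g'
}

only_lower 'aAbB-cC'    # prints: abc
```

In a non-C locale, [[:lower:]] may also match non-ASCII lowercase
letters, which may or may not be what you want.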

Furthermore, in the POSIX 2008 standard, range expressions are
undefined for locales other than "C/POSIX"; see this comment by Eric Blake
(the entire bug report might also be of interest on this topic):
https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24

> However, the range [a-Z] does match all letters, lower or upper:
> 
>     $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
>     ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

I would recommend against mixing upper and lower case in regex
ranges, as the result might be unexpected. Compare the following:

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[a-Z]/p'
  [[ no output, no failure ]]

  $ echo '[' | LC_ALL=C sed -n '/[a-Z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[A-z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=C sed -n '/[A-z]/p'
  [
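
If the goal is to match every ASCII letter, a safer spelling is two
same-case ranges evaluated in the C locale, where ranges follow ASCII
order. This is a sketch rather than the only option; the function name
letters_only is hypothetical:

```shell
# Hypothetical helper: delete everything except ASCII letters from stdin.
# LC_ALL=C forces the traditional interpretation of [A-Z] and [a-z].
letters_only() {
    LC_ALL=C sed 's/[^A-Za-z]//g'
}

printf '%s\n' 'a[B]c_D' | letters_only    # prints: aBcD
```

Unlike [A-z] in the C locale, this does not also match the punctuation
characters that sit between 'Z' and 'a' in ASCII, such as '['.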


> If this is the correct way in which sed should work, then, if you please:

Yes, it is.

>     - What is the rationale leading to such decision?.

The bug reports linked above contain long discussions about it.

Please also see the following thread, which promoted restricting
ranges to "sane regex ranges", meaning ASCII order alone (this applies to
gawk, grep, sed and other programs using gnulib's regex engine):

https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

>     - Where is it documented?.

See the links above to the sed and grep manuals.

>     - Where is it implemented in the code?.

I think a good place to start is gnulib's DFA regex engine,
here:
https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
or here:
http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

Search for the comment 'build range characters' for a starting point.

Both GNU grep and GNU sed use this code.

>     - Why does the manual document otherwise?.

Errors in the manual are always a possibility.
If you spot such an error, or an example showing incorrect
usage or output, please let us know where it is (e.g. a link
to a manual page or section).

As such, I'm marking this as "not a bug" and closing the ticket,
but discussion can continue by replying to this thread.

regards,
 - assaf





