bug-gnu-utils
Patch to fix /[A-Z]/ and internationalization bug


From: Sam Trenholme
Subject: Patch to fix /[A-Z]/ and internationalization bug
Date: Thu, 2 Nov 2006 18:46:59 -0600 (CST)

Arnold (or whoever),

I am willing to sign over this patch that I feel is
the best solution to the problems with /[A-Z]/ and
case sensitivity in non-C locales.

I am also willing to make changes to this patch so we
can make it a part of Awk.

- Sam

This patch keeps regular expression ranges, such as /[A-Z]/,
from breaking when internationalization is enabled.  The problem is
this: many non-C locales collate uppercase letters next to their
lowercase counterparts.  While this results in a more sensible
"ls" output, it also breaks scripts that assume /[A-Z]/ matches
only upper case and /[a-z]/ matches only lower case.

The patch works around this issue by using traditional ASCII
ordering when both endpoints of a range are ASCII characters.
If either endpoint is not ASCII, as in /[Á-Z]/ (where the first
character of the range is an A with an acute accent), the code
uses the wcscoll() routine to determine the range.
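
To illustrate the rule outside of dfa.c, here is a minimal standalone
sketch.  The helper name in_range() and its signature are hypothetical
and exist only for this example; the actual change is the hunk in the
patch below, which works on wcbuf inside the existing matcher.

```c
#include <wchar.h>

/* Hypothetical helper, not part of dfa.c: returns 1 if c falls in the
 * bracket range [lo-hi], 0 otherwise. */
static int in_range(wchar_t c, wchar_t lo, wchar_t hi)
{
    /* wcscoll() compares strings, so wrap each character in a
     * NUL-terminated one-character wide string. */
    wchar_t cs[2]  = { c,  L'\0' };
    wchar_t los[2] = { lo, L'\0' };
    wchar_t his[2] = { hi, L'\0' };

    /* Both endpoints ASCII: compare code points directly, so the
     * result does not depend on the current locale. */
    if (lo < 128 && hi < 128)
        return c >= lo && c <= hi;

    /* Otherwise fall back to the locale's collation order. */
    return wcscoll(cs, los) >= 0 && wcscoll(his, cs) >= 0;
}
```

With this sketch, in_range(L'm', L'A', L'Z') is 0 in any locale, because
the ASCII branch never consults the collation tables.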

Some issues before this can become a part of Gawk:

1) I have to sign the paperwork assigning copyright to the FSF.
   For legal reasons, I have to physically sign a paper and give it
   to them.

2) This may break on non-ASCII systems (as I recall, Gawk still has
   support for non-ASCII systems).

3) Maybe have an environment variable which re-enables the old Gawk
   behavior.  I'll have to use a static variable so we don't do an
   expensive getenv() call every time we look at a character.
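
For item 3, the static-variable idea could look roughly like the
sketch below.  This is not in the patch: the function name and the
variable name GAWK_LOCALE_RANGES are made up for illustration, and the
real name would have to be agreed on.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the user asked for the old,
 * locale-based range behavior, 0 otherwise. */
static int use_locale_ranges(void)
{
    /* -1 means "not looked up yet"; getenv() runs only on the
     * first call and the answer is reused afterwards. */
    static int cached = -1;

    if (cached < 0) {
        /* GAWK_LOCALE_RANGES is an assumed name, not a real Gawk
         * environment variable. */
        const char *v = getenv("GAWK_LOCALE_RANGES");
        cached = (v != NULL && *v != '\0');
    }
    return cached;
}
```

The range check would then take the ASCII shortcut only when
use_locale_ranges() returns 0.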

- Sam

*** gawk-3.1.5/dfa.c.orig       2005-07-26 13:07:43.000000000 -0500
--- gawk-3.1.5/dfa.c    2006-11-02 15:32:41.000000000 -0600
***************
*** 2638,2646 ****
        wcbuf[2] = work_mbc->range_sts[i];
        wcbuf[4] = work_mbc->range_ends[i];
  
!       if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
!         wcscoll(wcbuf+4, wcbuf) >= 0)
!       goto charset_matched;
      }
  
    /* match with a character?  */
--- 2638,2663 ----
        wcbuf[2] = work_mbc->range_sts[i];
        wcbuf[4] = work_mbc->range_ends[i];
  
!       /* If both characters are ASCII characters, we use the ASCII
!        * ordering of the characters to determine the range.  This way,
!        * i18n doesn't break regexes like /[A-Z]/ (which is supposed to 
!        * mean "upper case only", and should never match lower-case) */
!       if (wcbuf[2] < 128 && wcbuf[4] < 128) 
!       {
!       if (wcbuf[0] >= wcbuf[2] &&
!           wcbuf[4] >= wcbuf[0]) 
!         {
!           goto charset_matched;
!         }
!       }
!       else 
!       {
!       if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
!           wcscoll(wcbuf+4, wcbuf) >= 0) 
!         {
!           goto charset_matched;
!         }
!       }
      }
  
    /* match with a character?  */
*** gawk-3.1.5/doc/gawk.texi.orig       2006-11-02 15:40:43.000000000 -0600
--- gawk-3.1.5/doc/gawk.texi    2006-11-02 16:26:02.000000000 -0600
***************
*** 3830,3876 ****
  @section Where You Are Makes A Difference
  
  Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language.  The current
! locale setting can affect the way regexp matching works, often
! in surprising ways.  In particular, many locales do case-insensitive
! matching, even when you may have specified characters of only
! one particular case.
! 
! The following example uses the @code{sub} function, which
! does text replacement
! (@pxref{String Functions}).
! Here, the intent is to remove trailing uppercase characters:
! 
! @example
! $ echo something1234abc | gawk '@{ sub("[A-Z]*$", ""); print @}'
! @print{} something1234
! @end example
! 
! @noindent
! This output is unexpected, since the @samp{abc} at the end of @samp{something1234abc}
! should not normally match @samp{[A-Z]*}.  This result is due to the
! locale setting (and thus you may not see it on your system).
! There are two fixes.  The first is to use the POSIX character
! class @samp{[[:upper:]]}, instead of @samp{[A-Z]}.
! The second is to change the locale setting in the environment,
! before running @command{gawk},
! by using the shell statements:
! 
! @example
! LANG=C LC_ALL=C
! export LANG LC_ALL
! @end example
! 
! The setting @samp{C} forces @command{gawk} to behave in the traditional
! Unix manner, where case distinctions do matter.
! You may wish to put these statements into your shell startup file,
! e.g., @file{$HOME/.profile}.
! 
! Similar considerations apply to other ranges.  For example,
! @samp{["-/]} is perfectly valid in ASCII, but is not valid in many
! Unicode locales, such as @samp{en_US.UTF-8}.  (In general, such
! ranges should be avoided; either list the characters individually,
! or use a POSIX character class such as @samp{[[:punct:]]}.)
  
  For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
  For other single byte record separators, using @samp{LC_ALL=C} will give you
--- 3830,3858 ----
  @section Where You Are Makes A Difference
  
  Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language.  In particular, 
! many locales do case-insensitive matching, even when you may have 
! specified characters of only one particular case.
! 
! In order to be compatible with traditional AWK scripts that
! assume an ASCII ordering of letters, if both characters in a 
! regular expression range, such as @samp{[A-Z]}, are ASCII, Gawk will 
! use ASCII ordering to determine the characters in the range.  This, in
! particular, preserves the case sensitivity that 
! traditional AWK scripts have utilized.
! 
! This behavior is different than the behavior in earlier versions of
! Gawk.  In earlier versions of Gawk, the current locale always 
! determined what characters to put in a regular expression
! range.  This behavior gave surprising results: Previously case-sensitive
! character ranges became case-insensitive, breaking AWK scripts.
! 
! One consequence of this change is that @samp{[A-Za-z]} no longer
! matches accented letters in non-English locales.  If this behavior
! is needed, use the POSIX character class @samp{[[:alpha:]]}, which
! matches all alphabetic characters.  Another option is to use an accented
! character in the regular expression range, which will reinstate
! Gawk's older behavior.
  
  For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
  For other single byte record separators, using @samp{LC_ALL=C} will give you
