bug-gnu-utils

gawk 3.0.98 documentation suggestion for range expressions


From: Paul Eggert
Subject: gawk 3.0.98 documentation suggestion for range expressions
Date: Wed, 16 May 2001 01:06:13 -0700 (PDT)

The Gawk 3.0.98 manual uses range expressions in a few places, but
these are not portable outside the C locale, and should therefore be
avoided in portable examples.  POSIX 1003.1-200x draft 6 says
that range expressions have unspecified behavior outside the C locale.
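
To illustrate the point, here is a minimal sketch (not part of the
patch) of a shell wrapper around an awk test that spells out the ten
standard digits instead of writing [0-9].  The function name is_count
is made up for this example, and LC_ALL=C is set only to make the
demonstration reproducible; the explicit list itself does not depend
on the locale.

```shell
#!/bin/sh
# is_count is a hypothetical helper, named only for this sketch.
# It prints "yes" if its argument looks like -N, where N consists
# solely of the ten standard digits, and "no" otherwise.
is_count() {
    printf '%s\n' "$1" |
    LC_ALL=C awk '{ print (($0 ~ /^-[0123456789]+$/) ? "yes" : "no") }'
}

is_count -15    # prints: yes
is_count -1x    # prints: no
is_count 15     # prints: no
```

Writing [[:digit:]] in the same place would instead accept whatever
the current locale classifies as a digit, which is why the patch
prefers the explicit list wherever a numeric option is parsed.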

Here is a proposed patch.


2001-05-16  Paul Eggert  <address@hidden>

        * gawk.texi:
        Say that range expressions have unspecified behavior outside the C
        locale, as per POSIX 1003.1-200x d6.

        Mention that [[:digit:]], [0-9], and [0123456789] are all different.

        Avoid range expressions in examples, so that the examples are
        still portable outside the C locale.

===================================================================
RCS file: gawk.texi,v
retrieving revision 3.0.98.1
retrieving revision 3.0.98.2
diff -pu -r3.0.98.1 -r3.0.98.2
--- gawk.texi   2001/05/16 02:58:02     3.0.98.1
+++ gawk.texi   2001/05/16 08:01:55     3.0.98.2
@@ -3186,10 +3186,14 @@ regular expressions.
 @section Using Character Lists
 
 Within a character list, a @dfn{range expression} consists of two
-characters separated by a hyphen.  It matches any single character that
-sorts between the two characters, using the locale's
-collating sequence and character set.  For example, in the default C
-locale, @samp{[a-dx-z]} is equivalent to @samp{[abcdxyz]}.  Many locales
+characters separated by a hyphen.  In the default C locale, it
+matches any single character that
+sorts between the two characters, using the C locale's
+collation sequence and character set.  For example,
+@samp{[a-dx-z]} is equivalent to @samp{[abcdxyz]}.  In other locales,
+range expressions have unspecified behavior and should be avoided.
+Thus @samp{[0-9]} is not portable outside the
+C locale, since it might match a non-digit.  Also, some locales
 sort characters in dictionary order, and in these locales,
 @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]}; instead it
 might be equivalent to @samp{[aBbCcDdxXyYz]}, for example.  To obtain
@@ -3303,6 +3307,15 @@ With the POSIX character classes, you ca
 @code{/[[:alnum:]]/} to match the alphabetic
 and numeric characters in your character set.
 
+Outside the C locale, @samp{[[:digit:]]} is not equivalent to
+@samp{[0-9]} (which has unspecified behavior, and should therefore be
+avoided); nor is it equivalent to @samp{[0123456789]} (which matches
+only the standard ten digits).  Typically, you should use
+@samp{[[:digit:]]} only when the application requires you to match any
+digit character (e.g., digit characters in identifiers), and you should
+use @samp{[0123456789]} when the application requires one of the
+standard ten digits.
+
 @cindex collating elements
 Two additional special sequences can appear in character lists.
 These apply to non-ASCII character sets, which can have single symbols
@@ -16597,7 +16610,7 @@ program:
 @c file eg/lib/readable.awk
 BEGIN @{
     for (i = 1; i < ARGC; i++) @{
-        if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=.*/ \
+        if (ARGV[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/ \
             || ARGV[i] == "-")
             continue    # assignment or standard input
         else if ((getline junk < ARGV[i]) < 0) # unreadable
@@ -16648,7 +16661,7 @@ a library file does the trick:
 function disable_assigns(argc, argv,    i)
 @{
     for (i = 1; i < argc; i++)
-        if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
+        if (argv[i] ~ /^[[:alpha:]_][[:alnum:]_]*=.*/)
             argv[i] = ("./" argv[i])
 @}
 
@@ -18554,7 +18567,7 @@ BEGIN @{
         usage()
 
     i = 1
-    if (ARGV[i] ~ /^-[0-9]+$/) @{
+    if (ARGV[i] ~ /^-[0123456789]+$/) @{
         count = -ARGV[i]
         ARGV[i] = ""
         i++
@@ -18873,7 +18886,7 @@ BEGIN   \
         else if (index("0123456789", c) != 0) @{
             # getopt requires args to options
             # this messes us up for things like -5
-            if (Optarg ~ /^[0-9]+$/)
+            if (Optarg ~ /^[0123456789]+$/)
                 fcount = (c Optarg) + 0
             else @{
                 fcount = c + 0
@@ -18883,7 +18896,7 @@ BEGIN   \
             usage()
     @}
 
-    if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
+    if (ARGV[Optind] ~ /^\+[0123456789]+$/) @{
         charcount = substr(ARGV[Optind], 2) + 0
         Optind++
     @}
@@ -19333,7 +19346,7 @@ BEGIN    \
         message = ARGV[2]
     @} else if (ARGC == 3) @{
         message = ARGV[2]
-    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
+    @} else if (ARGV[1] !~ /[0123456789]?[0123456789]:[0123456789][0123456789]/) @{
         print usage1 > "/dev/stderr"
         print usage2 > "/dev/stderr"
         exit 1
@@ -19429,14 +19442,13 @@ The system @command{tr} utility translit
 often used to map uppercase letters into lowercase for further processing:
 
 @example
-@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
+@var{generate data} | tr '[:upper:]' '[:lower:]' | @var{process data} @dots{}
 @end example
 
 @command{tr} requires two lists of characters.@footnote{On some older
 System V systems,
 @command{tr} may require that the lists be written as
-range expressions enclosed in square brackets (@samp{[a-z]}) and quoted,
-to prevent the shell from attempting a @value{FN} expansion.  This is
+range expressions enclosed in square brackets (@samp{[[:lower:]]}).  This is
 not a feature.}  When processing the input, the first character in the
 first list is replaced with the first character in the second list,
 the second character in the first list is replaced with the second
@@ -19505,7 +19517,7 @@ Finally, the processing rule simply call
 @c endfile
 @end ignore
 @c file eg/prog/translate.awk
-# Bugs: does not handle things like: tr A-Z a-z, it has
+# Bugs: does not handle things like: tr '[:upper:]' '[:lower:]', it has
 # to be spelled out. However, if `to' is shorter than `from',
 # the last character in `to' is used for the rest of `from'.
 


