Re: [bug-gawk] print statement problem with hexadecimal escape sequence


From: Nelson H. F. Beebe
Subject: Re: [bug-gawk] print statement problem with hexadecimal escape sequence
Date: Thu, 3 May 2018 15:28:19 -0600

> > $ awk 'BEGIN { print "\x41\x42\x43FFF" }'  | od -a
> > 0000000   A   B del  nl
> > 0000004

A test on my numerous versions of gawk shows that the behavior changed
between gawk-4.1.4 and gawk-4.1.60.  The latter, and later versions,
stop collecting at two hex digits.
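
The "del" in the quoted od output is consistent with the older gawk
collecting all five hex digits (0x43FFF) and keeping only the low
eight bits, since od -a ignores the high-order bit and so shows byte
0xFF as del.  The arithmetic can be sketched in the shell; the
truncation to one byte is an assumption about the old implementation,
inferred from the output rather than checked against the gawk source:

```shell
# Assumption: the older parser accumulates every hex digit after \x
# and the result is then stored in a single byte.
value=$(( 0x43FFF ))             # all five hex digits collected
low_byte=$(( value & 0xFF ))     # what survives in one byte
od_view=$(( low_byte & 0x7F ))   # od -a ignores the high-order bit
printf '%d %d\n' "$low_byte" "$od_view"
```

Under that assumption the two numbers are 255 (0xFF) and 127, the
code of the ASCII DEL character.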

The C89 and C99 Standards say that octal and hex character constants
in strings are handled by collecting the longest sequence of octal or
hex digits (respectively) that can constitute the escape sequence.
However, they go on to note a constraint that the value of such a
sequence shall be in the range of the unsigned char type.

Thus, for the C language, the sample string "\x43FFF" violates the
constraint, even though it is legal according to the LR(1) grammar for
the language.

The 2001 IEEE POSIX Standard for awk differs: it does not include hex
escapes, it limits octal escapes to 3 digits, and it further says that
generation of a NUL character results in undefined behavior.
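
As a small illustration of the three-digit octal limit, the string
"\1033" should be read as the character '\103' (C) followed by a
literal 3.  This is only a sketch; the variable name octal_out is
made up for the example:

```shell
# Octal escapes stop after at most three digits, so the trailing 3
# is an ordinary character.
octal_out=$(awk 'BEGIN { print "\1033" }')
printf '%s\n' "$octal_out"
```

With an awk that follows the POSIX rule, this prints C3.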

The current version of the gawk manual says:

        '\NNN'
             The octal value NNN, where NNN stands for 1 to 3 digits between '0'
             and '7'.  For example, the code for the ASCII ESC (escape)
             character is '\033'.

        '\xHH...'
             The hexadecimal value HH, where HH stands for a sequence of
             hexadecimal digits ('0'-'9', and either 'A'-'F' or 'a'-'f').  A
             maximum of two digits are allowed after the '\x'.  Any further
             hexadecimal digits are treated as simple letters or numbers.
             (c.e.)  (The '\x' escape sequence is not allowed in POSIX awk.)

                  CAUTION: In ISO C, the escape sequence continues until the
                  first nonhexadecimal digit is seen.  For many years, 'gawk'
                  would continue incorporating hexadecimal digits into the value
                  until a non-hexadecimal digit or the end of the string was
                  encountered.  However, using more than two hexadecimal digits
                  produced undefined results.  As of version 4.2, only two
                  digits are processed.

To make your gawk code immune to the change in gawk's behavior, write
the hex constants in one string, and then concatenate with another
string:  replace 

        "\x41\x42\x43FFF" 

by 

        "\x41\x42\x43" "FFF"

if you want its value to be equivalent to "ABCFFF".
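
A quick way to check the split form, assuming an awk that accepts
'\x' escapes at all (gawk does; POSIX does not require it).  The
variable name split_out is just for the example:

```shell
# The hex escape ends at the closing quote of the first string, so
# the digits F, F, F in the second string stay literal.
split_out=$(awk 'BEGIN { print "\x41\x42\x43" "FFF" }')
printf '%s\n' "$split_out"
```

Because neither string contains more than two hex digits after a
'\x', the result should not depend on which side of the 4.2 change
the installed gawk falls.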

Of course, that may not be possible in all cases, but I suspect that
the trick would suffice except where string characters are being
generated on the fly.
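
For characters generated at run time, one possible alternative (not
part of the suggestion above) is to avoid string escapes altogether
and build the character from its numeric code with sprintf("%c", ...).
The variable code below is hypothetical and stands for a value
computed elsewhere:

```shell
# Build the character from a number instead of a \x escape, then
# concatenate the literal text that follows it.
gen_out=$(awk 'BEGIN {
    code = 67                        # hypothetical run-time value (0x43)
    print sprintf("%c", code) "FFF"  # character for code, then "FFF"
}')
printf '%s\n' "$gen_out"
```

No escape sequence is parsed here, so the question of how many hex
digits are collected does not arise.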


-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: address@hidden  -
- 155 S 1400 E RM 233                       address@hidden  address@hidden -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------


