Re: [bug-gawk] print statement problem with hexadecimal escape sequence
From: Nelson H. F. Beebe
Subject: Re: [bug-gawk] print statement problem with hexadecimal escape sequence
Date: Thu, 3 May 2018 15:28:19 -0600
> > $ awk 'BEGIN { print "\x41\x42\x43FFF" }' | od -a
> > 0000000 A B del nl
> > 0000004
A test on my numerous versions of gawk shows that the behavior changed
between gawk-4.1.4 and gawk-4.1.60. The latter, and later versions,
stop collecting at two hex digits.
The C89 and C99 Standards say that a hex escape in a string or
character constant collects the longest possible sequence of hex
digits, while an octal escape takes at most three octal digits.
However, they go on to note a constraint that the value of such a
sequence shall be in the range of the unsigned char type.
Thus, for the C language, the sample string "\x43FFF" violates the
constraint, even though it is legal according to the LR(1) grammar for
the language.
The 2001 IEEE POSIX Standard for awk differs: it does not include hex
escapes, it limits octal escapes to 3 digits, and it further says that
generation of a NUL character results in undefined behavior.
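As a small illustration of the three-digit octal limit, here is a
sketch that assumes a POSIX-style shell with some awk on the PATH; the
variable name octal_out is only for illustration:

```shell
# "\101" is octal for "A". The escape stops after three digits, so the
# fourth character "1" is ordinary text and not part of the escape.
octal_out=$(awk 'BEGIN { print "\1011" }')
printf '%s\n' "$octal_out"    # expected: A1
```

Because the length of an octal escape is bounded, it cannot swallow
following characters the way an unbounded hex escape can.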
The current version of the gawk manual says:
'\NNN'
The octal value NNN, where NNN stands for 1 to 3 digits between '0'
and '7'. For example, the code for the ASCII ESC (escape)
character is '\033'.
'\xHH...'
The hexadecimal value HH, where HH stands for a sequence of
hexadecimal digits ('0'-'9', and either 'A'-'F' or 'a'-'f'). A
maximum of two digits are allowed after the '\x'. Any further
hexadecimal digits are treated as simple letters or numbers.
(c.e.) (The '\x' escape sequence is not allowed in POSIX awk.)
CAUTION: In ISO C, the escape sequence continues until the
first nonhexadecimal digit is seen. For many years, 'gawk'
would continue incorporating hexadecimal digits into the value
until a non-hexadecimal digit or the end of the string was
encountered. However, using more than two hexadecimal digits
produced undefined results. As of version 4.2, only two
digits are processed.
To make your gawk code immune to the change in gawk's behavior, write
the hex constants in one string, and then concatenate with another
string: replace
"\x41\x42\x43FFF"
by
"\x41\x42\x43" "FFF"
if you want its value to be equivalent to "ABCFFF".
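A quick way to check the concatenated form is sketched below. It
assumes a POSIX-style shell and an awk that accepts '\x' escapes (gawk
does; POSIX awk does not require it), and the variable name concat_out
is just a placeholder:

```shell
# Adjacent string constants are concatenated by awk, so the escape
# "\x43" ends at the closing quote and "FFF" stays literal text.
concat_out=$(awk 'BEGIN { print "\x41\x42\x43" "FFF" }')
printf '%s\n' "$concat_out"   # gawk prints: ABCFFF
```

Since no hex digit follows any of the escapes inside the same string
constant, the result should not depend on how many digits a given gawk
version collects.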
Of course, that may not be possible in all cases, but I suspect that
the trick would suffice except where string characters are being
generated on-the-fly.
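For the on-the-fly case, one possible alternative (not something the
gawk manual prescribes for this problem) is to build the characters
from numeric codes with sprintf("%c", n), which avoids hex escapes in
the source altogether. The sketch below assumes a POSIX-style shell,
ASCII character codes, and illustrative names (built_out, code, s):

```shell
# Build the prefix from numeric codes; no hex escape appears in the
# awk source, so there is nothing for the lexer to over-collect.
built_out=$(awk 'BEGIN {
    n = split("65 66 67", code, " ")      # decimal codes for A, B, C
    s = ""
    for (i = 1; i <= n; i++)
        s = s sprintf("%c", code[i] + 0)  # + 0 forces a numeric argument
    print s "FFF"
}')
printf '%s\n' "$built_out"    # expected: ABCFFF
```

Codes above 127 may be treated differently in multibyte locales, so
this sketch is limited to ASCII values.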
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: address@hidden -
- 155 S 1400 E RM 233 address@hidden address@hidden -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------