Re: doc tweak re backslashes in bracket expressions
From: arnold
Subject: Re: doc tweak re backslashes in bracket expressions
Date: Thu, 07 Nov 2024 02:55:02 -0700
User-agent: Heirloom mailx 12.5 7/5/10
Hi Ed,
Here you go:
$ for i in nawk mawk gawk mksawk "busybox awk" wak goawk
> do echo ========== $i
> echo 'ab]cd' | $i '/[\]]/'
> done
========== nawk
ab]cd
========== mawk
ab]cd
========== gawk
ab]cd
========== mksawk
========== busybox awk
========== wak
========== goawk
ab]cd
ALL of the above awks accept and match literal \^ and \-
inside brackets.
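For anyone who wants to repeat the check for \^ and \-, here is a
minimal sketch along the lines of the loop above. The helper name
try_escapes is made up for illustration, the awk names are simply the
ones from my loop and are assumed to be on PATH (missing ones are
skipped), and I'm not pasting output for it here.

```shell
# Sketch only: try_escapes is a hypothetical helper, not part of any awk.
try_escapes() {
    # $1 is deliberately left unquoted when run, so that "busybox awk"
    # splits into a command and its argument.
    echo 'a^b' | $1 '/[\^]/'
    echo 'a-b' | $1 '/[\-]/'
}

for i in nawk mawk gawk mksawk "busybox awk" wak goawk
do
    # Skip any variant that is not installed; ${i%% *} is the first word.
    command -v "${i%% *}" >/dev/null 2>&1 || continue
    echo "========== $i"
    try_escapes "$i"
done
```

An awk that accepts the escaped character should echo the input line
back for each of the two tests.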
I understand that at work you can't install arbitrary programs,
but as an awk proponent who wishes to contribute, you should
install other variants at home. There are links in the gawk
manual for just about all of the variants I just ran.
Thanks,
Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
> Thanks Arnold. Regarding opening a ticket against POSIX - I wouldn't
> mind doing that (I have a few related others currently open) but
> unfortunately I don't have any other version of awk to test with and I
> don't know how other awks behave regarding \], \- or \^ inside a bracket
> expression. Do you know if all modern awks (e.g. not old awk and not
> mawk1) treat those as literal when escaped anywhere inside a bracket
> expression? Are there any other escape sequences you're aware of, inside
> or outside of bracket expressions, I should also cover in the ticket?
>
> Ed.
>
> On 11/3/2024 11:34 PM, arnold@skeeve.com wrote:
> > Hi Ed.
> >
> > I am adding something to the manual. IMHO it's a bug in the standard
> > that \] isn't mentioned; I suggest opening a ticket on it.
> >
> > Thanks,
> >
> > Arnold
> >
> > Ed Morton via "Bug reports only for gawk."<bug-gawk@gnu.org> wrote:
> >
> >> I finally came across the POSIX reference that allows awk to interpret
> >> ```\``` in a bracket expression as an escape character - it's in
> >> https://pubs.opengroup.org/onlinepubs/9799919799/utilities/awk.html#tag_20_06_13_04:
> >>
> >>> these escape sequences shall be recognized both inside and outside
> >>> bracket expressions.
> >>>
> >>> Escape Sequence: \\
> >>> Description: Two <backslash> characters.
> >>> Meaning: In the lexical token *ERE*, the sequence shall represent
> >>> itself. In the lexical token *STRING*, it shall represent a single
> >>> <backslash>.
> >>>
> >>> Escape Sequence: \c
> >>> Description: A <backslash> character followed by any character not
> >>> described in this table or in the table in XBD /5. File Format Notation/
> >>> <https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap05.html#tag_05>
> >>> ('\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v').
> >>> Meaning: Undefined
> >>>
> >> so inside or outside of a bracket expression `\\` has to mean `\` and
> >> the meaning of `\c` where `c` is any "ordinary character" is undefined
> >> by POSIX and so gawk can treat it however it likes, hence allowing
> >> `[a\]]` to mean "a or ]", for example. So, I'd consider that second part
> >> more "allowed by" rather than "mandated by" POSIX (POSIX doesn't mandate
> >> what `[a\]]` means to gawk) but maybe that's just nitpicking. I'd like
> >> to see the doc add that reference though as it took me hours wading
> >> through POSIX awk and regexp specs to find it.
> >>
> >> Ed.
> >>
> >> On 11/3/2024 7:50 AM, Ed Morton via Bug reports only for gawk. wrote:
> >>> Just a small tweak suggestion for the gawk documentation regarding
> >>> backslashes inside bracket expressions.
> >>>
> >>> https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html
> >>>
> >>> currently says (**emphasis mine**):
> >>>
> >>>> The treatment of ‘\’ in bracket expressions is compatible with other
> >>>> awk implementations **and is also mandated by POSIX**.
> >>> but POSIX, at least this 2024 incarnation of the spec, seems pretty
> >>> clear (see references below*) that a backslash inside a bracket
> >>> expression is not an escape character so per POSIX these would be
> >>> compliant behavior:
> >>>
> >>>> $ printf 'a\\d\n' | grep -E '[\]'
> >>>> a\d
> >>>> $ printf 'a\\d\n' | sed -En '/[\]/p'
> >>>> a\d
> >>> while these would not:
> >>>
> >>>> $ printf 'a\\d\n' | awk '/[\]/'
> >>>> awk: cmd. line:1: /[\]/
> >>>> awk: cmd. line:1: ^ unterminated regexp
> >>>> $ printf 'a\\d\n' | awk --posix '/[\]/'
> >>>> awk: cmd. line:1: /[\]/
> >>>> awk: cmd. line:1: ^ unterminated regexp
> >>> so maybe either remove that "and is also mandated by POSIX" statement
> >>> or provide a reference to where that behavior IS mandated by POSIX to
> >>> clear up any confusion.
> >>>
> >>> Ed.
> >>>
> >>> *From the current, 2024, POSIX regexp spec,
> >>> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html
> >>> (**emphasis mine**):
> >>>
> >>>>> [9.1 Regular Expression Definitions](https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_01)
> >>>>
> >>>>> ...
> >>>>> escape sequence
> >>>>>
> >>>>> The escape character followed by any single character, which is
> >>>>> thereby "escaped". The escape character is a \<backslash\> that is
> >>>>> **neither in a bracket expression** nor itself escaped.
> >>> which tells us that a backslash within a bracket expression is not an
> >>> escape character, and this:
> >>>
> >>>>> [9.3.5 RE Bracket Expression](https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05)
> >>>>> ... When the bracket
> >>>>> expression appears within an ERE, the special characters ... and
> >>>>> '```\```' (... and \<backslash\>, respectively) shall **lose their
> >>>>> special meaning within the bracket expression**
> >>> which reiterates that a backslash within a bracket expression has no
> >>> special meaning, and there's nothing I can see in [the POSIX awk
> >>> spec](https://pubs.opengroup.org/onlinepubs/9799919799/utilities/awk.html)
> >>> to override the above definitions.