Re: quote removal issues within character class
From: Oğuz
Subject: Re: quote removal issues within character class
Date: Sat, 9 Nov 2019 16:45:31 +0200
You've already answered it, thank you. I didn't know that [:, [., [= were
special *sequences*; I guess I overlooked that part. Thanks again for
taking the time to explain it in detail, I'm grateful.
On Saturday, 9 November 2019, Robert Elz <kre@munnari.oz.au> wrote:
> Date: Sat, 9 Nov 2019 07:35:16 +0300
> From: Oğuz <oguzismailuysal@gmail.com>
> Message-ID: <CAH7i3Lr68CiVXLR9_HoOgQa7Vd-zyVZ+fck-0K3uQPTNSirU2Q@mail.gmail.com>
>
> | is correct, as "foo" does not contain a ']' which would be required
> | > to match there (quoting the ':' means there is no character class,
> | > hence we have instead (the negation of) a char class containing '['
> | > ':' 'l' 'o' 'w' 'e' 'r' (and ':' again), preceded by anything, and
> | > followed by ']' and anything). foo does not match. f]oo would.
> | >
> |
> | where exactly is this documented in the standard?
>
> I'm not sure which part exactly you're looking for, but char sets in sh
> are specified to be the same as in REs, except that ! replaces ^ as the
> negation character (that's in XCU 2.13.1). Char sets (bracket expressions)
> in RE's are documented in XBD 9.3.5 wherein it states
>
> A bracket expression is either a matching list expression or a
> non-matching list expression. It consists of one or more expressions:
> ordinary characters, collating elements, collating symbols,
> equivalence classes, character classes, or range expressions.
> The <right-square-bracket> (']') shall lose its special meaning and
> represent itself in a bracket expression if it occurs first in the
> list (after an initial <circumflex> ('^'), if any).
>
> Otherwise, it shall terminate the bracket expression,
>
> That is, a ']' that occurs anywhere else terminates the bracket expression
> except:
>
> unless it appears in a collating symbol (such as "[.].]")
>
> (not relevant in the given example)
>
> or is the ending <right-square-bracket> for a collating symbol,
> equivalence class, or character class.
>
> So the ']' that immediately follows the second ':' would not terminate the
> bracket expression if it is the ending ']' for a character class
> (collating symbols and equiv classes not being relevant to the example).
> Of course, that can only happen if there is a character class to end.
>
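A minimal sketch of the ']' rules quoted above, written as sh `case` patterns (where '!' takes the place of '^'). The helper name `matches` is made up for illustration and does not come from the thread; the results noted in the comments are what the XBD 9.3.5 text above calls for.

```shell
# matches STRING PATTERN: hypothetical helper, prints whether STRING
# matches the sh pattern PATTERN. $2 is left unquoted on purpose so that
# its characters keep their special pattern meaning.
matches() {
    case $1 in
        $2) echo "match" ;;
        *)  echo "no match" ;;
    esac
}

matches ']' '[]a]'            # match: ']' first in the list is literal
matches 'b' '[!]a]'           # match: 'b' is neither ']' nor 'a'
matches ']' '[!]a]'           # no match: ']' is in the negated list
matches 'q' '[[:lower:]]'     # match: the first ']' only ends the class
matches '_' '[[:lower:]_]'    # match: the list continues after ":]"
```
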
> There's also
>
> The special characters '.', '*', '[', and '\\'
> (<period>, <asterisk>, <left-square-bracket>, and <backslash>,
> respectively) shall lose their special meaning within a bracket
> expression.
>
> whereupon if the [": sequence does not start a char class, the '[' there
> is simply a literal char inside the bracket expression.
>
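As a rough illustration of that paragraph in sh pattern terms: inside a bracket expression, '*', '.' and a '[' that does not begin one of the special sequences are plain list members. ('.' is special only in REs, not in sh patterns; it is included for completeness.) Backslash is left out because in sh it remains a quoting character. The helper name `matches` is hypothetical.

```shell
# matches STRING PATTERN: hypothetical helper, prints whether STRING
# matches the sh pattern PATTERN ($2 is deliberately unquoted).
matches() {
    case $1 in
        $2) echo "match" ;;
        *)  echo "no match" ;;
    esac
}

# The list below holds three ordinary characters: '*', '.' and '['.
matches '*' '[*.[]'    # match
matches '.' '[*.[]'    # match
matches '[' '[*.[]'    # match
matches 'x' '[*.[]'    # no match: '*' in the list is not a wildcard
```
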
> Similarly if the bracket expression ends at the first ']' (the one
> immediately after the second ':') the following ']' is simply a literal
> character, as ']' chars are special only when following a '['.
>
> So, all that's left to determine is whether the [": sequence can be
> considered as beginning a char class.
>
> In a RE it certainly cannot - quote chars (' and ") are not special in
> REs at all, and [": is no different syntactically than [x: which no-one
> would treat as being the introduction to a char class.
>
> This is also, I believe (Chet can confirm, or refute, if he desires) where
> bash gets the interpretation that "lower" (including the quotes) is the
> name of the char class in [:"lower":] except that it cannot be, as char
> class names cannot contain quote characters (which should lead to the
> whole sub-expression not being treated as a char class at all, instead
> bash treats it, I think, as if it were an unknown but valid class name).
>
> But when it comes from sh, quote chars are "different" and instead of
> just being characters, they instead affect the interpretation of the
> characters that are quoted. See XCU 2.2:
>
> Quoting is used to remove the special meaning of certain characters
> or words to the shell.
>
> Quoting can be used to preserve the literal meaning of the special
> characters in the next paragraph [...]
>
> and the following may need to be quoted under certain circumstances.
> That is, these characters may be special depending on conditions
> described elsewhere in this volume of POSIX.1-2017:
>
> * ? [ # ~ = %
>
> to which more chars have been added (as I recall) recently by some
> Austin Group correction (which I think includes ! : - and ]), that is
> to make it clear, that in sh
>
> [a'-'z]
>
> is a bracket expression containing 3 chars 'a' '-' and 'z' (which form
> of quoting is used to remove the specialness of the '-' is irrelevant),
> and that "[a-z]" isn't a bracket expression at all (neither of which
> is true in an RE - though the role of \ in REs is being altered slightly
> so if it had been [a\-z] in an RE things are less clear).
>
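A small sketch of the cases just described, assuming a POSIX sh. The function names are invented for illustration. Each function hard-codes its pattern, because the quoting has to appear literally in the `case` pattern to have any effect.

```shell
# [a'-'z]: the quoted '-' is literal, so the list is 'a', '-' and 'z'.
in_quoted_dash_set() {
    case $1 in
        [a'-'z]) echo "match" ;;
        *)       echo "no match" ;;
    esac
}

# [a-z]: the unquoted '-' forms a range.
in_range() {
    case $1 in
        [a-z]) echo "match" ;;
        *)     echo "no match" ;;
    esac
}

# "[a-z]": fully quoted, so it is the literal five-character string.
is_literal_string() {
    case $1 in
        "[a-z]") echo "match" ;;
        *)       echo "no match" ;;
    esac
}

in_quoted_dash_set m        # no match
in_quoted_dash_set -        # match
in_range m                  # match
is_literal_string m         # no match
is_literal_string '[a-z]'   # match
```
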
> The effect of this is that in sh, in an expression like
>
> [![":lower":]]
>
> the first ':' is not "special" and hence cannot form part of the
> magic opening '[:' sequence for a character class. Hence this
> expression contains no character class, and consequently the
> ':]' chars are simply a ':' in the bracket expression, and then
> the terminating ']' - which leaves the second ']' being just a
> literal character.
>
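One way to see what a particular shell does with this pattern is a small probe like the one below. It is only a sketch, and `probe` is a hypothetical name. Under the reading given above, foo should not match and f]oo should, but results may differ between shells, so no expected output is shown.

```shell
# probe STRING: hypothetical helper, reports whether STRING matches the
# pattern discussed above, surrounded by '*' on both sides.
probe() {
    case $1 in
        *[![":lower":]]*) echo "match" ;;
        *)                echo "no match" ;;
    esac
}

probe foo
probe 'f]oo'
```
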
>
> While here (these following parts are not relevant to your question I
> believe) when used in sh
>
> [[:"lower":]]
>
> should be treated just the same as
>
> [[:lower:]]
>
> for the same reason that
>
> ["abc"]
>
> is treated the same as
>
> [abc]
>
> That is, quoted characters that are not special are no different
> than the same character unquoted. That's universal in sh, quoting
> removes special meaning (of lots of things) but where there was none
> the quoting changes nothing at all, eg:
>
> "ls" \-'l'
>
> is exactly the same as
>
> ls -l
>
> and
> x="foo" y=''
> is identical to
> x=foo y=
> (though not all empty quoted strings are irrelevant that way).
>
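The equivalences above can be checked with a short sketch, assuming a POSIX sh. `show_words` and `in_abc` are hypothetical helpers introduced only for this illustration. The last two calls show the caveat about empty quoted strings: one that stands alone is still a word of its own.

```shell
# show_words: hypothetical helper, prints each argument inside <...>.
show_words() {
    printf '<%s>' "$@"
    printf '\n'
}

show_words "ls" \-'l'     # <ls><-l>
show_words ls -l          # <ls><-l>

x="foo" y=''
show_words "$x" "$y"      # <foo><>
x=foo y=
show_words "$x" "$y"      # <foo><>

# Quoted characters with no special meaning behave as if unquoted.
in_abc() {
    case $1 in
        ["abc"]) echo "match" ;;
        *)       echo "no match" ;;
    esac
}
in_abc b                  # match
in_abc d                  # no match

# An empty quoted string on its own is not irrelevant: it adds a word.
show_words a '' b         # <a><><b>
show_words a b            # <a><b>
```
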
> There are other issues that are less clear what should happen, if your
> example had been
>
> [![:"lower:"]]
>
> then we get into very murky water indeed. XBD 9.3.5 says:
>
> The character sequences "[.", "[=", and "[:" (<left-square-bracket>
> followed by a <period>, <equals-sign>, or <colon>) shall be special
> inside a bracket expression
>
> [aside: not related to my current point, the "shall be special" is what
> enables sh quoting to stop that from happening, since quoting in the shell
> prevents specialness from happening]
>
> and are used to delimit collating symbols, equivalence class
> expressions, and character class expressions.
>
> That part (so far) is clear and non-controversial.
>
> These symbols shall be followed by a valid expression and the
> matching terminating sequence ".]", "=]", or ":]", as described
> in the following items.
>
> That's the part that is less clear. When a valid expression and the
> terminating sequence appear, there is no issue, and all is fine - what
> is less clear is what happens when one of those requirements is not met.
>
> Some read this as purely a requirement on the application - what the
> script writer is required to do - and when they don't the implementation
> (sh or RE library, or whatever) is free to interpret things (which means
> the whole pattern) however it likes (often as not being a pattern at all).
>
> Personally I disagree - I believe it is a requirement on the application
> if it desires the relevant sequence to be interpreted as a char class (etc)
> and if the application does not include a valid expression or terminating
> sequence the implementation should be required to treat the opening
> char sequence as if it did not begin a char class (etc) and the [: were
> simply 2 chars contained in the bracket expression (they must be in
> a bracket expression or the issue doesn't arise at all).
>
> Unfortunately (for the world in general, in that more and more of this
> is becoming unspecified, which makes it harder and harder to know what
> any particular sequence of characters will do) it seems like the former
> interpretation is the more likely to be adopted.
>
> If I have not understood the "this" in your
>
> where exactly is this documented
>
> please be more precise, and I will try to answer.
>
> kre
>
>