bug#37659: rx additions: anychar, unmatchable, unordered-or
From: Mattias Engdegård
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Wed, 23 Oct 2019 11:15:47 +0200
On 22 Oct 2019, at 19:33, Paul Eggert <eggert@cs.ucla.edu> wrote:
>> Thus, instead of 'unordered-or', define the operator in terms of long
>> matches: 'or-max' (working name) would work like 'or' but guarantee a
>> longest match, and only permit strings and 'or-max' forms as arguments.
>
> That's an odd restriction. I'm not sure it's a good idea to add an operator
> with such a restriction. That is, I know why the restriction is there (it's
> because of limitations in the Emacs regexp matcher), but it's not clear that
> users should have to know and understand these details.
The restriction is simple and easy to document. It is not necessary to know the
underlying reason for it in order to use the construct effectively.
> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't
> plain "or" be greedy, to be consistent with other operators?
Yes, I very much favour switching to a DFA engine; is there another way? Even
then a backtracking engine would be needed for backrefs and other messy cases.
However, that is a much larger undertaking. (Meanwhile, we have
'posix-string-match' etc. for those who want greed at any cost.)
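
To make the difference concrete, here is a small sketch comparing the default backtracking matcher with 'posix-string-match'. The pattern, the input and the 'my-demo-' names are made up for illustration; the shorter alternative is deliberately listed first.

```elisp
;; Hypothetical pattern: "a" or "ab", shorter alternative first.
(defconst my-demo-pattern "a\\|ab"
  "Illustrative pattern only; not taken from any real code.")

(defun my-demo-match-length (matcher)
  "Return the length of the text matched by MATCHER on \"ab\".
MATCHER should be `string-match' or `posix-string-match'."
  (save-match-data
    (when (funcall matcher my-demo-pattern "ab")
      (- (match-end 0) (match-beginning 0)))))

(my-demo-match-length #'string-match)        ; => 1, first alternative wins
(my-demo-match-length #'posix-string-match)  ; => 2, longest match wins
```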
The problem that I'm trying to solve here is: how do we make it easy to match
one of multiple strings --- keywords, say --- in rx? Currently, the answer is
something like (regexp (regexp-opt my-keywords)), which doesn't integrate well
with rx user definitions. In addition, the output of one regexp-opt cannot be
used as input to another.
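
For reference, the current workaround might look roughly like the sketch below. 'my-keywords' is the placeholder name used above; the keyword list itself and 'my-keywords-regexp' are invented here. Passing a non-literal expression to the 'regexp' form assumes an rx that accepts one (Emacs 27 or later, as far as I know); with older versions 'rx-to-string' or an 'eval' form would be needed instead.

```elisp
(require 'rx)

(defvar my-keywords '("if" "ifdef" "else")
  "Hypothetical keyword list, for illustration only.")

;; The optimised regexp is spliced in as an opaque string; rx cannot
;; look inside it, so it does not combine with rx user definitions.
(defvar my-keywords-regexp
  (rx string-start (regexp (regexp-opt my-keywords)) string-end)
  "Regexp matching a string that is exactly one of `my-keywords'.")

(string-match-p my-keywords-regexp "ifdef")   ; => 0
(string-match-p my-keywords-regexp "ifdefx")  ; => nil
```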
'or-max' would allow a user to say

  (rx-define veggies (or-max "carrot" "tomato" "cucumber"))
  (rx-define meats (or-max "beef" "chicken" "pork"))
  ... (rx (or-max veggies meats)) ...

and get a regexp that is guaranteed to be greedy, well-optimised as if all
strings were passed to 'regexp-opt' at once, and robust: a small change won't
change the behaviour radically, and the user won't have to game or second-guess
the engine in order to produce the desired result.
If, in the future, 'or' becomes greedy, then 'or-max' will just be a synonym.
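
Since 'or-max' is only a proposal at this point, the snippet below merely approximates the intended result with what exists today. It assumes the keyword sets are kept as plain Lisp lists (the 'my-' variable names are made up) and handed to 'regexp-opt' in one go, relying on its documented default of matching the longest string possible.

```elisp
(defconst my-veggies '("carrot" "tomato" "cucumber")
  "Hypothetical list standing in for the `veggies' definition.")
(defconst my-meats '("beef" "chicken" "pork")
  "Hypothetical list standing in for the `meats' definition.")

;; A single regexp-opt call over all strings: optimised as a whole,
;; and independent of the order in which the strings are listed.
(defconst my-food-regexp (regexp-opt (append my-veggies my-meats)))

(string-match-p my-food-regexp "roast chicken")  ; => 6

;; Longest match even when a prefix is listed first.
(let ((re (regexp-opt '("bee" "beef"))))
  (string-match re "beef")
  (match-end 0))                                 ; => 4
```

The drawback, and the motivation for 'or-max', is that the lists have to be carried around separately from any rx definitions built on top of them.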
> If it's too much trouble to make plain "or" greedy, I suggest just
> documenting it as possibly being greedy and possibly not (that is, document
> it as being unordered, even if it happens to be ordered now). This will give
> us more opportunity for optimization later.
That would make rx strictly less useful than string regexps. That is why
'unordered-or' was a mistake: the unpredictability made it useless in many
cases, and everyone would just have used regexp-opt (or skipped rx altogether).
It is desirable to have the same semantics for 'or' in rx and \| in string
regexps; otherwise, translating and understanding become unnecessarily difficult.
We could say that 'or' and \| either match greedily or in left-to-right order.
However, I'm not sure this solves any problem right now.