Re: rx.el sexp regexp syntax (WAS: Off Topic)

From: Pierre Neidhardt
Subject: Re: rx.el sexp regexp syntax (WAS: Off Topic)
Date: Fri, 25 May 2018 18:47:59 +0200
Alan Mackenzie <address@hidden> writes:

>> rx.el is one of the best concepts I've discovered in a long time.
>> It's another instance of "Don't come up with a new (mini)language when
>> Lisp can do better": it's easier to learn, more flexible, easier to
>> write, much easier to read and as a consequence much more maintainable.
> Much easier than what?  Than the putative mini-language that doesn't get
> written?

I meant that in my opinion rx is easier to write than regexps.  Why it
is not more popular is the root of the question here.

>> I think it's high time we moved away from traditional regexps and
>> embraced the concept of rx.el.  I'm thinking of implementing it for
>> Guile.
> There's nothing stopping anybody from using rx.el.  However, people have
> mostly _not_ used it.  The "I think it's high time ...." suggests in
> some way forcing people to use it.  Before mandating something like
> this, I think we should find out why it's not already in common use.

Sorry if it sounded like I wanted to force anything; that wasn't my
intention.  I was referring to how long regexps have been around.

I thought the reason it's not already in common use had already been
discussed: it's barely referenced anywhere and needs more advertising.

Correct me if this is wrong.

>> At the moment the rx.el implementation is built on top of Emacs regexps
>> which are implemented in C.  I believe this does not use the power of
>> Lisp as much as it could.
> But would any alternative use the power of regexps?

Yes, rx.el is a drop-in replacement for regexps.  What do you mean?

> Emacs has a (moderately large) cache of regexps, so that building the
> automatons is done very rarely.  Possibly just once each for each
> session of Emacs.

That's the whole point: if possible (see below), remove the need for
regexp cache management.

>> In high-level languages, automatons are automatically cached to save the
>> cost of building them.
> Emacs Lisp does this too.

I did not exclude it :)

>> The rx.el library/concept could alleviate this issue altogether: because
>> we express the automaton directly in Lisp, the parsing step is not
>> needed and thus the building cost could be tremendously reduced.
>> So the rx.el building steps
>>   rx expression -> regexp string -> C regexp automaton
>> could boil down to simply
>>   rx automaton
> I don't see what you're trying to save, here.  At some stage, the regexp
> source, in whatever form, needs to be converted to an automaton.

Yes, that's what I meant by "rx automaton".  My suggestion (not
necessarily for Emacs Lisp) is to remove the step that converts the rx
symbolic automaton to a string, and the conversion from a string to the
actual automaton.
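
To make the current pipeline concrete, here is a minimal sketch of the two
steps as they exist today in Emacs.  The names `my-rx-demo-regexp' and
`my-rx-demo-match-p' are made up for illustration; `rx-to-string' and
`string-match' are the real entry points.  The exact string produced may
differ between Emacs versions, so the sketch does not depend on it.

```elisp
(require 'rx)

;; Step 1: the rx form is translated into an ordinary regexp string.
;; `rx-to-string' does this at run time; the `rx' macro does the same
;; translation at macro-expansion time.
(defvar my-rx-demo-regexp
  (rx-to-string '(seq line-start
                      (one-or-more digit)
                      "-"
                      (one-or-more digit)))
  "Hypothetical demo variable holding the generated regexp string.")

;; Step 2: the string is handed to the C regexp engine, which parses
;; and compiles it (or finds it in its cache) before running it.
(defun my-rx-demo-match-p (text)
  "Return t if TEXT starts with something like \"12-34\"."
  (and (string-match my-rx-demo-regexp text) t))
```

The suggestion is about skipping the intermediate string in step 1 and the
parsing in step 2, not about changing what gets matched.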

> Are you suggesting here building an interpreter in Lisp directly to
> execute rx expressions?

Yes, but maybe in Guile or some other Lisp.  Don't know if it's feasible
in Emacs Lisp.
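
As a rough illustration of what "executing rx expressions directly" could
look like, here is a toy matcher written in plain Lisp.  It is only a
sketch: it handles a tiny, hand-picked subset of rx-like forms (literal
strings, `anything', `seq', `or', `zero-or-more'), it is not how rx.el
works, and the function names are invented for this example.  It tracks
the set of positions where a match can end, much like simulating an NFA.

```elisp
(defun my-rx-ends (form str pos)
  "Return every position where a match of FORM in STR from POS can end.
FORM uses a tiny, hypothetical subset of rx syntax."
  (cond
   ;; A literal string matches itself.
   ((stringp form)
    (let ((end (+ pos (length form))))
      (and (<= end (length str))
           (string= form (substring str pos end))
           (list end))))
   ;; `anything' matches exactly one character.
   ((eq form 'anything)
    (and (< pos (length str)) (list (1+ pos))))
   ;; (seq A B ...): feed the end positions of each element to the next.
   ((eq (car-safe form) 'seq)
    (let ((starts (list pos)))
      (dolist (sub (cdr form))
        (let (next)
          (dolist (p starts)
            (dolist (e (my-rx-ends sub str p))
              (unless (memql e next) (push e next))))
          (setq starts (nreverse next))))
      starts))
   ;; (or A B ...): union of the end positions of all alternatives.
   ((eq (car-safe form) 'or)
    (let (ends)
      (dolist (alt (cdr form))
        (dolist (e (my-rx-ends alt str pos))
          (unless (memql e ends) (push e ends))))
      (nreverse ends)))
   ;; (zero-or-more A): keep applying A until no new position appears.
   ((eq (car-safe form) 'zero-or-more)
    (let ((ends (list pos))
          (frontier (list pos)))
      (while frontier
        (let ((p (pop frontier)))
          (dolist (e (my-rx-ends (cadr form) str p))
            (unless (memql e ends)
              (push e ends)
              (push e frontier)))))
      ends))
   (t (error "Unsupported form: %S" form))))

(defun my-rx-match-p (form str)
  "Return t if FORM matches the whole of STR."
  (and (memql (length str) (my-rx-ends form str 0)) t))
```

For example, (my-rx-match-p '(seq "ab" (zero-or-more (or "c" "d")) "e")
"abcdce") matches without any string ever being parsed.  Whether something
like this can be made fast is a separate question.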

>> It would be interesting to compare the performance.  This also means
>> that there would be no need for caching on behalf of the supporting
>> language.
> I will predict that an rx interpreter built in Lisp will be two orders
> of magnitude slower than the current regexp machine, where both the
> construction of an automaton, and the byte-code interpreter which runs
> it are written in C (and probably quite optimised C at that).

Obviously, and this is the prime reason why the author of rx.el
implemented it on top of the C regexp engine.  My point was that with a
fast Lisp (or dedicated C support), a Lisp automaton could be just as
fast: the Lisp code would map directly to the equivalent C automaton.

Again, I have no clue if that's doable in Emacs Lisp.

> I can't get excited about rx syntax, which I'm sure would be just as
> tedious, and possibly more difficult to read than a standard regexp.

Have you used rx?  The whole point of the library is to increase
readability, and it does a great job at it in my opinion.

> Analogously, as a musician, I read standard musical notation (with
> sets of five lines and dots) far more easily and fluently than I could
> any "simplified" system designed for beginners, which would be bloated
> by comparison.

rx.el is not meant to be "simplified for beginners".  You could also
reverse the analogy by saying that regexps are the "simplified version
for beginners"... The analogy does not map very well.

A better analogy would be the mapping between assembly and the
hexadecimal codes of CPU instructions: I don't think many people find
hexadecimal codes more explicit than assembly verbs and symbols
(although most assembly languages abuse abbreviations, but the
intention is there).

> Regular expressions can be difficult.  I don't believe this difficulty
> lies, in the main, in the compact notation used to express them.  Rather
> it lies in the concepts and the semantics of the regexp elements, and
> being able to express a "mental automaton" in regexp semantics.

The semantics of rx and regexps do not differ.  The difference is purely
one of notation.
Let's consider some points:

- rx can be written over multiple lines and indented.  This is a great
  readability booster for groups, which can be _grouped_ together with
  linebreaks and indentation.

- rx does not require escaping regexp special characters with
  backslashes.  Escaping is always a great source of confusion when
  switching from BRE to ERE, between different interpreters, and when
  storing regexps in Lisp strings, where backslashes must themselves be
  escaped.

- Symbols with non-trivial meanings in regexp (e.g. \<, :, ^, etc.) have
  a trivial _English_ counterpart in rx: (respectively "word-start",
  nothing, "line-start" _and_ "not").

- No more special-case symbols like "-" for ranges or "^" (negation when
  first character in square brackets).  Thus less cognitive burden.

- The "^" has a double meaning in regexp: "line-start" and "not".

The list goes on.
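
To illustrate the points above, here is a small side-by-side sketch.  The
pattern and the `my-demo-...' names are made up for this example; the rx
forms used (`line-start', `group', `or', `one-or-more', `not', `any',
`word-start') are standard ones.

```elisp
(require 'rx)

;; Traditional form: backslashes are doubled inside the Lisp string,
;; "^" means "line start" outside brackets but "not" inside them, and
;; "-" denotes a range.
(defvar my-demo-string-regexp "^\\(foo\\|bar\\)[^0-9]+\\<baz"
  "Hypothetical regexp written as a string.")

;; The same pattern as an rx form: one element per line, no
;; backslashes, and English names instead of punctuation.
(defvar my-demo-rx-regexp
  (rx line-start
      (group (or "foo" "bar"))
      (one-or-more (not (any (?0 . ?9))))
      word-start
      "baz")
  "Hypothetical regexp written with rx.")

(defun my-demo-both-match-p (text)
  "Return t if both regexps above match TEXT."
  (and (string-match-p my-demo-string-regexp text)
       (string-match-p my-demo-rx-regexp text)
       t))
```

Both variables end up holding a regexp string for the same engine; only
the way the pattern is written down differs.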

Pierre Neidhardt

