[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: rosie/libpexl library for regex pattern composition
From: |
Danny McClanahan |
Subject: |
Re: rosie/libpexl library for regex pattern composition |
Date: |
Mon, 29 Jul 2024 13:58:13 +0000 |
> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
> Date: Sun, 28 Jul 2024 09:08:06 +0200
>
> AFAIU, Rosie is based on parsing expression grammars, i.e. it's not a
regular language and it's in a different complexity class.
That's true! However, the PEXL project (essentially Rosie v2) is actually
looking to move past the PEG formalism for being insufficiently expressive as
well as presenting some implementation issues, particularly with respect to
performance (see brief description at
https://gitlab.com/pexlang/libpexl#libpexl-is-a-work-in-progress-library-for-text-matching-and-simple-parsing).
> On Sunday, July 28th, 2024 at 03:51, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Tree-sitter accesses buffer text one character at a time, for which an
> API is very simple, and we already have such an API modules could use:
> char-after. But I very much doubt that libraries like Rosie could do
> well with such APIs for its needs.
One of the improvements PEXL has over Rosie v1 is the `find` VM instruction
(described in the above-linked README), which uses SIMD to search many bytes
ahead at once instead of iterating char-by-char. This functionality something I
am also hoping to demonstrate with my regex-emacs v2 engine, although I am
working on a char-by-char NFA first. IIUC, regex engines tend to employ SIMD
searching as a "prefilter" (see discussion in
https://github.com/BurntSushi/rebar/blob/master/README.md) before employing the
char-by-char matchers, which may be easier to separate out into some explicit
multi-phase search API if needed.
This is mostly to say that both regex engines using automata as well as Rosie
could likely be made to work with a char-by-char API alone, although it would
obviously hamper some optimization opportunities. It may be an easier way to
prototype new matching engines than building them within Emacs.
> The alternative is to give modules
> direct access to the entire buffer text, which IMO is unthinkable.
I had not considered the possibility of exposing a module API for extensible
parsing engines as opposed to building it directly into Emacs like the current
regex-emacs engine, but I definitely appreciate the idea of decoupling from a
specific framework, especially with an eye to reducing the maintenance burden.
From my current understanding, the benefits of building Rosie/PEXL directly
into Emacs would be:
(1) could take advantage of e.g. SIMD search by directly accessing the gap
buffer memory,
(2) since it would be reliably available to all Emacs users, could form a
composeable elisp search API.
My impression from prior discussion on this list was that one long-term desire
for Emacs lisp was a lisp-level API to explicitly represent and compose search
patterns, vs e.g. the implicit caching performed for matchers from
regex-emacs.c (currently represented to elisp developers as lisp strings). If
that is a goal, it seems difficult to realize that without building the
matching logic and the pattern compiler into Emacs itself (please correct me if
I'm wrong about this).
However, if *that* is the goal, then it also seems appropriate to prototype
such a lisp-level API first, before introducing external code like Rosie/PEXL;
at a first glance, it should be possible to do this with the current
regex-emacs.c. Once such a layer is established, we could even imagine exposing
a module API that allows for pluggable parsing frameworks. If Rosie/PEXL offers
significant benefits and is easy enough to integrate, we could then compare the
pros and cons of building it in directly (and then offering a more expressive
lisp-level matching API with its power) vs exposing a module API (which may be
easier to maintain).
We would then be comparing the maintenance burden of pluggability over the
maintenance burden of integrating Rosie/PEXL, but I believe the lisp-level
matcher API (with explicit construction over implicit caching) can be
prototyped right now without that bikeshedding, and seems to be something Emacs
desires in any case.
Does that correctly outline the goals/desires for lisp-level matcher
construction? I like the idea of figuring out that API first, since all it
needs to start is "just" an explicit lisp object for compiled regexps, which
alone was mentioned as a goal in prior discussions. I think in any case having
established that API surface will definitely make it easier to integrate other
regex/parsing frameworks, be they built into Emacs or via external module. As
it stands, I would like to have made this relatively straightforward
improvement to Emacs before proposing the integration of other more complex
external dependencies.
Thanks,
Danny