emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: rosie/libpexl library for regex pattern composition


From: Danny McClanahan
Subject: Re: rosie/libpexl library for regex pattern composition
Date: Mon, 29 Jul 2024 13:58:13 +0000

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
> Date: Sun, 28 Jul 2024 09:08:06 +0200
>
> AFAIU, Rosie is based on parsing expression grammars, i.e. it's not a
regular language and it's in a different complexity class.

That's true! However, the PEXL project (essentially Rosie v2) is actually 
looking to move past the PEG formalism for being insufficiently expressive as 
well as presenting some implementation issues, particularly with respect to 
performance (see brief description at 
https://gitlab.com/pexlang/libpexl#libpexl-is-a-work-in-progress-library-for-text-matching-and-simple-parsing).

> On Sunday, July 28th, 2024 at 03:51, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Tree-sitter accesses buffer text one character at a time, for which an
> API is very simple, and we already have such an API modules could use:
> char-after. But I very much doubt that libraries like Rosie could do
> well with such APIs for its needs.

One of the improvements PEXL has over Rosie v1 is the `find` VM instruction 
(described in the above-linked README), which uses SIMD to search many bytes 
ahead at once instead of iterating char-by-char. This functionality something I 
am also hoping to demonstrate with my regex-emacs v2 engine, although I am 
working on a char-by-char NFA first. IIUC, regex engines tend to employ SIMD 
searching as a "prefilter" (see discussion in 
https://github.com/BurntSushi/rebar/blob/master/README.md) before employing the 
char-by-char matchers, which may be easier to separate out into some explicit 
multi-phase search API if needed.

This is mostly to say that both regex engines using automata as well as Rosie 
could likely be made to work with a char-by-char API alone, although it would 
obviously hamper some optimization opportunities. It may be an easier way to 
prototype new matching engines than building them within Emacs.

> The alternative is to give modules
> direct access to the entire buffer text, which IMO is unthinkable.

I had not considered the possibility of exposing a module API for extensible 
parsing engines as opposed to building it directly into Emacs like the current 
regex-emacs engine, but I definitely appreciate the idea of decoupling from a 
specific framework, especially with an eye to reducing the maintenance burden. 
From my current understanding, the benefits of building Rosie/PEXL directly 
into Emacs would be:
(1) could take advantage of e.g. SIMD search by directly accessing the gap 
buffer memory,
(2) since it would be reliably available to all Emacs users, could form a 
composeable elisp search API.

My impression from prior discussion on this list was that one long-term desire 
for Emacs lisp was a lisp-level API to explicitly represent and compose search 
patterns, vs e.g. the implicit caching performed for matchers from 
regex-emacs.c (currently represented to elisp developers as lisp strings). If 
that is a goal, it seems difficult to realize that without building the 
matching logic and the pattern compiler into Emacs itself (please correct me if 
I'm wrong about this).

However, if *that* is the goal, then it also seems appropriate to prototype 
such a lisp-level API first, before introducing external code like Rosie/PEXL; 
at a first glance, it should be possible to do this with the current 
regex-emacs.c. Once such a layer is established, we could even imagine exposing 
a module API that allows for pluggable parsing frameworks. If Rosie/PEXL offers 
significant benefits and is easy enough to integrate, we could then compare the 
pros and cons of building it in directly (and then offering a more expressive 
lisp-level matching API with its power) vs exposing a module API (which may be 
easier to maintain).

We would then be comparing the maintenance burden of pluggability over the 
maintenance burden of integrating Rosie/PEXL, but I believe the lisp-level 
matcher API (with explicit construction over implicit caching) can be 
prototyped right now without that bikeshedding, and seems to be something Emacs 
desires in any case.

Does that correctly outline the goals/desires for lisp-level matcher 
construction? I like the idea of figuring out that API first, since all it 
needs to start is "just" an explicit lisp object for compiled regexps, which 
alone was mentioned as a goal in prior discussions. I think in any case having 
established that API surface will definitely make it easier to integrate other 
regex/parsing frameworks, be they built into Emacs or via external module. As 
it stands, I would like to have made this relatively straightforward 
improvement to Emacs before proposing the integration of other more complex 
external dependencies.

Thanks,
Danny



reply via email to

[Prev in Thread] Current Thread [Next in Thread]