[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Regexp bytecode disassembler

From: Mattias Engdegård
Subject: Re: Regexp bytecode disassembler
Date: Sat, 21 Mar 2020 17:52:51 +0100

20 mars 2020 kl. 13.58 skrev Andreas Schwab <address@hidden>:

> This loop would create less garbage if you used '(no-op . 1) instead of
> (cons 'no-op 1).

Thank you, we can't have that now can we. Fixed.

20 mars 2020 kl. 15.34 skrev Eli Zaretskii <address@hidden>:

> First, please document this in NEWS and in the ELisp manual.  IMNSHO,
> this feature will be much less useful without documentation.

Sorry, I should have been clear on the point that this is primarily a debug and 
maintenance aid for the regexp-engine developer and not intended as a 
user-facing feature. Nobody is barred from using it, but they are expected to 
read the circuit schematics that comes with Emacs (ie, the source code).

In particular, there is no user interface to the regexp bytecode at all; users 
can't write program in it and have Emacs run them. It is also not stable in the 
slightest. Documenting the inner workings of the regexp engine would only put a 
burden on its maintainers.

>> * src/search.c (Fregexp_bytecode): New function.
>> (syms_of_search): Define the symbol.
> Which symbol is that?

I meant regexp-bytecode, sorry. Reworded.

>> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
>> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.
> This is a new file, so it is customary just to say "New file" without
> listing the functions, which are all new.

Understood and corrected.

>> +;; Decode compiled Emacs regexp bytecode and pretty-print.
>  "Decode compiled Emacs regexp bytecode, and pretty-print them."


>> +(defconst regexp-disasm--classes
>> +  [word lower punct space upper multibyte alpha alnum graph print blank]
>> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")
> This is one place where a more detailed description in the doc string
> could be most beneficial.

Expanded and made more precise.

>> +(defconst regexp-disasm--syntax-codes
>> +  [whitespace punctuation word symbol
>> +   open-parenthesis close-parenthesis expression-prefix string-quote
>> +   paired-delimiter escape character-quote comment-start comment-end
>> +   inherit comment-delimiter string-delimiter]
>> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
>> +but using names from `rx'.")
> And this is another.  (Btw, the first line should be a complete
> sentence.)


> Any way to make sure these two get updated, or at least the build
> displays an alert when the C sources are modified without also
> updating the above defconst's?  At the very least, we should have
> comments in the C sources to the effect that any change there needs an
> update here.

I added comments to the C sources, since I was unsure about whether it was 
worth the trouble to add Lisp accessors to the respective constants. After all, 
if the definitions get out of sync, then regexp-disasm stops working which 
isn't a disaster.

>> +;;;###autoload
>> +(defun regexp-disasm (regexp)
> Why do we need to auto-load this?

Actually, a function that returns the bytecode in symbolic form turned out to 
be useful in its own right, and I found it handy for some programmatic uses 
like comparing the bytecodes of two regexps.

>> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
>                    ^^^^^^^^^^^
> "of the form"


>> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
>> +                                    (ash (aref bc (1+ ofs)) 8))))
> Why lambda-forms and not functions (or desfsubst)?

Because they need to close over variables in scope. With lexical binding, elisp 
almost feels like a real programming language!

>> +         (mb (multibyte-string-p regexp))
> Please use more self-describing names of variables.  E.g., how about
> multibyte-p in this case?  Likewise regarding "bc".

Agreed about mb, thanks. I'm keeping 'bc' because it's used everywhere, and the 
shorthand definitely increases readability.

>> +                           (str (if mb
>> +                                    (decode-coding-string raw 'utf-8-emacs)
>> +                                  raw)))
> This call to decode-coding-string needs a comment that explains why
> it's needed.

You are quite right, and decoding turned out to be more complicated as well. 

>> +               (pcase opcode
>> +                 (0 (cons 'no-op 1))
>> +                 (1 (cons 'succeed 1))
> Is pcase really needed here?  It looks like a simple cond will do.

Well, pcase is a lot more readable here, don't you think?

>> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
>                                             ^^^
> "offset", please, not "ofs".

Right, thanks.

>> +(defun regexp-disassemble (regexp)
>> +  "Compile REGEXP and print the disassembled bytecode."
> I think the fact that it compiles PATTERN is an implementation
> detail.  The real purpose of this command is different.  Can you
> propose a better description of that purpose?

Agreed, and rewritten.

>> +  (interactive "XRegexp (evaluated): ")
> This prompt should do a better job describing what kind of input is
> expected here.

I'm not sure what else to say in the prompt. I found it more useful to input 
the regexp as a lisp expression than a string (for cut-and-paste from source 
code, or for rx) but maybe that's just me.

>> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
>> +       doc: /* Compile REGEXP and return the compiled bytecode.
>> +The compiled bytecode is returned as a string; its format is
>> +implementation-dependent.  Cached bytecode may be returned if available.  
>> */)
> Is this function useful on its own, or is it an internal API?  If the
> latter, let's use an internal notation for its name, and let's
> document that it's an internal function.

It's probably not useful for anything else. I made it internal.

>> +  struct regexp_cache *cache_entry = compile_pattern (
>> +    regexp,
>> +    NULL,
>> +    (!NILP (BVAR (current_buffer, case_fold_search))
>> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
>> +    false,
>> +    true);
> That's not our style of formatting such function calls.  It is hard to
> read.
> And why are you using variables related to the current buffer here?

You are right, the case folding is better made explicit. Fixed.

Revised patch attached.

Attachment: 0001-Add-regexp-bytecode-disassembler.patch
Description: Binary data

reply via email to

[Prev in Thread] Current Thread [Next in Thread]