Re: Regexp bytecode disassembler

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Regexp bytecode disassembler

From:	Eli Zaretskii
Subject:	Re: Regexp bytecode disassembler
Date:	Fri, 20 Mar 2020 16:34:16 +0200

> From: Mattias Engdegård <address@hidden>
> Date: Fri, 20 Mar 2020 13:27:35 +0100
> 
> This patch adds a lisp-based regexp bytecode disassembler which is always 
> available without any runtime cost to the regexp engine. It is mainly a tool 
> for maintainers but curious users may find it useful as well. It has already 
> revealed one bug in the regexp compiler, now fixed (f189e5dc10).
> 
> Any objections against it being added (to master)?

Thanks.  I'm far from being an expert on the subject, so my comments
are limited to secondary aspects of this.

First, please document this in NEWS and in the ELisp manual.  IMNSHO,
this feature will be much less useful without documentation.

Second, please document the code more than you did.  I especially miss
any documentation of the regexp bytecode that is being disassembled;
it is IMO sub-optimal to ask people to read the regex C code to glean
that information while trying to understand what the disassembler does
and why.

> * src/search.c (Fregexp_bytecode): New function.
> (syms_of_search): Define the symbol.

Which symbol is that?

> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.

This is a new file, so it is customary just to say "New file" without
listing the functions, which are all new.

> +;; Decode compiled Emacs regexp bytecode and pretty-print.

  "Decode compiled Emacs regexp bytecode, and pretty-print them."

> +(defconst regexp-disasm--classes
> +  [word lower punct space upper multibyte alpha alnum graph print blank]
> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")

This is one place where a more detailed description in the doc string
could be most beneficial.

> +(defconst regexp-disasm--syntax-codes
> +  [whitespace punctuation word symbol
> +   open-parenthesis close-parenthesis expression-prefix string-quote
> +   paired-delimiter escape character-quote comment-start comment-end
> +   inherit comment-delimiter string-delimiter]
> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
> +but using names from `rx'.")

And this is another.  (Btw, the first line should be a complete
sentence.)

Any way to make sure these two get updated, or at least the build
displays an alert when the C sources are modified without also
updating the above defconst's?  At the very least, we should have
comments in the C sources to the effect that any change there needs an
update here.

> +;;;###autoload
> +(defun regexp-disasm (regexp)

Why do we need to auto-load this?

> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
                    ^^^^^^^^^^^
"of the form"

> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (1+ ofs)) 8))))
> +         (read-u24 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (+ ofs 1)) 8)
> +                                    (ash (aref bc (+ ofs 2)) 16))))
> +         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
> +                                   (- x (ash (logand x #x8000) 1)))))

Why lambda-forms and not functions (or desfsubst)?

> +         (mb (multibyte-string-p regexp))

Please use more self-describing names of variables.  E.g., how about
multibyte-p in this case?  Likewise regarding "bc".

> +                           (str (if mb
> +                                    (decode-coding-string raw 'utf-8-emacs)
> +                                  raw)))

This call to decode-coding-string needs a comment that explains why
it's needed.

> +               (pcase opcode
> +                 (0 (cons 'no-op 1))
> +                 (1 (cons 'succeed 1))

Is pcase really needed here?  It looks like a simple cond will do.

> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
                                             ^^^
"offset", please, not "ofs".

> +(defun regexp-disassemble (regexp)
> +  "Compile REGEXP and print the disassembled bytecode."

I think the fact that it compiles PATTERN is an implementation
detail.  The real purpose of this command is different.  Can you
propose a better description of that purpose?

> +  (interactive "XRegexp (evaluated): ")

This prompt should do a better job describing what kind of input is
expected here.

> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
> +       doc: /* Compile REGEXP and return the compiled bytecode.
> +The compiled bytecode is returned as a string; its format is
> +implementation-dependent.  Cached bytecode may be returned if available.  */)

Is this function useful on its own, or is it an internal API?  If the
latter, let's use an internal notation for its name, and let's
document that it's an internal function.

> +  struct regexp_cache *cache_entry = compile_pattern (
> +    regexp,
> +    NULL,
> +    (!NILP (BVAR (current_buffer, case_fold_search))
> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
> +    false,
> +    true);

That's not our style of formatting such function calls.  It is hard to
read.

And why are you using variables related to the current buffer here?
It sounds like something that will cause confusion in subtle cases
down the road.  I'm aware that we do something like that in search
APIs, but (a) that is controversial as well; and (b) this API is
different: it is not intended to search a buffer, and thus what the
current buffer-local settings are is almost completely irrelevant.  I
suggest instead to expose this argument to Lisp, so that callers could
decide what they want there.

Thanks again for working on this.

[Prev in Thread]

Current Thread

[Next in Thread]

Regexp bytecode disassembler, Mattias Engdegård, 2020/03/20
- Re: Regexp bytecode disassembler, Andreas Schwab, 2020/03/20
- Re: Regexp bytecode disassembler, Eli Zaretskii <=
  - Re: Regexp bytecode disassembler, Mattias Engdegård, 2020/03/21
    - Re: Regexp bytecode disassembler, Eli Zaretskii, 2020/03/21
    - Re: Regexp bytecode disassembler, Štěpán Němec, 2020/03/21
    - Re: Regexp bytecode disassembler, Eli Zaretskii, 2020/03/21
    - Re: Regexp bytecode disassembler, Štěpán Němec, 2020/03/21
    - Re: Regexp bytecode disassembler, Eli Zaretskii, 2020/03/22
    - Re: Regexp bytecode disassembler, Štěpán Němec, 2020/03/22
    - Re: Regexp bytecode disassembler, Eli Zaretskii, 2020/03/22
    - Re: Regexp bytecode disassembler, Štěpán Němec, 2020/03/22
    - Re: Regexp bytecode disassembler, Eli Zaretskii, 2020/03/22

Prev by Date: Calc; v p fails in script
Next by Date: Re: Regexp bytecode disassembler
Previous by thread: Re: Regexp bytecode disassembler
Next by thread: Re: Regexp bytecode disassembler
Index(es):
- Date
- Thread