auctex-devel

Saving TeX auto parsers's regexp groups


From: Arash Esbati
Subject: Saving TeX auto parsers's regexp groups
Date: Wed, 19 Jul 2023 12:43:58 +0200
User-agent: Gnus/5.13 (Gnus v5.13)

Hi all,

there is a comment in the function `TeX-auto-parse-region' saying:

  ;; TODO: Emacs allows at most 255 groups in a regexp, see the
  ;; "#define MAX_REGNUM 255" in regex-emacs.c.  If our regex
  ;; has more groups, bad things may happen, e.g.,
  ;; (match-beginning 271) returns nil although the regexp that
  ;; matched contains group number 271.  Sadly, MAX_REGNUM is
  ;; not exposed to Lisp, so we need to hard-code it here (and
  ;; sometimes check if it increased in newer Emacs versions).

I haven't asked the Emacs developers about raising that limit, but I see
that AUCTeX is also somewhat wasteful with regexp groups during parsing.
Starting with plain-TeX mode, open a file like this:

--8<---------------cut here---------------start------------->8---
\noindent
{\bf TeX-engine}:

\noindent
This variable allows you to choose which TeX engine should be used for
typesetting the document, i.e. the executables which will be used when
you invoke the `TeX' or `LaTeX' commands.

\bye

%%% Local Variables:
%%% mode: plain-tex
%%% TeX-master: t
%%% End:
--8<---------------cut here---------------end--------------->8---

and then evaluate this function in the *scratch* buffer:

--8<---------------cut here---------------start------------->8---
(defun TeX-auto-parse-region (regexp-list beg end)
  "Parse TeX information according to REGEXP-LIST between BEG and END."
  (if (symbolp regexp-list)
      (setq regexp-list (and (boundp regexp-list) (symbol-value regexp-list))))
  (if regexp-list
      ;; Extract the information.
      (let* (groups
             (count 1)
             (regexp (concat "\\("
                             (mapconcat
                              (lambda(x)
                                (push (cons count x) groups)
                                (setq count
                                      (+ 1 count
                                         (TeX-regexp-group-count (car x))))
                                (car x))
                              regexp-list "\\)\\|\\(")
                             "\\)"))
             syms
             lst)
        ;; TODO: Emacs allows at most 255 groups in a regexp, see the
        ;; "#define MAX_REGNUM 255" in regex-emacs.c.  If our regex
        ;; has more groups, bad things may happen, e.g.,
        ;; (match-beginning 271) returns nil although the regexp that
        ;; matched contains group number 271.  Sadly, MAX_REGNUM is
        ;; not exposed to Lisp, so we need to hard-code it here (and
        ;; sometimes check if it increased in newer Emacs versions).
        ;; The following line added:
        (message "The TeX auto-parser's regexp used %d groups" count)
        ;; End addition
        (when (> count 255)
          (error "The TeX auto-parser's regexp has too many groups (%d)" count))
        (setq count 0)
        (goto-char (if end (min end (point-max)) (point-max)))
        (while (re-search-backward regexp beg t)
          (let* ((entry (cdr (TeX-member nil groups
                                         (lambda (_a b)
                                           (match-beginning (car b))))))
                 (symbol (nth 2 entry))
                 (match (nth 1 entry)))
            (unless (TeX-in-comment)
              (looking-at (nth 0 entry))
              (if (fboundp symbol)
                  (funcall symbol match)
                (puthash (if (listp match)
                             (mapcar #'TeX-match-buffer match)
                           (TeX-match-buffer match))
                         (setq count (1- count))
                         (cdr (or (assq symbol syms)
                                  (car (push
                                        (cons symbol
                                              (make-hash-table :test #'equal))
                                        syms)))))))))
        (setq count 0)
        (dolist (symbol syms)
          (setq lst (symbol-value (car symbol)))
          (while lst
            (puthash (pop lst)
                     (setq count (1+ count))
                     (cdr symbol)))
          (maphash (lambda (key value)
                     (push (cons value key) lst))
                   (cdr symbol))
          (clrhash (cdr symbol))
          (set (car symbol) (mapcar #'cdr (sort lst #'car-less-than-car)))))))
--8<---------------cut here---------------end--------------->8---

and go back to the TeX file and hit 'C-c C-n'.  You should get

  The TeX auto-parser’s regexp used 19 groups

in the *Messages* buffer.  If I apply this patch to tex.el,

--8<---------------cut here---------------start------------->8---
diff --git a/tex.el b/tex.el
index 064e694d..c7010f94 100644
--- a/tex.el
+++ b/tex.el
@@ -4267,21 +4267,16 @@ alter the numbering of any ordinary, non-shy groups.")

 (defvar plain-TeX-auto-regexp-list
   (let ((token TeX-token-char))
-    `((,(concat "\\\\def\\\\\\(" token "+\\)[^a-zA-Z@]")
+    `((,(concat "\\\\\\(?:def\\|let\\)\\\\\\(" token "+\\)[^a-zA-Z@]")
        1 TeX-auto-symbol-check)
-      (,(concat "\\\\let\\\\\\(" token "+\\)[^a-zA-Z@]")
-       1 TeX-auto-symbol-check)
-      (,(concat "\\\\font\\\\\\(" token "+\\)[^a-zA-Z@]") 1 TeX-auto-symbol)
-      (,(concat "\\\\chardef\\\\\\(" token "+\\)[^a-zA-Z@]") 1 TeX-auto-symbol)
-      (,(concat "\\\\new\\(?:count\\|dimen\\|muskip\\|skip\\)\\\\\\(" token
-                "+\\)[^a-zA-Z@]")
+      (,(concat "\\\\"
+                (regexp-opt '("font" "newfont" "chardef" "mathchardef"
+                              "newcount" "newdimen" "newmuskip" "newskip"))
+                "{?\\\\\\(" token "+\\)}?[^a-zA-Z@]")
        1 TeX-auto-symbol)
-      (,(concat "\\\\newfont{?\\\\\\(" token "+\\)}?") 1 TeX-auto-symbol)
       (,(concat "\\\\typein\\[\\\\\\(" token "+\\)\\]") 1 TeX-auto-symbol)
       ("\\\\input +\\([^#}%\"\\\n\r]+?\\)\\(?:\\.[^#}%/\"\\.\n\r]+\\)?"
-       1 TeX-auto-file)
-      (,(concat "\\\\mathchardef\\\\\\(" token "+\\)[^a-zA-Z@]")
-       1 TeX-auto-symbol)))
+       1 TeX-auto-file)))
   "List of regular expression matching common plain TeX macro definitions.")

 (defvar TeX-auto-full-regexp-list plain-TeX-auto-regexp-list
--8<---------------cut here---------------end--------------->8---

rebuild AUCTeX, restart Emacs and repeat the procedure, I get

  The TeX auto-parser’s regexp used 9 groups

which is 10 regexp groups fewer.  So the question is: Should we go this
route?  I haven't run any benchmarks on how performance changes when
several single-regexp entries are merged into one larger regexp.  The
real target would be latex.el.  Running a similar test with a LaTeX file
says:

  The TeX auto-parser’s regexp used 125 groups

So about half of the 255-group budget is already gone after loading
vanilla LaTeX, with no other AUCTeX styles loaded.
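
To illustrate where the saving comes from, here is a minimal sketch that
is not part of AUCTeX.  It assumes `regexp-opt-depth' (from Emacs'
regexp-opt.el) as a stand-in for `TeX-regexp-group-count', and a
simplified ASCII-only token class in place of `TeX-token-char'; all
`my-' names are hypothetical.  It mirrors the counter arithmetic of
`TeX-auto-parse-region': every entry costs one wrapping group plus its
own capturing groups, so merging three one-group entries into a single
entry saves two wrappers and two captures.

```elisp
;;; -*- lexical-binding: t -*-
(require 'regexp-opt)

;; Assumption: simplified stand-in for AUCTeX's `TeX-token-char'.
(defvar my-token "[a-zA-Z]")

(defun my-parser-group-count (regexps)
  "Return the final value of the group counter for REGEXPS.
Mirrors `TeX-auto-parse-region': the counter starts at 1 and each
entry adds one wrapping group plus its own capturing groups."
  (let ((count 1))
    (dolist (re regexps count)
      (setq count (+ 1 count (regexp-opt-depth re))))))

;; Three separate entries, one capturing group each.
(defvar my-separate
  (mapcar (lambda (macro)
            (concat "\\\\" macro "\\\\\\(" my-token "+\\)[^a-zA-Z@]"))
          '("font" "chardef" "mathchardef")))

;; One merged entry; `regexp-opt' without a PAREN argument only
;; produces shy groups, so there is still a single capturing group.
(defvar my-merged
  (list (concat "\\\\" (regexp-opt '("font" "chardef" "mathchardef"))
                "\\\\\\(" my-token "+\\)[^a-zA-Z@]")))

(my-parser-group-count my-separate)     ; => 7
(my-parser-group-count my-merged)       ; => 3

;; The merged regexp still captures the macro name in group 1.
(let ((text "\\chardef\\foo="))
  (and (string-match (car my-merged) text)
       (match-string 1 text)))          ; => "foo"
```

The same helper could be pointed at the `car's of a real regexp list
to see how much a given merge would save before touching latex.el.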

Any comments welcome.

Best, Arash


