Christopher Genovese <
address@hidden> writes:
> /Semi-verbose Preamble/.
> Having recently begun intensive use of org-mode for tangling source
> files, I encountered four issues related to comment extraction (two
> bugs, one undesirable behavior, and one ... unfulfilled need), which I
> describe in detail below. I started by creating an org file that would
> reproduce the problems, and soon started /describing/ the problems in
> the org file as well as putting my fixes in the source blocks. At the
> risk of it being too meta or annoying, I've included that org file at
> the end of this message as the problem description. All the details are
> there as well as two fixes. Tangling that file in various ways described
> demonstrates the problems, and you can export to PDF for nicer reading.
> (I've attached the PDF to this mail for convenience. It looks good;
> kudos, org-mode!) I've also attached a tarball with files that make it
> easy to try my changes and to restore the original behavior, as well as
> tests and results from the org file for easy comparison. See the
> included README.
>
> I've been using the revised code now for a few days. It fixes the
> problems I describe, and I think it provides a flexible framework for
> comment extraction with minimal change to the base code. If the reaction
> to this is positive, I will happily submit a patch, sign paperwork, or
> whatever is needed, after fixing any problems that you all see. In any
> case, I very much look forward to any feedback you can offer. Thanks.
>
> -- Christopher
>
> P.S. In case the attachments get dropped, I've put the PDF and the
> tarball at
>
>
http://www.stat.cmu.edu/~genovese/depot/tangle-error.pdf
>
http://www.stat.cmu.edu/~genovese/depot/tangle-bundle.tgz
>
> /Problem Description/
> ################ Cut Here ####################
> # -*- org-return-follows-link: t; comment-empty-lines: t; -*-
> #+TITLE: Tangle this file: four issues with org-babel-tangle
> #+AUTHOR: Christopher Genovese
> #+DATE: 14 Sep 2011\vspace*{-0.5cm}
> #+EMAIL:
address@hidden
> #+OPTIONS: toc:1 H:1
> #+BABEL: :tangle yes :comments org :results silent :exports code
> #+BIND: org-export-latex-hyperref-format "\\hyperlink{%s}{%s}"
> #+STARTUP: showall
> #+LATEX_HEADER: \usepackage[labelsep=period,labelfont=bf]{caption}
> #+LaTeX: \vspace*{-1cm}\vspace*{-2pt}
>
>
> * Four Related Issues with org-babel-tangle
>
> Running org-mode version 7.7 on both Gnu Emacs 23.2.1 and 24.0.50.1
> with Mac OS X 10.5.8 (with and without a -Q option), I encountered the
> following issues/problems/bugs when tangling files with code blocks
> for which the :comment header argument is org:
>
> 1. The subtree associated with the very first code block has
> its headline tangled without leading stars, but all subsequent
> sub-trees associated with code blocks have the leading stars
> included in the comments.
>
> 2. If the first code block comes before the first headline, the start
> of the comment text will be determined by pre-existing match data
> and thus will likely be incorrect.
>
> 3. Org structural elements such as headline stars, =#+= control lines,
> and drawers are included in the comments.
>
> 4. There is no way easy to delimit comment text or transform it that
> does not also change the structure of the org file (e.g., by adding
> headlines or source blocks).
>
> Issues 1 and 2 seem to be genuine bugs. Issues 3 and 4 are more
> subjective, I admit, but seem undesirable. Stars, drawers, and
> control lines are org structure rather than content, and they
> are often inappropriate in comments.
>
> To reproduce the behaviors for issues 1 and 3, look at the result of
> tangling this file. To reproduce issue 2 as well, remove the first two
> stars from this file and tangle again. Alternatively, within emacs,
> evaluate the =(buffer-substring ...)= sexp from the original-code code
> below at the beginning character of a source block. (You
> can also export this file to PDF for more pleasant reading.)
>
> Below, I give details on these issues and code for two fixes:
> a [[simple-fix][simple fix]] that handles the first two issues and the
> stars in the third,
> and a [[preferred-fix][better fix]] that handles all four issues in a more
> modular, customizable framework. I'd be interested in hearing
> feedback on all of this. If the reaction is positive, I will gladly
> submit a patch. Thanks for your consideration.
>
>
>
> * The Original Code and Details on the Problem
>
> The relevant section of the original code from org-mode version 7.7
> is shown below, comprising lines 344 through 357
> of ob-tangle.el in function =org-babel-tangle-collect-blocks=. With point
> is at a =#+begin_src=, it scans back either
> for a heading line or for the end of the previous code block,
> whichever comes later. The resulting region becomes the
> comment text.
>
> #+latex: \begin{figure}[h]
> #+latex: \hypertarget{original-code}{} % <<original-code>>
> #+source: original-code
> #+begin_src emacs-lisp
> (comment
> (when (or (string= "both" (cdr (assoc :comments params)))
> (string= "org" (cdr (assoc :comments params))))
> ;; from the previous heading or code-block end
> (buffer-substring
> (max (condition-case nil
> (save-excursion
> (org-back-to-heading t) (point))
> (error 0))
> (save-excursion
> (re-search-backward
> org-babel-src-block-regexp nil t)
> (match-end 0)))
> (point))))
> #+end_src
> #+latex: \caption{Original Code, lines 344--357 in {\tt ob-tangle.el}.}
> #+latex: \label{fig::original-code}
> #+latex: \end{figure}
>
> /Issue 1/. When in the first code block in the file, the
> second search fails (there is no previous code block),
> so the (match-end 0) call uses the match data from the (implicit) match
> during
> =org-back-to-heading=, which skips the stars. (Not a particularly
> transparent reference, incidentally.) For subsequent blocks, the
> =(match-end 0)= gives the end of the previous code block, which in these
> examples is earlier than the previous headline location.
>
> /Issue 2/. When the first code block lies before the first headline
> (say with some text before it), the searches fail in /both/ clauses of
> the max. So, the =match-end= will return an essentially arbitrary
> result, which is a bug.
>
> /Issue 3/. =org-back-to-heading= leaves point at the beginning of the
> stars, so a headline included in the text will have stars, except for
> the first one.
>
> /Issue 4/. Control lines at the end of the previous code block and
> before point are not filtered out and so are included in the comments.
>
>
> * A Simple Fix for the First Three Issues
>
> A small change addresses issues 1, 2, and the stars for issue 3: in
> both cases, simply use the =match-end= and replace 0 values with
> =(point-min)=. The latter gives a sensible result even if both
> computed positions are trivial (as when the first code block comes
> before the first headline) and respects narrowing.
>
> #+latex: \begin{figure}[h]
> #+latex: \hypertarget{simple-fix}{} % <<simple-fix>>
> #+begin_src emacs-lisp
> (comment
> (when (or (string= "both" (cdr (assoc :comments params)))
> (string= "org" (cdr (assoc :comments params))))
> ;; from the previous heading or code-block end
> (buffer-substring
> (max (condition-case nil
> (save-excursion
> (org-back-to-heading t) ; sets match data
> (match-end 0))
> (error (point-min)))
> (save-excursion
> (if (re-search-backward
> org-babel-src-block-regexp nil t)
> (match-end 0)
> (point-min))))
> (point))))
> #+end_src
> #+latex: \caption{Simple Fix, replacement for lines 344--357 in {\tt
> ob-tangle.el}.}
> #+latex: \label{fig::simple-fix}
> #+latex: \end{figure}
>
> * A Fix for All Four Issues
>
> A better fix that handles issues 1--4 starts with the region computed as
> in the [[simple-fix][simple fix]]
> and then processes that text through a user-configurable sequence of
> functions
> to derive the final form of the comment text.
>
> The following changes are required.
>
> ** Extract Initial Comment Text and State from Org Buffer
>
> The initial comment text ranges from either the most recent headline
> at the point after the stars, the beginning of the line after the
> =#+end_src= of the most recent code block, or the beginning of the
> buffer, whichever is later, through the line before the source
> block.[fn:1]
>
> The [[preferred-fix][code]] to extract this is given below.
> #+latex: (See Figure \ref{fig::preferred-fix}.)
> It replaces lines 344 through 357 of
> =ob-tangle.el= from org-mode version 7.7 in the function
> =org-babel-tangle-collect-blocks=.
>
> #+latex: \begin{figure}[h]
> #+latex: \hypertarget{preferred-fix}{} % <<preferred-fix>>
> #+begin_src emacs-lisp
> (comment
> (when (or (string= "both" (cdr (assoc :comments params)))
> (string= "org" (cdr (assoc :comments params))))
> (let* ((prev-heading
> (condition-case nil
> (save-excursion
> (org-back-to-heading t) ; sets match data
> (match-end 0))
> (error (point-min))))
> (end-of-prev-src-block
> (save-excursion
> (if (null (re-search-backward
> org-babel-src-block-regexp nil t))
> (point-min)
> (goto-char (match-end 0))
> (forward-line 1)
> (point))))
> (comment-start
> (max prev-heading end-of-prev-src-block))
> (comment-end
> (save-excursion
> (forward-line 0)
> (point)))
> (state
> (list (cons 'org-drawers
> org-drawers)
> (cons 'after-heading
> (= comment-start prev-heading))
> (cons 'first-line
> (= comment-start (point-min))))))
> (org-babel-process-comment-text
> (buffer-substring comment-start comment-end) state))))
> #+end_src
> #+latex: \caption{Better Fix, replacement for lines 344--357 in {\tt
> ob-tangle.el}.}
> #+latex: \label{fig::preferred-fix}
> #+latex: \end{figure}
>
> ** Adjust =org-babel-spec-to-string=
>
> The commment block collected by the [[original-code][original code]]
> #+latex: (Figure \ref{fig::original-code})
> in =org-babel-tangle-collect-blocks= is further processed in \newline
> =org-babel-spec-to-string= to trim leading and trailing whitespace
> from string. This was needed because spaces after a source block were
> included in the comment. In the revised code, however, this space
> trimming is handled during text transformation, except for removing
> trailing newlines. (Note: trailing /spaces/ are not removed to allow
> more flexibility in comment processing.) Hence,
> =org-babel-spec-to-string= needs to be slightly adjusted.
> #+latex: See Figure \ref{fig::spec-string-diff}.
>
> #+latex: \begin{figure}[h]
> #+begin_example
> --- ob-tangle.el 2011-09-14 11:48:26.000000000 -0400
> +++ new-ob-tangle.el 2011-09-14 11:55:56.000000000 -0400
> @@ -398,3 +398,3 @@
> (flet ((insert-comment (text)
> - (let ((text (org-babel-trim text)))
> + (let ((text (org-babel-chomp text "[\f\t\n\r\v]")))
> (when (and comments (not (string= comments "no"))
> #+end_example
> #+latex: \caption{Changes to {\tt org-spec-to-string} in {\tt
> ob-tangle.el}, unified diff, one line of context}
> #+latex: \label{fig::spec-string-diff}
> #+latex: \end{figure}
>
> ** Process Comment Text Through Sequence of Transforms
>
> At the end of the revised [[preferred-fix][comment collection code]],
> the comment text is passed to
> =org-babel-process-comment-text= which
> applies a sequence of transformation functions.
> #+latex: (See Figure \ref{fig::comment-transformation}.)
> The list of transformation functions is stored in a customizable
> variable described [[Define Customization Variable for
> Transforms][below]]. Several predefined transformations are
> given [[Define A Collection of Transform Functions][below]] as well.
>
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-process-comment-text (text &optional state)
> "Apply list of transforms to comment TEXT assuming bindings in alist
> STATE.
> Returns the modified text string, which may have text properties.
> See `org-babel-comment-processing-functions' for the transforms to be
> applied and details on the allowed keys in the STATE alist."
> (let ((funcs org-babel-comment-processing-functions))
> (with-temp-buffer
> (insert text)
> (let ((org-drawers
> (or (cdr (assoc 'org-drawers state))
> org-drawers))
> (after-heading
> (cdr (assoc 'after-heading state)))
> (first-line
> (cdr (assoc 'first-line state))))
> (while funcs
> (goto-char (point-min))
> (funcall (car funcs))
> (setq funcs (cdr funcs))))
> (buffer-substring (point-min) (point-max)))))
> #+end_src
> #+latex: \caption{Better Fix, comment transformation driver.}
> #+latex: \label{fig::comment-transformation}
> #+latex: \end{figure}
>
> ** Define Customization Variable for Transforms
>
> A list of nullary functions applied in order to
> the comment text. The text is inserted
> in a temporary buffer, so these functions can use
> the entire Emacs library for operating on buffer text.
> #+latex: See Figure \ref{fig::comment-transformation-function-list}.
>
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defcustom org-babel-comment-processing-functions
> '(org-babel-comment-delete-file-variables-line
> org-babel-comment-delete-org-control-lines
> org-babel-comment-delete-drawers
> org-babel-comment-trim-blank-lines
> org-babel-comment-trim-indent-prefix)
> "List of functions to transform source-block comment text before
> insertion.
> Each function will be called with no arguments with point at the
> beginning of a buffer containing only the comment text. Each
> function can modify the text at will and leave point anywhere,
> but it should *not* modify the narrowing state of the buffer.
> Several dynamic state variables are set prior to execution that
> each function can reference. These currently include:
>
> + org-drawers: names of drawers in the original org buffer.
> + from-heading: t if comment starts at an org heading line,
> nil otherwise.
> + first-line: t if initial comment starts on first line
> of the original org buffer, nil otherwise.
>
> If a function changes the value of these state variables, the new
> value will be seen by all following functions in the list, but
> this is not generally recommended.
>
> The functions in this list are called *in order*, and this order
> can influence the form of the resulting comment text."
> :group 'org-babel
> :type 'list)
> #+end_src
> #+latex: \caption{Better Fix, customizable transform list.}
> #+latex: \label{fig::comment-transformation-function-list}
> #+latex: \end{figure}
>
> ** Define A Collection of Transform Functions
>
> An advantage of this design is that transformation of the
> comments is modular and customizable. We can include in
> =ob-tangle.el= a collection of pre-defined transforms.
> The default processing stream in =org-babel-comment-processing-functions=
> is as follows:
>
> 1. Delete a file variables if on the first line of the buffer.
> 2. Delete all drawers and their contents.
> 3. Delete all org control lines from the comment text.
> 4. Trim blank lines from the beginning and end.
> 5. Reindent the text by removing the longest common leading
> string of spaces.
>
> #+ TANGLE: end-comment
> These and several other useful transforms are given below
> (e.g., deleting drawer delimiters but not contents)..
> #+latex: See Figures \ref{fig::transformA}--\ref{fig::transformZ}.
> It is easy to define new transforms; any function that
> operates on text in the current buffer beginning at point-min
> will work.
>
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-delete-file-variables-line ()
> "Delete file variables comment line if at beginning of buffer.
> This only checks the first line of the buffer, and so should be
> placed first (or at least early enough) in the list
> `org-babel-comment-processing-functions' to ensure that the no
> other text has been inserted earlier."
> (when (and first-line
> (looking-at ; file-variables line
> "^#[ \t]*-\\*-.*:.*;[ \t]**-\\*-[ \t]*$"))
> (let ((kill-whole-line t))
> (kill-line))))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \label{fig::transformA}
> #+latex: \end{figure}
>
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-delete-org-control-lines ()
> "Remove all org #+ control lines from comment."
> (let ((control-regexp "^[ \t]*#\\+.*$"))
> (delete-matching-lines control-regexp)))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \end{figure}
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-delete-org-in-buffer-settings ()
> "Remove all org #+ in-buffer setting lines, leaving other control
> lines.
> In-buffer setting lines begin with #+ and have all caps keyword
> names."
> (let ((setting-regexp "^#\\+[ \t]*[A-Z_]+:.*$"))
> (delete-matching-lines setting-regexp)))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \end{figure}
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-delete-drawers ()
> "Delete drawer delimiters and contents from comment.
> Drawer names are restricted to those in the `org-drawers' state."
> (let ((drawer-start-regexp
> (format "^[ \t]*:\\(?:%s\\):[ \t]*$"
> (mapconcat 'identity
> org-drawers
> "\\|")))
> (drawer-end-regexp "^[ \t]*:END:[ \t]*$"))
> (while (re-search-forward drawer-start-regexp nil t)
> (let ((beg (save-excursion
> (forward-line 0)
> (point)))
> (end (save-excursion
> (re-search-forward drawer-end-regexp nil t)
> (forward-line 1)
> (point))))
> (goto-char end)
> (delete-region beg end)))))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \end{figure}
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-delete-drawer-delimiters ()
> "Delete drawer delimiters from comment leaving content.
> Drawer names are restricted to those given by the `org-drawers'
> state."
> (let ((drawer-delim-regexp
> (format "^[ \t]*:\\(?:%s\\)"
> (mapconcat 'identity
> (cons "END" org-drawers)
> "\\|"))))
> (delete-matching-lines drawer-delim-regexp)))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \end{figure}
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-trim-blank-lines ()
> "Trim whitespace-only lines from beginning and end of text."
> (while (and (looking-at "^[ \t\f]*$")
> (< (point) (point-max)))
> (forward-line 1))
> (delete-region (point-min) (point))
> (when (< (point) (point-max))
> (goto-char (point-max))
> (let ((last-point (point)))
> (forward-line 0)
> (while (and (looking-at "^[ \t\f]*$")
> (> (point) (point-min)))
> (setq last-point (point))
> (forward-line -1))
> (delete-region last-point (point-max)))))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \end{figure}
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-trim-indent-prefix ()
> "Remove longest common leading prefix of spaces from each line of
> TEXT.
> Prefix is computed from the initial whitespace on each line with
> tabs converted to spaces, preserving indentation."
> (let* ((common-indent nil)
> (common-length (1+ (- (point-max) (point-min))))
> (current-indent "") ; enter first loop
> (current-length common-length)) ; skip first assignment
> (goto-char (point-min))
> (while current-indent
> (when (< current-length common-length)
> (setq common-indent current-indent
> common-length current-length))
> (setq current-indent
> (let* ((found (re-search-forward "^\\([ \t]*\\)\\S-" nil
> t))
> (bol (match-beginning 0))
> (eos (match-end 1))
> (space-str (match-string 1))
> (indent-tabs-mode nil))
> (cond
> ((not found)
> nil)
> ((not (string-match "\t" space-str))
> space-str)
> (t ; detabify indent string
> (goto-char eos)
> (let ((col (current-column)))
> (delete-region bol eos)
> (indent-to col))
> (buffer-substring-no-properties bol (point))))))
> (setq current-length (length current-indent)))
> (when (and common-indent (> common-length 0))
> (let ((indent-re (concat "^" common-indent)))
> (goto-char (point-min))
> (while (re-search-forward indent-re nil t)
> (replace-match "" nil nil))))))
> #+end_src
> #+latex: \caption{Comment Transform.}
> #+latex: \label{fig::transformZ}
> #+latex: \end{figure}
> #+latex: \end{itemize}
> #+latex: \noindent
> This kind of customization offers some nice possibilities,
> including controlling indentation, eliminating or
> transforming org markup, eliminating trailing whitespace, and
> automating specialized comment formatting (e.g., javadoc). As
> an additional illustration, consider the transform
> =org-babel-comment-restrict-comment-range=
> #+latex: in Figure \ref{fig::transform-illustration}
> below. The idea is that it is sometimes useful to select
> from the text under a headline a /part/ of the text for
> the comment. We want some org markup that will not affect
> either the export or the structure of the org file itself.
> To do this, we use the fact that =#+=\textvisiblespace
> lines are not exported.[fn:2] So, we can /de facto/ use the
> =#+ TANGLE:= construct to control various aspects of tangling.
> Here, we use the =#+ TANGLE: start-comment= and
> =#+ TANGLE: end-comment= to delimit the comment text.
> (This function needs to come earlier in the function list than the
> functions that eliminate org control lines. It is sufficient to
> prepend it to that list.) This is used in the current file,
> for example.
>
> #+latex: \begin{figure}[h]
> #+begin_src emacs-lisp
> (defun org-babel-comment-restrict-comment-range ()
> "Remove all comment text outside start-comment and end-comment
> delimiters.
> Comment delimiters are #+TANGLE lines with respective keywords
> start-comment and end-comment. THE #+TANGLE lines are also
> deleted. To be effective, this function should be positioned in
> the list `org-babel-comment-processing-functions' before any
> functions that remove org control lines or process other
> co-occuring attributes of #+TANGLE lines."
> (when (re-search-forward "^[ \t]*#\\+[ \t]*TANGLE:.*start-comment.*$"
> nil t)
> (forward-line 1)
> (delete-region (point-min) (point)))
> (when (re-search-forward "^[ \t]*#\\+[ \t]*TANGLE:.*end-comment.*$"
> nil t)
> (forward-line 0)
> (delete-region (point) (point-max))))
> #+end_src
> #+latex: \caption{Transform to illustrate some customization
> possibilities.}
> #+latex: \label{fig::transform-illustration}
> #+latex: \end{figure}
> #+latex: \begin{itemize}
>
> [fn:1] In the original code and in the simple fix above, the comment
> starts /immediately/ after the =#+end_src= rather than at the start of
> the next line. Starting at the next line seems more natural to me
> because the comment being constructed relates to the /following/ code
> block. But the original behavior is easily restored if people disagree.
>
> [fn:2] A feature request: I would propose that the =#+tangle:= construct
> be recognized as non-exported even with spaces preceding the =#= and no
> spaces after the =+=. This would enable a variety of interesting
> customization for tangled comments. Alternatively, a generic construct
> such as =#+noop:= or =#+generic:= could be a valuable for user-based
> tags in an org file that serves a similar purpose -- allow customized
> processing without directly being exported.
>
>