[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [O] Erroneous "No such file or directory" with babel and remote dir

From: Nick Dokos
Subject: Re: [O] Erroneous "No such file or directory" with babel and remote dir
Date: Thu, 15 Nov 2012 01:55:29 -0500

In a previous mail, I wrote:

| Which tells me that the tmp file error is a red herring and the real
| breakage occurred after 5cb80c7, probably through a commit that touched
| ob-sh.el (although that's far from guaranteed). Here's that list:
| $ git log --oneline -- lisp/ob-sh.el
| 86e515d fix remote execution w/empty shebang header arg
| 70dd119 Massive code clean-up.
| 966447c Don't use `org-flet' in ob-awk.el and ob-sh.el
| 8eb5843 Add punctuation at the end of the first line of docstrings.  Code 
| 63b5f8f replace flet/labels with org-flet/org-labels
| ecd0562 Fix the master branch.
| 6e306f6 Fix copyright years in maint.
| de42649 Manually revert maint to e85080.
| 73bb18b Manually revert to the Release 7.8.04 tag.
| 38c5045 Fix copyright years.
| 6e534f9 Manually revert back to commit e85080.
| 5cb80c7 apply :shebang and :padline to shell script execution
| ...

Continuing along these lines, I think I've figured it out and it's not

The executive summary is that the last commit in the list introduced a
small bug that was fixed by the first commit in the list. But neither of
those has much to do with the remote-dir breakage (except that the bug
made things harder to bisect). The principal culprits are two sets of
commits for code transformations that were supposed to do nothing
functionally: they were just supposed to get away from using flet/labels
(which are deprecated and will be obsolete by emacs version 24.3.)

The end result is that one change fixes the remote dir problem in
Loris's example. I'm not sure that it solves every such problem though:
I haven't audited all the code.

The change is a one-liner: in org-babel-shell-command-on-region,
replace the line

                (call-process-region start end shell-file-name t
                (funcall call-process-region start end shell-file-name t

BTW, I'm using Org-mode version 7.9.2
(release_7.9.2-582-g6d099e.dirty @ /home/nick/elisp/org-mode/lisp/).

The rest of the email explains why the change is needed (in excruciating
detail: grab a beer or maybe a cup of coffee before starting on it.)

I hope this is (mostly) correct but corrections would be more than
welcome. The problem seems multi-faceted and therefore needs as many
eyes on it as possible.


P.S. I'm not sure whether to thank Loris or to curse him for pushing me
on this path, but there is no question that he is responsible for
finding the bug, providing the reproducer and then beating on the gates
with pitchfork and torch :-)


* Root cause

The root cause of the problem was a set of code transformations that were
supposed to leave the functionality intact. The code transformations
were driven by the need to replace the flet/labels constructs which were
declared obsolete (as of 24.3 - they are still available, but cause warnings
to be issued).

The problem was hard to find because there were four commits (at widely varying
times) that contributed to various manifestations of the problem and
it was difficult to bisect. In chronological order, they were

- commit 5cb80c7 apply :shebang and :padline to shell script execution

  This did not cause the problem, but it introduced a bug that causes
  errors when executing a source block with a remote :dir spec. That
  confused the issue (at least it confused me: I fingered this as the
  culprit in a bisect, but it was only guilty of the bug fixed by
  commit 86e515d - see below -, not of the remote-dir problem).

- commit 63b5f8f replace flet/labels with org-flet/org-labels

  Section [[flet --> org-flet]] describes this.
- commit e85479a Don't use `org-flet' in some functions

  and several others that slowly got rid of org-flet in favor of let,
  and org-labels (somehow - I didn't check this carefully), the latter
  reverted and reapplied, presumably because problems were found and
  fixed in several iterations. I've only skimmed the surface here: I
  think this has the potential to be a minefield of problems waiting to
  explode - see the [[org-flet --> let]] section below.

- commit 86e515d fix remote execution w/empty shebang header arg

  This finally fixed the little bug that was introduced by 5cb80c7.

The first and the last of these commits are irrelevant to the remote-dir
problem, except that the bug gets in the way of testing: any version
later than 5cb80c7 exhibits the bug and that bug hides whether the
remote-dir problem is present or not.

The general procedure I followed was to make a branch with some commit
as its tip and then manually apply the patch of commit 86e515d. Only
then could I test for the remote-dir problem. In the following, when
I say commit X, I mean commit X *plus* the manually applied patch
from 86e515d.

** flet --> org-flet 
# <<flet --> org-flet>>

The first set of code transformations (implemented as commit
63b5f8f2e85b3059a2d30041db6939347a7a2d7d) dealt with the situation by
doing a mass substitution: flet --> org-flet and labels --> org-labels
(and in at least one case, flet --> org-labels to deal with a
recursive definition - I presume that was a preexisting bug that was
fixed by this substitution) and adding compatibility aliases in
org-compat.el to use the cl-flet/cl-labels macros from cl.el in emacs
versions >= 24.1.50.

I found out that already this broke the remote dir
functionality. Since this is a large but straightforward commit, I
split it up into separate patches for each affected file and applied
each patch in order: org-compat.el implements the org-flet/org-labels
functionality, but since the other patches were not applied yet, I was
still using flet/labels (obsolete but still present).  I then applied
the patch to org-macs.el, org.el and ob.el, testing after each one. At
first, things were working but applying the patch to ob.el broke
remote-dir. I then determined that most of the transformation in ob.el
were benign (at least as far as *this* problem is concerned), except
one: when I reverted that single one from org-flet back to flet,
remote-dir started working again.

The "bad" transformation was in this code segment in
org-babel-execute-src-block around line 544 in ob.el (but before you
go looking for it, be warned: you will not find it if you have a
reasonably current version of org - read the next section):

            (flet ((call-process-region (&rest args)
                    (apply 'org-babel-tramp-handle-call-process-region args)))
              (org-flet ((lang-check (f)
                       (let ((f (intern (concat "org-babel-execute:" f))))
                         (when (fboundp f) f))))

If I left the first flet alone, remote-dir worked; if I change it to
org-flet it does not.

Here's the procedure I used to prove this to my satisfaction, just in
case anybody wants to try to duplicate my results:

     - create a branch with 63b5f8f as its tip and switch to it:

       git checkout -b foo 63b5f8f

     - make a patch for the "little" bug and apply it:

       git show 86e515d > patch.to.apply
       git apply patch.to.apply

     - Put Loris's example in a file loris.org (with appropriate modifications -
       user id and remote machine name):

       #+BEGIN_SRC sh :dir /address@hidden:

     - Add a (require 'ob-sh) to your minimal .emacs

     - Start emacs with

       emacs -q -l /path/to/minimal/.emacs /path/to/loris.org

       and C-c C-c in the code block. If it behaves as I expect,
       it should print the local hostname, not the remote one.

     - Now edit lisp/ob.el and change that single org-flet back to an flet
       (remember you are in a branch, so changes here won't affect the files
       in other branches) and test again: it should now print the remote

     - Before you can switch to some other branch, you'll need to either
       commit these changes or throw them away - I'll leave that up to you.

So the moral of the story is that the code transformations have *not*
left functionality unchanged. Something went awry but to be honest, I
don't know what. I didn't spend much time on it because of what I
found out next.

** org-flet --> let
# <<org-flet --> let>>
The second set of code transformations is more difficult to describe
because it's not just a (more or less) straight substitution: it tries
to completely eliminate the need for an flet-like construct and
replace it with a simple let.

The trouble here is that in many, but by no means all lisps, a symbol
can have *two* bindings - emacs lisp is one such lisp). The two are a
function binding and a value binding. Which binding is used depends on
the context: (f a b c) is usually rendered as "call the function f with
arguments a, b and c" but it really says "look up the function binding
of the symbol f and apply the resulting function to the values obtained
by looking up the value bindings of the symbols a, b and c". setq is the
usual way to set the value binding of a symbol and defun is the usual
way to set the function binding of a symbol. So you could do the
following (although it's probably a bad idea):

(defun f (x)
   (* x x))

to set the function binding of the symbol f and

(setq f 3)

to set the value binding of the symbol f. Then you could do

(f f)

calling the function in the function binding of f with argument the
value binding of the symbol f. Since the former is the squaring
function and the latter is the value 3, the result is 9. That may
be confusing but it is legal.

Now the *values* bound to symbols can be locally overridden with `let'
and the functions bound to symbols can be locally overridden with
`flet'. But now that `flet' is going away, how do you override the
function binding using `let'? Without `flet' (or something similar),
you just cannot: trying to use `let' to change the function binding
of a symbol is impossible. So how exactly is this second set of
transformations supposed to work?

It uses `let' to locally override the *value* binding of a symbol
with a *function*:

(defun f (x y z))
   (let ((g (lambda (x) (* x x))))

would bind the squaring function to the symbol g - but it is 
the value binding of g, not its function binding. When the time
comes to call that function, you cannot just say

(defun f (x y z)
   (let ((g (lambda (x) (* x x))))
       (+ (g x) (g y) (g z))))

There is no function binding for the symbol g so you cannot use it in
the function position. That's where funcall (or apply in a different
context) can help. The functtion  can be written like this

(defun f (x y z)
   (let ((g (lambda (x) (* x x))))
       (+ (funcall g x)) (funcall g y) (funcall g z)))

That's the price one has to pay in order to eliminate flet and replace
it with let.

As you can see, this is a rather intrusive transformation. And it's even
worse: an flet in some function would define a function that could be
passed down an arbitrarily long call chain and then called at the lowest
level. If you use let, you can still pass the symbol down the chain but
when it is time to call the function, you need to use the funcall trick.
And even more obscurely, you don't have to pass it down explicitly: dynamic
binding will take care of finding the definition of the symbol an arbitrary
distance up the call chain. But if you have let-bound a function to a symbol,
you have to funcall it explicitly. And if the symbol *has* a function value
already, then calling it directly will also work but it will call a completely
different function!

That's an easy thing to overlook and that's exactly what happened here
I believe:

In org-babel-execute-src-block, the code looks like this:

            (let ((call-process-region
                   (lambda (&rest args)
                     (apply 'org-babel-tramp-handle-call-process-region args))))
              (let ((lang-check (lambda (f)
                                  (let ((f (intern (concat "org-babel-execute:" 
                                    (when (fboundp f) f)))))
                (setq cmd
                      (or (funcall lang-check lang)
                          (funcall lang-check (symbol-name
                                               (cdr (assoc lang 
                          (error "No org-babel-execute function for %s!" 


                (setq result
                      ((lambda (result)
                         (if (and (eq (cdr (assoc :result-type params)) 'value)
                                  (or (member "vector" result-params)
                                      (member "table" result-params))
                                  (not (listp result)))
                             (list (list result)) result))
                       (funcall cmd body params)))

Here we get the src block language (sh in Loris's example) and construct
a symbol based on it (org-babel-execute:sh in this case). Because of the
let, lang-check cannot be called as a function: it needs to be
funcall'ed. It binds cmd to the symbol org-babel-execute-sh and later
funcalls it. But note that call-process-region is also let-bound to a
function. So we have the call chain

org-babel-execute-src-block --> [through (funcall cmd body params)]
  org-babel-execute:sh --> 
    org-babel-sh-evaluate -->
      org-babel-eval -->
        org-babel-shell-command-on-region -->

And here we have a symbol that has a function binding as well as a value
binding (also a function, but a *different* function). The symbol is used
in the function position of the function application so the function
binding (i.e the usual call-process-region function) is used. It runs
hostname locally - Loris loses.

The correct thing to do (in this particular case at least) is to funcall
the symbol at the lowest level. That would call the let-bound function
(org-babel-tramp-handle-call-process-region) which would call hostname
remotely (making Loris happy).

Is it always correct to do that? I don't know but I suspect not: it
would probably be safer to have an org symbol
(org-call-process-region-function maybe) that's value-set to the
standard call-process-region function, but can then be let-bound and
dynamically passed all over the place.

And is this the only problem? Probably not: every flet->let
transformation would have to be scrutinized.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]