Misleading error message from lt_dlopen()

From: Jeff Squyres
Subject: Misleading error message from lt_dlopen()
Date: Thu, 23 Oct 2008 09:35:04 -0400


We have run across a misleading error message from lt_dlerror() after a failed lt_dlopenadvise() in LT 2.2.6a that caused considerable confusion for some developers in the Open MPI project for a while; I had to step through lt_dlopen() to figure out what was going on.

Open MPI uses lots of DSO plugins, and we use lt_dlopenadvise() to open them. One of our developers is working in a temp branch creating some new functionality, including some new DSOs. However, lt_dlopenadvise() was returning NULL for one of his new DSOs, and lt_dlerror() was returning "file not found". We could clearly see that the .la and .so files for the DSO were in the right place in the filesystem, and the string filename we were passing to lt_dlopen() was correct. The DSO has no obscure library dependencies; ldd showed that all of them are present. So how could the error be "file not found"?

(I was doing my testing on an RHEL4U4 system, but I think the problem is a bit more generic)

I stepped through lt_dlopen() and discovered the real error: his DSO was referencing a symbol that didn't exist, and therefore the underlying dlopen() failed. dlopen.c:198 correctly called dlerror() and LT__SETERRORSTR() to set the error string to "/home/jsquyres/bogus/lib/openmpi/mca_routed_binomial.so: undefined symbol: orte_routed_tree_t_class" (this is the real error), and returned NULL for the module.

So far, so good.

But then tryall_dlopen() advances on to the next loader -- lt_preopen. It [predictably] fails because we have not preopened this DSO. But then preopen.c:188 calls LT__SETERROR(FILE_NOT_FOUND). This is now the last error reported, but it really isn't accurate.

Later, as the stack is unwinding, lt_dlopenadvise(), in ltdl.c:1664 *also* calls:

  /* Still here?  Then we really did fail to locate any of the file
     names we tried.  */
  LT__SETERROR (FILE_NOT_FOUND);
  return 0;

Which also sets the last error reported string to "file not found". But this seems clearly wrong (at least in this case): the fact that we're falling out through the error case in lt_dlopenadvise() does *not* indicate that the file was not found -- it just means that nothing was successfully loaded. The real error can be (and is, in this case) something else.
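To make the sequence above concrete, here is a minimal sketch of a loader chain that shares a single "last error" slot. It is not libltdl's code: every name in it (set_error, sim_dlopen_loader, sim_preopen_loader, sim_try_all, sim_open, sim_last_error) is a hypothetical stand-in, and the error strings are shortened for illustration. It only models the overwrite pattern described above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a single shared "last error" slot. */
static const char *last_error = NULL;

static void set_error(const char *msg) { last_error = msg; }

/* Hypothetical loader signature: returns a non-NULL handle on success. */
typedef void *(*loader_fn)(const char *filename);

/* Models the dlopen-based loader: the file exists, but a symbol is missing. */
static void *sim_dlopen_loader(const char *filename) {
    (void)filename;
    set_error("undefined symbol: orte_routed_tree_t_class");
    return NULL;
}

/* Models the preopen loader: the module was never preopened. */
static void *sim_preopen_loader(const char *filename) {
    (void)filename;
    set_error("file not found");
    return NULL;
}

/* Try every loader in order; each failure overwrites the previous error. */
static void *sim_try_all(const char *filename) {
    loader_fn loaders[] = { sim_dlopen_loader, sim_preopen_loader };
    size_t n = sizeof loaders / sizeof loaders[0];
    for (size_t i = 0; i < n; i++) {
        void *handle = loaders[i](filename);
        if (handle != NULL) {
            return handle;
        }
    }
    return NULL;
}

/* Top-level open: falls through to a generic error when nothing loaded. */
static void *sim_open(const char *filename) {
    void *handle = sim_try_all(filename);
    if (handle == NULL) {
        set_error("file not found");
    }
    return handle;
}

/* Returns whatever message was stored last. */
static const char *sim_last_error(void) { return last_error; }
```

In this sketch, calling sim_open("mca_routed_binomial.so") returns NULL and sim_last_error() then yields "file not found", even though the first loader recorded the undefined-symbol message.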

I realize that this is a somewhat complex issue because libltdl has a generic loader engine and it's just reporting the "last" error. So I don't know what the right solution is, but from the perspective of someone who is using libltdl, I would much rather have the "missing symbol" error reported than the misleading "file not found" [non-]error.
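One possible policy, purely as an illustration and not a proposed patch, is to stop a generic fallback from replacing a more specific error that is already recorded. The sketch below assumes a hypothetical error slot (struct err_slot) with two hypothetical setters; none of these names exist in libltdl.

```c
#include <stddef.h>

/* Hypothetical error slot that remembers whether its message is specific. */
struct err_slot {
    const char *msg;
    int specific;   /* nonzero once a concrete failure has been recorded */
};

/* Record a concrete failure, e.g. a string obtained from dlerror(). */
static void err_set_specific(struct err_slot *e, const char *msg) {
    e->msg = msg;
    e->specific = 1;
}

/* Record a generic fallback only if nothing more specific is stored. */
static void err_set_generic(struct err_slot *e, const char *msg) {
    if (!e->specific) {
        e->msg = msg;
    }
}
```

With this scheme, a later "file not found" fallback leaves an earlier undefined-symbol message in place, while a slot that never saw a specific error still ends up reporting "file not found". Whether such a policy fits libltdl's loader design is an open question.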

FWIW: prior versions of libltdl *did* report the "missing symbol" error properly, so one could actually consider this a regression against prior behavior.

Jeff Squyres
Cisco Systems
