[Gnumed-devel] (no subject)

From: syan tan
Subject: [Gnumed-devel] (no subject)
Date: Mon, 22 Sep 2003 20:42:46 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030313


Scary language bug (order of precedence), given that berkeley-db claims 2 x 10^8 installations

Another recent challenging problem occurred while running
our test suite on an embedded system. A handful of tests
were failing an assertion while acquiring a self-blocking
mutex lock, because the locking code was unexpectedly
returning an EDEADLK error. This particular code is one of
the few places where we use self-blocking mutexes.
In DB, the code to allocate and initialize a mutex takes
an argument for flags. Some of the flags affect the
mutex itself, such as the one indicating that this is a
self-blocking mutex. Others affect the allocation code,
such as the one indicating whether the shared memory region
(where the mutex is allocated) needs locking or is
already locked by the calling function. Therefore, this
mutex code looked like this:
if (we need a new mutex){
    __db_mutex_setup(..., &m,
        (SELF_BLOCK | is_locked ?
        NO_LOCK : 0));
... MUTEX_LOCK(m);
It was the second call to MUTEX_LOCK that returned
the EDEADLK instead of blocking as expected. So,
why is this failing, and only on this one system and
nowhere else? The possibilities included:
1. This system uses pthread mutexes. Most systems
we have use test-and-set mutexes. Perhaps there was a
bug in our pthread mutex code.
2. Since self-blocking mutexes are not frequently used,
perhaps we are hitting a bug in the system's pthread
implementation.
3. It is something else.
Given that the test suite was only failing on this system
and no other in this way, our tendency was to think
option #2 was the most likely cause. Option #1 was a
possibility but that code is extremely stable in DB and
has been virtually unchanged for years.
Fortunately, we have a multi-threaded mutex test
application that directly calls the DB mutex code. After
easily porting that to the embedded system, and many
successful runs, we concluded that the mutex code
worked as expected (both DB's and the system's), so the
failure had to be due to option #3 above, and we were
almost back where we started. Additional, fairly painful,
debugging yielded the true bug, and it is in the code
snippet above. The SELF_BLOCK flag was never getting
passed into the setup function because of a misplaced
parenthesis and operator precedence. In C, the | operator
binds more tightly than ?:, so the argument was parsed as
(SELF_BLOCK | is_locked) ? NO_LOCK : 0. Because SELF_BLOCK
is non-zero, the condition was always true and the setup
function always received NO_LOCK alone. The correct code
must read:
     if (we need a new mutex){
         __db_mutex_setup(..., &m,
               SELF_BLOCK |
             (is_locked ? NO_LOCK : 0));
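
The difference between the two expressions can be checked in
isolation. The following is only a minimal sketch: the flag
values 0x1 and 0x2 and the helper names buggy_flags and
fixed_flags are assumptions made up for illustration, since the
text does not give the real DB constants.

```c
#include <stdio.h>

/* Hypothetical flag values, for illustration only. */
#define SELF_BLOCK 0x1
#define NO_LOCK    0x2

/* Same shape as the buggy call: '|' binds tighter than '?:',
 * so this is (SELF_BLOCK | is_locked) ? NO_LOCK : 0. */
int buggy_flags(int is_locked)
{
    return (SELF_BLOCK | is_locked ? NO_LOCK : 0);
}

/* Same shape as the corrected call: the conditional is
 * evaluated first, then OR-ed with SELF_BLOCK. */
int fixed_flags(int is_locked)
{
    return SELF_BLOCK | (is_locked ? NO_LOCK : 0);
}

/* Print both results for one value of is_locked. */
void print_flags(int is_locked)
{
    printf("is_locked=%d buggy=0x%x fixed=0x%x\n",
           is_locked,
           (unsigned)buggy_flags(is_locked),
           (unsigned)fixed_flags(is_locked));
}
```

Calling print_flags(0) and print_flags(1) from a main function
shows that, under these assumed values, the buggy form never has
the SELF_BLOCK bit set, whatever is_locked is.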
Debugging on this particular embedded platform is not
very easy, so working through this problem was more
difficult than it would normally be. Afterwards, a few
questions needed to be answered:
1. Why did this problem only show itself on this one
system and nowhere else? Almost all other systems use
test-and-set mutexes, which don't use the pthread code.
The test-and-set code ignores the SELF_BLOCK flag.
The other system we have using pthread mutexes
used a different code path.
2. What did we learn? The lessons learned here are that
it is important to run the test suite on every system
possible and follow up vigorously with all problems. A few
times during this debugging, which took a couple of
days, we were ready to simply assume it was a system
problem and move on. Thankfully we resisted that urge.
