|
From: | Jacob Bachmeyer |
Subject: | bug#47795: utils.test fails intermittently |
Date: | Thu, 29 Apr 2021 22:15:53 -0500 |
User-agent: | Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.22) Gecko/20090807 MultiZilla/1.8.3.4e SeaMonkey/1.1.17 Mnenhy/0.7.6.0 |
Rainer Orth wrote:
I just noticed that one part of the runtest testsuite fails intermittently on both Solaris 10 and 11 on the 1.6.3 branch. It's easy to miss because it only manifests itself in runtest.{sum,log} as [...] PASS: testsuite file creates new implied object directory ERROR: /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test did not complete and === runtest Summary ===-# of expected passes 268+# of expected passes 294 Unless you know how many passes to expect, you likely miss the problem.
This is a long-standing issue with DejaGnu: the framework does not know how many tests to expect, so missing tests just go, well, /missing/.
This is an architectural limitation in DejaGnu, so the closest to an actual solution while preserving backwards compatibility is to improve error handling to more reliably report that /something/ went wrong, even if we have no way to state how many tests were not run. Optional "test manifest" features, allowing testsuites to provide the framework with a list of expected tests, are on my local TODO list, but I have not yet found a good way to generalize them to DejaGnu proper. It is possible that some other code consequent to the test grouping features that will accompany the new XML output format in 1.7 will pave the way to a clean API for this, but I am not yet sure of that.
Even when running runtest -v -v instead, neither the runtest output @@ -157,8 +157,7 @@ Processing test /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/testsuite_file.test Running /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test ... Processing test /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test -WARNED: diff1.txt doesn't exist - +ERROR: /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test did not complete nor runtest.log show much: Running /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test ... Processing test /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test interp delete test_case interp create test_case interp create test_case expect1.30> test_case eval {foreach { n v }} [list [array get default_vars]] {{ set $n $v }} test_case eval {foreach { n v }} [list [array get default_vars]] {{ set $n $v }} test_case^M^M expect1.31> test_case eval source /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test test_case eval source /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test expect1.32> ERROR: /vol/src/gnu/dejagnu/dejagnu-1.6.3-branch/local/testsuite/runtest.libs/utils.test did not complete The test does succeed when running runtest --tool runtest in a row about 2 out of 10 times.
There is a possible clue here: it appears that we may be losing synchronization: the "interp create test_case" line should be emitted in response to an expect1.*> prompt which seems to be missing in your log, yet the preceding "interp delete test_case" is included. Do I guess correctly that the line preceding that snippet was another "interp delete test_case"? Particularly interesting in that your log seems to show "test_case" returned for the "test_case eval ... foreach ..." line, but that line should not produce a response -- the "test_case" response should be from the previous "interp create test_case" command.
In fact, examining process_test in testsuite/runtest.libs/libs.exp which drives these tests, the error that you are seeing is exactly what is produced when an Expect prompt arrives when the driver is expecting to see test results. Losing synchronization could cause this, although that check was added to detect the case where the slave interpreter aborts due to a Tcl error in the inner test script.
Assuming the "lost sync" hypothesis, when the failure occurs, the previous command is still "in-flight" somewhere... and the "1.32" prompt is when the "test_case eval source ..." command is given in my logs... but your logs show it being issued at the "1.31" prompt... is an "expect1./N/>" prompt duplicated somewhere earlier in the log?
It is also possible that an earlier "expect" (command in process_test) is timing out while waiting for an "expect*>" prompt from the child Expect. My development box is relatively lightly loaded; what are the typical load averages on the system where the problem occurs?
The code in testsuite/runtest.libs/libs.exp is mostly my doing at this point, and is probably some of the earlier Expect code that I have written. If there is a bug in there that only affects heavily-loaded systems, it is probably my then-inexperience showing. :-)
I'm a bit at a loss how to investigate this further.
I have only now found a possible explanation, after sitting on this for about two weeks.
-- Jacob
[Prev in Thread] | Current Thread | [Next in Thread] |