qemu-devel

Re: [PATCH v3 0/7] Switch iotests to using Async QMP


From: John Snow
Subject: Re: [PATCH v3 0/7] Switch iotests to using Async QMP
Date: Wed, 13 Oct 2021 12:35:27 -0400



On Wed, Oct 13, 2021 at 10:49 AM Hanna Reitz <hreitz@redhat.com> wrote:
On 13.10.21 16:00, John Snow wrote:
>
>
> On Wed, Oct 13, 2021 at 8:51 AM John Snow <jsnow@redhat.com> wrote:
>
>
>
>     On Wed, Oct 13, 2021 at 4:45 AM Hanna Reitz <hreitz@redhat.com> wrote:
>
>         On 13.10.21 00:34, John Snow wrote:
>         > Based-on: <20211012214152.802483-1-jsnow@redhat.com>
>         >            [PULL 00/10] Python patches
>         > GitLab: https://gitlab.com/jsnow/qemu/-/commits/python-aqmp-iotest-wrapper
>         > CI: https://gitlab.com/jsnow/qemu/-/pipelines/387210591
>         >
>         > Hiya,
>         >
>         > This series continues where the last two AQMP series left off
>         > and adds a synchronous 'legacy' wrapper around the new AQMP
>         > interface, then drops it straight into iotests to prove that
>         > AQMP is functional and totally cool and fine. The disruption
>         > and churn to iotests is pretty minimal.
>         >
>         > In the event that a regression happens and I am not physically
>         > proximate to inflict damage upon, one may set the
>         > QEMU_PYTHON_LEGACY_QMP variable to any non-empty string as it
>         > pleases you to engage the QMP machinery you are used to.
>         >
>         > I'd like to try and get this committed early in the 6.2
>         > development cycle to give ample time to smooth over any possible
>         > regressions. I've tested it locally and via gitlab CI, across
>         > Python versions 3.6 through 3.10, and "worksforme". If something
>         > bad happens, we can revert the actual switch-flip very trivially.
>
>         So running iotests locally, I got one failure:
>
>         $ TEST_DIR=/tmp/vdi-tests ./check -c writethrough -vdi 300
>         [...]
>         300                             fail       [10:28:06] [10:28:11]   5.1s                 output mismatch (see 300.out.bad)
>         --- /home/maxx/projects/qemu/tests/qemu-iotests/300.out
>         +++ 300.out.bad
>         @@ -1,4 +1,5 @@
>         -.......................................
>         +..............ERROR:qemu.aqmp.qmp_client.qemu-b-222963:Task.Reader: ConnectionResetError: [Errno 104] Connection reset by peer
>         +.........................
>           ----------------------------------------------------------------------
>           Ran 39 tests
>         [...]
>
>
>     Oh, unfortunate.
>
>
>         I’m afraid I can’t really give a reproducer or anything.  It
>         feels like
>
>
>     Thank you for the report!
>
>         just some random spurious timing-related error. Although then
>         again, 300 does have an `except machine.AbnormalShutdown` clause
>         at one point...  So perhaps that’s the culprit, and we need to
>         disable logging there.
>
>
>     I'll investigate!
>
>
> Unfortunately, even in a loop some 150 times I couldn't reproduce this
> one. As you point out, it appears to be just a failure caused by
> logging. The test logic itself completes as expected.
>
> Still, I would expect a "clean" shutdown of the destination host
> (where the destination process fails to load the migration stream and
> voluntarily exits with an error code) to end with a FIN/ACK for TCP or
> ... uh, whatever happens for a UNIX socket. Where's the Connection
> Reset coming from? Did the destination VM process *crash*?
>
> I'm not so sure that I *should* silence this error, but I also can't
> reproduce it at all to answer these questions, so uh. uhhh. I guess I
> will just hammer it on a loop a few hundred times more and see if I
> get lucky.

I could reproduce it by running 20 instances concurrently.  (Needs a
change to testrunner.py, so that the reference outputs don’t collide:

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
index a56b6da396..fd0a3a1eeb 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -221,7 +221,7 @@ def find_reference(self, test: str) -> str:

      def do_run_test(self, test: str) -> TestResult:
          f_test = Path(test)
-        f_bad = Path(f_test.name + '.out.bad')
+        f_bad = Path(f'{os.getpid()}-{f_test.name}.out.bad')
          f_notrun = Path(f_test.name + '.notrun')
          f_casenotrun = Path(f_test.name + '.casenotrun')
          f_reference = Path(self.find_reference(test))

)

And then:

$ while TEST_DIR=/tmp/vdi-$$ ./check -vdi 300; do; done

Which pretty quickly shows the error in at least one of those loops
(under a minute).

As far as I can tell, changing the log level in 300 does indeed fix it:

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index 10f9f2a8da..096f5dabf0 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -27,6 +27,7 @@ from typing import Dict, List, Optional
  from qemu.machine import machine

  import iotests
+from iotests import change_log_level


  BlockBitmapMapping = List[Dict[str, object]]
@@ -464,7 +465,8 @@ class TestBlockBitmapMappingErrors(TestDirtyBitmapMigration):
          # Expect abnormal shutdown of the destination VM because of
          # the failed migration
          try:
-            self.vm_b.shutdown()
+            with change_log_level('qemu.aqmp'):
+                self.vm_b.shutdown()
          except machine.AbnormalShutdown:
              pass
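
(For reference, the change_log_level helper imported above is a small context manager in iotests.py. Below is a minimal sketch of how such a helper can be implemented; the in-tree version may differ in its details, and the logging.CRITICAL default here is an assumption chosen to suppress the ERROR-level message seen in the failure.)

# Sketch of a change_log_level()-style context manager; details of the
# real iotests.py helper may differ.
import logging
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def change_log_level(logger_name: str,
                     level: int = logging.CRITICAL) -> Iterator[None]:
    """Temporarily change a logger's level, restoring it afterwards."""
    logger = logging.getLogger(logger_name)
    saved_level = logger.level
    logger.setLevel(level)
    try:
        yield
    finally:
        logger.setLevel(saved_level)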


Thanks for this! That helped a lot.

Long story short: in some cases we manage to issue a 'quit' command before we have acknowledged the EOF from the target VM, so we get an 'ExecInterruptedError' from the top half and an ECONNRESET from the bottom half.
This gives me some faith that there's no more sinister problem going on here.

There's really no way to "universally" solve this, though: it's always possible that when we go to send a command, the peer appears to be online no matter how recently we checked, and only upon actually sending the message do we find out that isn't true.
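
(Illustration only, not a change proposed here: the shape of that race for any caller of the async QMP client. The helper name below is made up; QMPClient and ExecInterruptedError are taken from the qemu.aqmp.qmp_client module named in the log output above.)

# Sketch: a caller that knows the peer may be exiting has to accept that
# the command can be interrupted mid-flight. Hypothetical helper name.
from qemu.aqmp.qmp_client import ExecInterruptedError, QMPClient

async def quit_tolerating_hangup(client: QMPClient) -> None:
    """Send 'quit', accepting that the peer may vanish before replying."""
    try:
        await client.execute('quit')
    except ExecInterruptedError:
        # The connection dropped before a reply arrived; for 'quit',
        # that outcome is expected rather than a failure.
        pass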

However, there's a cuter way to solve this particular instance: just change test 300 to wait() for the VM to exit on its own instead of trying to shut it down ourselves. That avoids the race over exactly which type of error we see when shutting down a dead-or-dying VM.
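
(Concretely, building on the hunk quoted above, the idea would look something like the following; a sketch of the approach, not necessarily the final patch.)

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -464,6 +464,3 @@ class TestBlockBitmapMappingErrors(TestDirtyBitmapMigration):
         # Expect abnormal shutdown of the destination VM because of
         # the failed migration
-        try:
-            self.vm_b.shutdown()
-        except machine.AbnormalShutdown:
-            pass
+        self.vm_b.wait()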

Thanks for the assistance on debugging this one.

--js
