Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169


From: Kevin Wolf
Subject: Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169
Date: Thu, 24 Jan 2019 11:15:59 +0100
User-agent: Mutt/1.10.1 (2018-07-13)

Am 24.01.2019 um 10:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 23.01.2019 18:48, Max Reitz wrote:
> > Hi,
> > 
> > When running 169 in parallel (e.g. like so:
> > 
> > $ while TEST_DIR=/tmp/t0 ./check -T -qcow2 169; do; done
> > $ while TEST_DIR=/tmp/t1 ./check -T -qcow2 169; do; done
> > $ while TEST_DIR=/tmp/t2 ./check -T -qcow2 169; do; done
> > $ while TEST_DIR=/tmp/t3 ./check -T -qcow2 169; do; done
> > 
> > in four different shells), I get aborts:
> > 
> > (Often I get segfaults, but that's because of
> > http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html --
> > feel free to apply the attached patch to make them go away)
> > 
> > 
> > WARNING:qemu:qemu received signal 6:
> > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > -chardev socket,id=mon,path=/tmp/t0/tmpbX30XU/qemua-25745-monitor.sock
> > -mon chardev=mon,mode=control -display none -vga none -qtest
> > unix:path=/tmp/t0/qemua-25745-qtest.sock -machine accel=qtest
> > -nodefaults -machine accel=qtest -drive
> > if=virtio,id=drive0,file=/tmp/t0/disk_a,format=qcow2,cache=writeback
> > .................E..
> > ======================================================================
> > ERROR:
> > test_do_test_migration_resume_source_not_persistent__not_migbitmap
> > (__main__.TestDirtyBitmapMigration)
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >    File "169", line 206, in <lambda>
> >      setattr(klass, 'test_' + method + name, lambda self: mc(self))
> >    File "169", line 113, in do_test_migration_resume_source
> >      self.check_bitmap(self.vm_a, sha256)
> >    File "169", line 72, in check_bitmap
> >      node='drive0', name='bitmap0')
> >    File "tests/qemu-iotests/../../scripts/qemu.py", line 369, in qmp
> >      return self._qmp.cmd(cmd, args=qmp_args)
> >    File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in cmd
> >      return self.cmd_obj(qmp_cmd)
> >    File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in cmd_obj
> >      resp = self.__json_read()
> >    File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in
> > __json_read
> >      data = self.__sockfile.readline()
> >    File "/usr/lib64/python2.7/socket.py", line 451, in readline
> >      data = self._sock.recv(self._rbufsize)
> > error: [Errno 104] Connection reset by peer
> > 
> > ----------------------------------------------------------------------
> > Ran 20 tests
> > 
> > FAILED (errors=1)
> > 
> > 
> > Or:
> > 
> > WARNING:qemu:qemu received signal 6:
> > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> > -mon chardev=mon,mode=control -display none -vga none -qtest
> > unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults
> > -machine accel=qtest -drive
> > if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> > WARNING:qemu:qemu received signal 6:
> > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> > -mon chardev=mon,mode=control -display none -vga none -qtest
> > unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults
> > -machine accel=qtest -drive
> > if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> > 
> > ...................F
> > ======================================================================
> > FAIL: test_do_test_migration_resume_source_persistent__not_migbitmap
> > (__main__.TestDirtyBitmapMigration)
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >    File "169", line 206, in <lambda>
> >      setattr(klass, 'test_' + method + name, lambda self: mc(self))
> >    File "169", line 125, in do_test_migration_resume_source
> >      self.assertEqual(log, '')
> > AssertionError: "qemu-system-x86_64: invalid runstate transition:
> > 'running' -> 'postmigrate'\n" != ''
> > 
> > ----------------------------------------------------------------------
> > Ran 20 tests
> > 
> > FAILED (failures=1)
> > 
> > 
> > The backtrace always goes like this:
> > 
> > (gdb) bt
> > #0  0x00007f0acf5cc53f in raise () at /lib64/libc.so.6
> > #1  0x00007f0acf5b6895 in abort () at /lib64/libc.so.6
> > #2  0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE)
> > at vl.c:742
> > #3  0x000055a46ebbb1a6 in runstate_set
> > (address@hidden) at vl.c:730
> > #4  0x000055a46ed39129 in migration_iteration_finish (s=0x55a4708be000)
> > at migration/migration.c:2972
> > #5  0x000055a46ed39129 in migration_thread
> > (address@hidden) at migration/migration.c:3130
> > #6  0x000055a46eea665a in qemu_thread_start (args=<optimized out>) at
> > util/qemu-thread-posix.c:502
> > #7  0x00007f0acf76258e in start_thread () at /lib64/libpthread.so.0
> > #8  0x00007f0acf6916a3 in clone () at /lib64/libc.so.6
> > (gdb) frame 2
> > #2  0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE)
> > at vl.c:742
> > 742             abort();
> > (gdb) print current_run_state
> > $1 = RUN_STATE_RUNNING
> > 
> > 
> > Neither migration nor runstates are my strong suit, so I thought I'd
> > report this before diving into it.
> > 
> > Max
> > 
> 
> [...]
> address@hidden:migrate_set_state new state active
> address@hidden:migration_thread_setup_complete
> address@hidden:migrate_transferred transferred 985298 time_spent 100 bandwidth 9852 max_size 2955894
> address@hidden:migration_bitmap_sync_start
> address@hidden:migration_bitmap_sync_end dirty_pages 32898
> address@hidden:migration_thread_low_pending 2048
> migration_completion
> address@hidden:runstate_set current_state 9 new_state 7
> address@hidden:migration_bitmap_sync_start
> address@hidden:migration_bitmap_sync_end dirty_pages 32898
> address@hidden:migrate_global_state_pre_save saved state: running
> [1] address@hidden:migrate_set_state new state completed
> address@hidden:handle_qmp_command mon 0x55aebef99830 req: {"execute": "x-debug-block-dirty-bitmap-sha256", "arguments": {"name": "bitmap0", "node": "drive0"}}
> address@hidden:migration_thread_after_loop
> [2] address@hidden:handle_qmp_command mon 0x55aebef99830 req: {"execute": "cont"}
> address@hidden:runstate_set current_state 7 new_state 9
> qemu-system-x86_64: invalid runstate transition: 'running' -> 'postmigrate'
> 
> 
> Aha, so in the test we wait for MIGRATION COMPLETED [1], run qmp_cont [2]
> and go to RUNNING. Then, in the migration thread, we reach
> migration_iteration_finish() and fail to transition to POSTMIGRATE.
> (Note that this test case is about resuming the source after migration.)
> 
> So a fix for the test may be to additionally wait for POSTMIGRATE, not
> only for MIGRATION COMPLETED.
> 
> But how do we fix QEMU so it doesn't crash? Maybe forbid some transitions
> (FINISH_MIGRATE -> RUNNING), or at least have qmp_cont error out if the
> runstate is FINISH_MIGRATE?

qmp_cont currently checks for RUN_STATE_INMIGRATE and gives it special
treatment. As a quick hack, doing the same for RUN_STATE_FINISH_MIGRATE
might fix the problem. A better simple fix would be a whitelist of states
in which 'cont' is allowed, so that the user can't mess with other
internal states.
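
For illustration, a minimal sketch of that quick hack, assuming the
runstate_check() and error_setg() helpers from the tree; the surrounding
qmp_cont() code is elided and the error message here is made up:

    void qmp_cont(Error **errp)
    {
        /* Sketch only: reject 'cont' while the migration thread is still
         * finishing the migration, instead of racing with its runstate
         * transition to 'postmigrate'. */
        if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
            error_setg(errp, "Migration is finishing, 'cont' not allowed yet");
            return;
        }

        /* ... existing RUN_STATE_INMIGRATE handling and vm_start() ... */
    }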

However, like the problem that Dave reported a few days ago (migrating
twice leads to a crash because we try to inactivate already inactive
block devices), I think this is a symptom of a larger problem: we're not
taking the state machine seriously enough. One-off checks here and there
are unlikely to ever fully solve the problem of running a command in a
runstate in which it was never supposed to run.

I wonder whether the QAPI schema should have a 'run-states' field for
commands, whose default would include only states where the VM owns its
resources (e.g. its images are activated) and which are not temporary
states that are left automatically, like finish-migrate.
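
To make that concrete, a purely hypothetical sketch of what a generated
per-command check could look like on the C side; RunStateMask,
QMP_DEFAULT_RUN_STATES, qmp_command_allowed_now() and the runstate_get()
accessor are invented names for illustration, not existing QEMU code:

    /* Hypothetical: a bitmask of allowed runstates, generated from a
     * 'run-states' field in the QAPI schema. */
    typedef uint64_t RunStateMask;

    #define RUN_STATE_BIT(s) (1ULL << (s))

    /* Default: only states where the VM owns its resources. */
    #define QMP_DEFAULT_RUN_STATES \
        (RUN_STATE_BIT(RUN_STATE_RUNNING) | RUN_STATE_BIT(RUN_STATE_PAUSED))

    static bool qmp_command_allowed_now(RunStateMask allowed, Error **errp)
    {
        RunState s = runstate_get();    /* hypothetical accessor */

        if (!(allowed & RUN_STATE_BIT(s))) {
            error_setg(errp, "Command not allowed in the current runstate");
            return false;
        }
        return true;
    }

The generated QMP dispatcher would then call this before invoking the
actual command handler, with the mask defaulting to QMP_DEFAULT_RUN_STATES
unless the schema says otherwise.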

Commands would then be rejected by default in "unusual" runstates where
we're not expecting user intervention, and commands that are in fact okay
in such states would have to be allowed explicitly.

Instead of listing every obscure runstate, maybe we should really use
categories of runstates:

1. Running
2. Paused, owns all resources (like disk images)
3. Paused, doesn't own some resources (source VM after migration
   completes, destination before migration completes)
4. Paused temporarily for internal reasons (e.g. finish-migrate,
   restore-vm, save-vm)

Most commands should be okay with 1 and 2, but possibly not 3, and
almost never 4.
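
As a sketch of how such categories could be attached to the existing
RunState values (the category enum and the mapping are invented for
illustration; only the RUN_STATE_* constants are real):

    /* Hypothetical categories of runstates, mirroring the list above. */
    typedef enum RunStateCategory {
        RUN_CAT_RUNNING,        /* 1. running */
        RUN_CAT_PAUSED_OWNED,   /* 2. paused, owns all resources */
        RUN_CAT_PAUSED_UNOWNED, /* 3. paused, doesn't own some resources */
        RUN_CAT_INTERNAL,       /* 4. paused temporarily, internal reasons */
    } RunStateCategory;

    static RunStateCategory runstate_category(RunState s)
    {
        switch (s) {
        case RUN_STATE_RUNNING:
            return RUN_CAT_RUNNING;
        case RUN_STATE_POSTMIGRATE:     /* source after migration */
        case RUN_STATE_INMIGRATE:       /* destination before migration */
            return RUN_CAT_PAUSED_UNOWNED;
        case RUN_STATE_FINISH_MIGRATE:
        case RUN_STATE_SAVE_VM:
        case RUN_STATE_RESTORE_VM:
            return RUN_CAT_INTERNAL;
        default:
            return RUN_CAT_PAUSED_OWNED;
        }
    }

A command would then declare which categories it tolerates, and the
dispatcher would reject it in the others.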

Kevin


