[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169
From: |
Dr. David Alan Gilbert |
Subject: |
Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169 |
Date: |
Thu, 24 Jan 2019 20:15:22 +0000 |
User-agent: |
Mutt/1.10.1 (2018-07-13) |
* Kevin Wolf (address@hidden) wrote:
> Am 24.01.2019 um 11:49 hat Dr. David Alan Gilbert geschrieben:
> > * Kevin Wolf (address@hidden) wrote:
> > > Am 24.01.2019 um 10:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > 23.01.2019 18:48, Max Reitz wrote:
> > > > > Hi,
> > > > >
> > > > > When running 169 in parallel (e.g. like so:
> > > > >
> > > > > $ while TEST_DIR=/tmp/t0 ./check -T -qcow2 169; do; done
> > > > > $ while TEST_DIR=/tmp/t1 ./check -T -qcow2 169; do; done
> > > > > $ while TEST_DIR=/tmp/t2 ./check -T -qcow2 169; do; done
> > > > > $ while TEST_DIR=/tmp/t3 ./check -T -qcow2 169; do; done
> > > > >
> > > > > in four different shells), I get aborts:
> > > > >
> > > > > (Often I get segfaults, but that's because of
> > > > > http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html
> > > > > --
> > > > > feel free to apply the attached patch to make them go away)
> > > > >
> > > > >
> > > > > WARNING:qemu:qemu received signal 6:
> > > > > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > > > > -chardev socket,id=mon,path=/tmp/t0/tmpbX30XU/qemua-25745-monitor.sock
> > > > > -mon chardev=mon,mode=control -display none -vga none -qtest
> > > > > unix:path=/tmp/t0/qemua-25745-qtest.sock -machine accel=qtest
> > > > > -nodefaults -machine accel=qtest -drive
> > > > > if=virtio,id=drive0,file=/tmp/t0/disk_a,format=qcow2,cache=writeback
> > > > > .................E..
> > > > > ======================================================================
> > > > > ERROR:
> > > > > test_do_test_migration_resume_source_not_persistent__not_migbitmap
> > > > > (__main__.TestDirtyBitmapMigration)
> > > > > ----------------------------------------------------------------------
> > > > > Traceback (most recent call last):
> > > > > File "169", line 206, in <lambda>
> > > > > setattr(klass, 'test_' + method + name, lambda self: mc(self))
> > > > > File "169", line 113, in do_test_migration_resume_source
> > > > > self.check_bitmap(self.vm_a, sha256)
> > > > > File "169", line 72, in check_bitmap
> > > > > node='drive0', name='bitmap0')
> > > > > File "tests/qemu-iotests/../../scripts/qemu.py", line 369, in qmp
> > > > > return self._qmp.cmd(cmd, args=qmp_args)
> > > > > File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in
> > > > > cmd
> > > > > return self.cmd_obj(qmp_cmd)
> > > > > File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in
> > > > > cmd_obj
> > > > > resp = self.__json_read()
> > > > > File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in
> > > > > __json_read
> > > > > data = self.__sockfile.readline()
> > > > > File "/usr/lib64/python2.7/socket.py", line 451, in readline
> > > > > data = self._sock.recv(self._rbufsize)
> > > > > error: [Errno 104] Connection reset by peer
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > Ran 20 tests
> > > > >
> > > > > FAILED (errors=1)
> > > > >
> > > > >
> > > > > Or:
> > > > >
> > > > > WARNING:qemu:qemu received signal 6:
> > > > > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > > > > -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> > > > > -mon chardev=mon,mode=control -display none -vga none -qtest
> > > > > unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest
> > > > > -nodefaults
> > > > > -machine accel=qtest -drive
> > > > > if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> > > > > WARNING:qemu:qemu received signal 6:
> > > > > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> > > > > -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> > > > > -mon chardev=mon,mode=control -display none -vga none -qtest
> > > > > unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest
> > > > > -nodefaults
> > > > > -machine accel=qtest -drive
> > > > > if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> > > > >
> > > > > ...................F
> > > > > ======================================================================
> > > > > FAIL: test_do_test_migration_resume_source_persistent__not_migbitmap
> > > > > (__main__.TestDirtyBitmapMigration)
> > > > > ----------------------------------------------------------------------
> > > > > Traceback (most recent call last):
> > > > > File "169", line 206, in <lambda>
> > > > > setattr(klass, 'test_' + method + name, lambda self: mc(self))
> > > > > File "169", line 125, in do_test_migration_resume_source
> > > > > self.assertEqual(log, '')
> > > > > AssertionError: "qemu-system-x86_64: invalid runstate transition:
> > > > > 'running' -> 'postmigrate'\n" != ''
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > Ran 20 tests
> > > > >
> > > > > FAILED (failures=1)
> > > > >
> > > > >
> > > > > The backtrace always goes like this:
> > > > >
> > > > > (gdb) bt
> > > > > #0 0x00007f0acf5cc53f in raise () at /lib64/libc.so.6
> > > > > #1 0x00007f0acf5b6895 in abort () at /lib64/libc.so.6
> > > > > #2 0x000055a46ebbb1a6 in runstate_set
> > > > > (new_state=RUN_STATE_POSTMIGRATE)
> > > > > at vl.c:742
> > > > > #3 0x000055a46ebbb1a6 in runstate_set
> > > > > (address@hidden) at vl.c:730
> > > > > #4 0x000055a46ed39129 in migration_iteration_finish
> > > > > (s=0x55a4708be000)
> > > > > at migration/migration.c:2972
> > > > > #5 0x000055a46ed39129 in migration_thread
> > > > > (address@hidden) at migration/migration.c:3130
> > > > > #6 0x000055a46eea665a in qemu_thread_start (args=<optimized out>) at
> > > > > util/qemu-thread-posix.c:502
> > > > >
> > > > >
> > > > > #7 0x00007f0acf76258e in start_thread () at /lib64/libpthread.so.0
> > > > > #8 0x00007f0acf6916a3 in clone () at /lib64/libc.so.6
> > > > > (gdb) frame 2
> > > > > #2 0x000055a46ebbb1a6 in runstate_set
> > > > > (new_state=RUN_STATE_POSTMIGRATE)
> > > > > at vl.c:742
> > > > > 742 abort();
> > > > > (gdb) print current_run_state
> > > > > $1 = RUN_STATE_RUNNING
> > > > >
> > > > >
> > > > > Neither of migration or runstates are my strong suite, so I thought
> > > > > I'd
> > > > > report it before diving into it.
> > > > >
> > > > > Max
> > > > >
> > > >
> > > > [...]
> > > > address@hidden:migrate_set_state new state active
> > > > address@hidden:migration_thread_setup_complete
> > > > address@hidden:migrate_transferred transferred 985298 time_spent 100
> > > > bandwidth 9852 max_size 2955894
> > > > address@hidden:migration_bitmap_sync_start
> > > > address@hidden:migration_bitmap_sync_end dirty_pages 32898
> > > > address@hidden:migration_thread_low_pending 2048
> > > > migration_completion
> > > > address@hidden:runstate_set current_state 9 new_state 7
> > > > address@hidden:migration_bitmap_sync_start
> > > > address@hidden:migration_bitmap_sync_end dirty_pages 32898
> > > > address@hidden:migrate_global_state_pre_save saved state: running
> > > > [1] address@hidden:migrate_set_state new state completed
> > > > address@hidden:handle_qmp_command mon 0x55aebef99830 req: {"execute":
> > > > "x-debug-block-dirty-bitmap-sha256", "arguments": {"name": "bitmap0",
> > > > "node": "drive0"}}
> > > > address@hidden:migration_thread_after_loop
> > > > [2] address@hidden:handle_qmp_command mon 0x55aebef99830 req:
> > > > {"execute": "cont"}
> > > > address@hidden:runstate_set current_state 7 new_state 9
> > > > qemu-system-x86_64: invalid runstate transition: 'running' ->
> > > > 'postmigrate'
> > > >
> > > >
> > > > Aha, so, in test we wait for MIGRATION COMPLETED [1] and run qmp_cont
> > > > [2] and go to RUNNING.
> > > > then in migration thread we go to migration_iteration_finish() and fail
> > > > to go to POSTMIGRATE.
> > > > (note, that this testcase is about resuming source after migration)
> >
> > Ah that makes sense as to why we failed.
> >
> > > > So, fix for test may be additionally wait for POSTMIGRATE, not only for
> > > > MIGRATION COMPLETION.
> > > >
> > > > But how to fix Qemu not to crash? May be, forbid some transitions
> > > > (FINISH_MIGRATE -> RUNNING),
> > > > or at least error-out qmp_cont if runstate is FINISH_MIGRATE?
> > >
> > > qmp_cont currently checks for RUN_STATE_INMIGRATE and gives it special
> > > treatment. As a quick hack, doing the same for RUN_STATE_FINISH_MIGRATE
> > > might fix the problem. A better simple fix would possibly include a
> > > whitelist of states where you can use 'cont' to avoid that the user
> > > messes with other internal states.
> > >
> > > However, like the problem that Dave reported a few days ago (migrating
> > > twice leads to a crash because we try to inactivate already inactive
> > > block devices), I think this is a symptom of a larger problem. We're
> > > not taking the state machine serious enough. One-off checks here and
> > > there are unlikely to ever fully solve the problem of running a command
> > > in a runstate in which it was never supposed to run.
> > >
> > > I wonder whether the QAPI schema should have a field 'run-states' for
> > > commands, and by default we would only include states where the VM has
> > > ownership of its resources (e.g. images are activated) and which are not
> > > temporary states that are automatically left, like finish-migrate.
> > >
> > > Then the default for commands is to be rejected in "unusual" runstates
> > > where we're not expecting user intervention, and we must explicitly
> > > allow them if they are okay, in fact.
> > >
> > > Instead of listing every obscure runstate, maybe we should really use
> > > categories of runstates instead:
> > >
> > > 1. Running
> > > 2. Paused, owns all resources (like disk images)
> > > 3. Paused, doesn't own some resources (source VM after migration
> > > completes, destination before migration completes)
> > > 4. Paused temporarily for internal reasons (e.g. finish-migrate,
> > > restore-vm, save-vm)
> > >
> > > Most commands should be okay with 1 and 2, but possibly not 3, and
> > > almost never 4.
> >
> > The problem is it's tricky to get those groupings right and to figure
> > out how they all interact. For example it's perfectly fine for most
> > query type commands to run in (3) and (4), and you need a bunch of
> > commands to let you recover from failures that might leave you in (3)
> > and cont is one of those that you might use after a particular form
> > of failed migration.
>
> I'm not saying it is always easy, though I expect that many cases will
> actually be rather obvious.
>
> 'cont' will be one of the commands that need to work in (3), but it
> needs internal checks to make sure it behaves correctly. This is fine,
> if we declare in the schema that a command works in (3), we'll also
> have reviewed that it can actually handle the situation correctly.
Maybe; but we've got deeper problems that many of the other runstate
changes aren't safe and other qmp commands aren't safe during parts
of migration. About the only safety system we have is the BQL and
we can't take that around the whole of migration; so we've got this
dance we do during the end of migration where we take the lock some of
the time (no one I've asked can tell me why) but there are places
where things can jump in.
Interestingly the migration state machine uses a 'set_state' call
that most of the time is called with the state you expect to be
in, e.g.:
migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_COMPLETED);
only causes a transition from active->completed and you can check
the old state if you want. This is safer than what we use
for runstates, because it avoids a lot of the odder arbitrary
transitions.
Dave
> I think being explicit about the expected runstates for each command
> would already go a long way in addressing this class of bugs, because it
> reminds us to even think about unusual runstates.
>
> Kevin
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
- Re: [Qemu-devel] Aborts in iotest 169, (continued)
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169, Kevin Wolf, 2019/01/24
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169, Dr. David Alan Gilbert, 2019/01/24
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169, Kevin Wolf, 2019/01/24
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169,
Dr. David Alan Gilbert <=
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169, Vladimir Sementsov-Ogievskiy, 2019/01/24
- Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169, Eric Blake, 2019/01/24