Re: s390x TCG migration failure
From: Nina Schoetterl-Glausch
Subject: Re: s390x TCG migration failure
Date: Thu, 13 Apr 2023 13:42:49 +0200
User-agent: Evolution 3.46.4 (3.46.4-1.fc37)
On Wed, 2023-04-12 at 23:01 +0200, Juan Quintela wrote:
> Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> > Hi,
> >
> > We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
>
> As this is TCG, could you tell me the exact command that you are running?
> It needs to be on an s390x host, right?
I've just tried with a cross compile of kvm-unit-tests and that fails, too.
git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
cd kvm-unit-tests/
./configure --cross-prefix=s390x-linux-gnu- --arch=s390x
make
for i in {0..30}; do echo $i; QEMU=../qemu/build/qemu-system-s390x ACCEL=tcg \
  ./run_tests.sh migration-skey-sequential | grep FAIL && break; done
>
> $ time ./tests/qtest/migration-test
I haven't checked whether that test fails at all; we just noticed the failure
with the kvm-unit-tests.
> # random seed: R02S940c4f22abc48b14868566639d3d6c77
> # Skipping test: s390x host with KVM is required
> 1..0
>
> real 0m0.003s
> user 0m0.002s
> sys 0m0.001s
>
>
> > Some initial findings:
> > What seems to be happening is that after migration a control block
> > header accessed by the test code is all zeros which causes an
> > unexpected exception.
>
> What exception?
>
> What do you mean here by control block header?
It's all s390x test-guest-specific stuff, so I don't expect it to be too helpful.
The guest gets a specification exception program interrupt while executing a
SERVC because the SCCB control block is invalid.
See https://gitlab.com/qemu-project/qemu/-/issues/1565 for a code snippet.
The guest sets a bunch of fields in the SCCB header, but when TCG emulates the
SERVC, they are zero, which doesn't make sense.
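
As a rough illustration of why a zeroed header would be rejected, here is a
minimal sketch. It assumes that the SERVC emulation refuses an SCCB whose
length field is smaller than the header itself; that is an assumption on my
part, not something established in this thread. The struct layout, the field
names, and the helper sccb_length_valid() are hypothetical and are not QEMU's
actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 8-byte SCCB header; layout and field names are assumptions. */
struct sccb_header {
    uint16_t length;          /* total SCCB length as set by the guest */
    uint8_t  function_code;
    uint8_t  control_mask[3];
    uint16_t response_code;
};

/*
 * Hypothetical check: a length smaller than the header cannot describe a
 * valid SCCB, so the call would end in a specification exception.
 */
static bool sccb_length_valid(uint16_t length)
{
    return length >= sizeof(struct sccb_header);
}
```

Under that assumption, a header that reads back as all zeros has length 0 and
fails the check, which would match the specification exception seen here.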
>
> > I did a bisection which points to c8df4a7aef ("migration: Split
> > save_live_pending() into state_pending_*") as the culprit.
> > The migration issue persists after applying the fix e264705012
> > ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
> >
> > Applying
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 56ff9cd29d..2dc546cf28 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
> >
> > uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> >
> > - if (!migration_in_postcopy()) {
> > + if (!migration_in_postcopy() && remaining_size < max_size) {
>
> If the block is all zeros, then remaining_size should be zero, so always
> smaller than max_size.
>
> I don't really fully understand what is going on here.
>
> > qemu_mutex_lock_iothread();
> > WITH_RCU_READ_LOCK_GUARD() {
> > migration_bitmap_sync_precopy(rs);
> >
> > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> > I arrived at this by experimentation, I haven't looked into why this makes
> > a difference.
> >
> > Any thoughts on the matter appreciated.
>
> Later, Juan.
>
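
To make the difference between the two variants of the condition in the diff
quoted above easier to see, here is a minimal sketch that isolates them as
pure functions. The function names are hypothetical; in_postcopy stands in for
the result of migration_in_postcopy(). It only restates the two conditions and
does not explain why the comparison changes the outcome.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition on the "-" line: sync the bitmap whenever not in postcopy. */
static bool sync_without_threshold(bool in_postcopy)
{
    return !in_postcopy;
}

/*
 * Condition on the "+" line: additionally require that the estimated
 * remaining size is already below max_size before syncing.
 */
static bool sync_with_threshold(bool in_postcopy,
                                uint64_t remaining_size,
                                uint64_t max_size)
{
    return !in_postcopy && remaining_size < max_size;
}
```

The two only differ when remaining_size >= max_size outside of postcopy: the
first variant still syncs the dirty bitmap there, the second skips the sync.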