Re: s390x TCG migration failure

qemu-block

From:	Nina Schoetterl-Glausch
Subject:	Re: s390x TCG migration failure
Date:	Thu, 13 Apr 2023 13:42:49 +0200
User-agent:	Evolution 3.46.4 (3.46.4-1.fc37)

On Wed, 2023-04-12 at 23:01 +0200, Juan Quintela wrote:
> Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> > Hi,
> > 
> > We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> 
> As this is tcg, could you tell the exact command that you are running?
> Does it need to be on an s390x host, right?

I've just tried with a cross compile of kvm-unit-tests and that fails, too.

git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
cd kvm-unit-tests/
./configure --cross-prefix=s390x-linux-gnu- --arch=s390x
make
for i in {0..30}; do echo $i; QEMU=../qemu/build/qemu-system-s390x ACCEL=tcg ./run_tests.sh migration-skey-sequential | grep FAIL && break; done

> 
> $ time ./tests/qtest/migration-test

I haven't checked whether that test fails at all; we only noticed the problem
with the kvm-unit-tests.

> # random seed: R02S940c4f22abc48b14868566639d3d6c77
> # Skipping test: s390x host with KVM is required
> 1..0
> 
> real  0m0.003s
> user  0m0.002s
> sys   0m0.001s
> 
> 
> > Some initial findings:
> > What seems to be happening is that after migration a control block
> > header accessed by the test code is all zeros which causes an
> > unexpected exception.
> 
> What exception?
> 
> What do you mean here by control block header?

It's all s390x test-guest-specific stuff, so I don't expect it to be too helpful.
The guest gets a specification exception (a program interrupt) while executing
a SERVC because the SCCB control block is invalid.

See https://gitlab.com/qemu-project/qemu/-/issues/1565 for a code snippet.
The guest sets a bunch of fields in the SCCB header, but when TCG emulates the
SERVC they are zero, which doesn't make sense.
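
To illustrate why an all-zero header ends in a specification exception, here is
a minimal sketch. It is not QEMU's or kvm-unit-tests' actual code: the struct
layout is simplified, the names sccb_header, load_be16 and sccb_header_plausible
are made up for this mail, and it assumes the emulator rejects an SCCB whose
length field is smaller than the header itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified 8-byte SCCB header; field names are illustrative only. */
struct sccb_header {
    uint16_t length;          /* total SCCB length, big-endian in guest memory */
    uint8_t  function_code;
    uint8_t  control_mask[3];
    uint16_t response_code;
};

/* Read a big-endian 16-bit value from raw guest memory. */
static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical validity check (an assumption, not the real implementation):
 * returns 1 if the length field can at least cover the header, 0 otherwise.
 * A 0 here stands for the case where the guest would get a specification
 * exception.
 */
static int sccb_header_plausible(const uint8_t *guest_mem)
{
    assert(guest_mem != NULL);
    return load_be16(guest_mem) >= sizeof(struct sccb_header);
}
```

With a page that arrives zeroed on the destination, the length field reads as 0
and the check fails, even though the guest had stored a valid length before
migration.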

> 
> > I did a bisection which points to c8df4a7aef ("migration: Split 
> > save_live_pending() into state_pending_*") as the culprit.
> > The migration issue persists after applying the fix e264705012 ("migration: 
> > I messed state_pending_exact/estimate") on top of c8df4a7aef.
> > 
> > Applying
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 56ff9cd29d..2dc546cf28 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
> >  
> >      uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> >  
> > -    if (!migration_in_postcopy()) {
> > +    if (!migration_in_postcopy() && remaining_size < max_size) {
> 
> If block is all zeros, then remaining_size should be zero, so always
> smaller than max_size.
> 
> I don't really fully understand what is going here.
> 
> >          qemu_mutex_lock_iothread();
> >          WITH_RCU_READ_LOCK_GUARD() {
> >              migration_bitmap_sync_precopy(rs);
> > 
> > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> > I arrived at this by experimentation, I haven't looked into why this makes 
> > a difference.
> > 
> > Any thoughts on the matter appreciated.
> 
> Later, Juan.
>
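
To make the effect of the one-line change above concrete, here is a small
stand-alone model of the decision whether to sync the dirty bitmap. It is only
a sketch: sync_exact_current and sync_exact_patched are hypothetical helpers,
not functions in migration/ram.c, and the parameters stand in for
migration_in_postcopy(), rs->migration_dirty_pages and TARGET_PAGE_SIZE. Note
that remaining_size is derived from the number of dirty pages, not from what
the pages contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition as it stands after c8df4a7aef: sync whenever not in postcopy. */
static bool sync_exact_current(bool in_postcopy)
{
    return !in_postcopy;
}

/*
 * Condition with the experimental change applied: additionally require that
 * the estimated remaining data is below max_size.
 */
static bool sync_exact_patched(bool in_postcopy, uint64_t dirty_pages,
                               uint64_t page_size, uint64_t max_size)
{
    assert(page_size != 0);
    uint64_t remaining_size = dirty_pages * page_size;

    return !in_postcopy && remaining_size < max_size;
}
```

In this model the two variants differ only while remaining_size >= max_size:
the current code syncs the bitmap in that case, the patched code skips the
sync.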
