
Re: Timeouts in CI jobs


From: Stefan Weil
Subject: Re: Timeouts in CI jobs
Date: Wed, 24 Apr 2024 20:10:19 +0200
User-agent: Mozilla Thunderbird

On 24.04.24 at 19:09, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil wrote:
>> I think the timeouts are caused by running too many parallel processes
>> during testing.
>>
>> The CI uses parallel builds:
>>
>>     make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
>
> Note that command is running both the compile and test phases of
> the job. Overcommitting CPUs for the compile phase is a good
> idea to keep CPUs busy while another process is waiting on
> I/O, and is almost always safe to do.


Thank you for your answer.

Overcommitting for the build is safe, but in my experience the positive effect is typically very small on modern hosts with fast disk I/O and large buffer caches. There is also a negative effect, because the extra process causes additional scheduling and context switches.

Therefore I am not so sure that overcommitting is a good idea, especially not on cloud servers where the jobs run inside VMs.
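To make the suggestion concrete, here is a sketch (not the actual CI configuration; the split into two phases is my proposal, and I am assuming `make check` forwards `MTESTARGS` to `meson test` as described in the QEMU build docs) that overcommits only the compile phase:

```shell
# Sketch only: split "make -j$(expr $(nproc) + 1) all check-build"
# into a build phase with overcommit and a test phase without it.
nproc=8                     # value reported by `nproc` on the CI runner
build_jobs=$((nproc + 1))   # +1 keeps a CPU busy while another waits on I/O
test_jobs=$nproc            # tests: no overcommit

# The real job would execute these; printed here for illustration.
echo "make -j$build_jobs all"
echo "make -j$test_jobs check-build MTESTARGS=\"--num-processes $test_jobs\""
```

On an 8-CPU runner this would build with 9 jobs but run at most 8 test processes in parallel.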

> Overcommitting CPUs for the test phase is less helpful and
> can cause a variety of problems as you say.

>> It looks like `nproc` returns 8, and make runs with 9 threads.
>> `meson test` uses the same value to run 9 test processes in parallel:
>>
>>     /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1
>>     --num-processes 9 --print-errorlogs
>>
>> Since the host can only handle 8 parallel threads, running 9 might already
>> cause some tests to behave non-deterministically.
> In contributor forks, gitlab CI will be using the public shared
> runners. These are Google Cloud VMs, which only have 2 vCPUs.
>
> In the primary QEMU repo, we have a custom runner registered
> that uses Azure based VMs. Not sure on the resources we have
> configured for them offhand.

I was talking about the primary QEMU repository.

> The important thing there is that what you see for CI speed in
> your fork repo is not necessarily a match for CI speed in the QEMU
> upstream repo.

I did not run tests in my GitLab fork because I still have to figure out how to do that.

In my initial answer to Peter's mail I described my tests and the test environment in detail.

My test environment was an older (i.e. slow) VM with 4 cores. I tested with different values for `--num-processes`: as expected, higher values raised the number of timeouts. The most interesting result was that `--num-processes 1` avoided the timeouts, used less CPU time, and did not increase the total duration.
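For reference, my comparison runs had roughly this shape (illustrative only: the real runs executed the timed command instead of printing it, and the meson command line is the one quoted below with a different `--num-processes` value):

```shell
# Rough shape of the comparison: time the same test run with
# different parallelism levels. Printed here instead of executed.
for n in 1 2 4; do
    printf 'time meson test --no-rebuild -t 1 --num-processes %d --print-errorlogs\n' "$n"
done
```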

>> In my tests, setting --num-processes to a lower value not only avoided
>> timeouts but also reduced the processing overhead without increasing the
>> runtime.
>>
>> Could we run all tests with `--num-processes 1`?
> The question is what impact that has on the overall job execution
> time. A lot of our jobs are already quite long, which is bad for
> the turnaround time of CI testing. Reliable CI, though, is arguably
> the #1 priority, otherwise developers cease trusting it.
> We need to find the balance between avoiding timeouts and having
> the shortest practical job time. The TCI job you showed above came
> out at 22 minutes, which is not our worst job, so there is some
> scope for allowing it to run longer with less parallelism.

The TCI job terminates after less than 7 minutes in my test runs with less parallelism.

Obviously some tests already do their own multithreading, while others run single-threaded. So maybe we need different values for `--num-processes`, depending on how many threads the individual tests use?
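A rough illustration of that idea (the suite names and thread counts are hypothetical; I don't know whether our meson test suites are actually grouped by threading behaviour):

```shell
# Hypothetical: choose --num-processes per suite based on how many
# threads the tests in that suite spawn themselves.
run_suite() {
    suite=$1
    procs=$2
    # Printed for illustration; a real script would execute this.
    echo "meson test --no-rebuild --suite $suite --num-processes $procs"
}

run_suite qtest 1   # tests that are internally multi-threaded: run serially
run_suite unit  4   # single-threaded tests on a 4-core machine
```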

Regards,

Stefan

