| From: | Stefan Weil |
| Subject: | Re: Timeouts in CI jobs |
| Date: | Wed, 24 Apr 2024 20:10:19 +0200 |
| User-agent: | Mozilla Thunderbird |
On 24.04.24 at 19:09, Daniel P. Berrangé wrote:
> On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
>> I think the timeouts are caused by running too many parallel processes
>> during testing. The CI uses parallel builds:
>>
>>     make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS
>
> Note that command is running both the compile and test phases of the
> job. Overcommitting CPUs for the compile phase is a good idea to keep
> CPUs busy while another process is waiting on I/O, and is almost
> always safe to do.
Thank you for your answer.
Overcommitting for the build is safe, but in my experience the positive effect is typically very small on modern hosts with fast disk I/O and large buffer caches.
And there is also a negative impact, because overcommitting forces the scheduler to perform additional process switches.
Therefore I am not so sure that overcommitting is a good idea, especially not on cloud servers where the jobs are running in VMs.
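If the job script split the two phases, the build could keep its overcommit while the tests get a smaller worker count. A rough sketch of what I mean; the direct meson call simply mirrors the command from the CI log quoted below, and I am ignoring here how $MAKE_CHECK_ARGS is normally translated into that call:

    # overcommit only while compiling
    make -j$(expr $(nproc) + 1) all check-build
    # then run the tests with at most one worker per CPU
    # (run from the build directory)
    pyvenv/bin/meson test --no-rebuild -t 1 --num-processes $(nproc) --print-errorlogs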
> Overcommitting CPUs for the test phase is less helpful and can cause
> a variety of problems as you say.
>
>> It looks like `nproc` returns 8, and make runs with 9 threads.
>> `meson test` uses the same value to run 9 test processes in parallel:
>>
>>     /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
>>
>> Since the host can only handle 8 parallel threads, 9 threads might
>> already cause some tests to run non-deterministically.
>
> In contributor forks, gitlab CI will be using the public shared
> runners. These are Google Cloud VMs, which only have 2 vCPUs. In the
> primary QEMU repo, we have a custom runner registered that uses Azure
> based VMs. Not sure on the resources we have configured for them
> offhand.
I was talking about the primary QEMU repository.
> The important thing there is that what you see for CI speed in your
> fork repo is not necessarily a match for CI speed in the QEMU upstream
> repo.
I did not run tests in my GitLab fork because I still have to figure out how to do that.
In my initial answer to Peter's mail I had described my tests and the test environment in detail.
My test environment was an older (= slow) VM with 4 cores. I tested with different values for --num-processes. As expected, higher values raised the number of timeouts. The most interesting result was that `--num-processes 1` avoided timeouts, used less CPU time and did not increase the overall duration.
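For anyone who wants to reproduce such a comparison, it can be scripted roughly like this (the chosen worker counts and the use of GNU time are only an illustration, not the exact commands I ran):

    # compare wall clock time and CPU time for several worker counts
    for n in 1 2 4 8 9; do
        /usr/bin/time -v meson test -C build --no-rebuild -t 1 --num-processes $n --print-errorlogs
    done

GNU time reports both the elapsed wall clock time and the user/system CPU time, so the overhead of the higher worker counts shows up directly in the output.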
>> In my tests setting --num-processes to a lower value not only avoided
>> timeouts but also reduced the processing overhead without increasing
>> the runtime. Could we run all tests with `--num-processes 1`?
>
> The question is what impact that has on the overall job execution
> time. A lot of our jobs are already quite long, which is bad for the
> turnaround time of CI testing. Reliable CI is arguably the #1 priority
> though, otherwise developers cease trusting it. We need to find the
> balance between avoiding timeouts and having the shortest practical
> job time. The TCI job you showed above came out at 22 minutes, which
> is not our worst job, so there is some scope for allowing it to run
> longer with less parallelism.
The TCI job terminates after less than 7 minutes in my test runs with less parallelism.
Obviously there are tests which already do their own
multithreading, and maybe other tests run single threaded. So
maybe we need different values for `--num-processes` depending on
the number of threads which the individual tests use?
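One way to approximate that without per-test tuning might be to give different test suites different worker counts; the suite names below are only placeholders, not the actual QEMU suite layout:

    # hypothetical split: lightweight suites in parallel, heavy ones serially
    meson test -C build --no-rebuild -t 1 --num-processes $(nproc) --suite unit
    meson test -C build --no-rebuild -t 1 --num-processes 1 --suite slow

If I read the Meson documentation correctly, individual tests can also be declared with is_parallel: false so that they never run concurrently with other tests, which might already be enough for the tests that spawn many threads themselves.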
Regards,
Stefan