From: Ludovic Courtès
Subject: Re: Thoughts on building things for substitutes and the Guix Build Coordinator
Date: Fri, 20 Nov 2020 11:12:49 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Hi Chris,

Christopher Baines skribis:

>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>> perfect world, everything would succeed first time, but because the
>>> world isn't perfect, there still can be intermittent build
>>> failures. Retrying failed builds even once can help reduce the chance
>>> that a failure leads to no substitutes for that build as well as any
>>> builds that depend on that output.
>> That’s nice too; it’s one of the practical issues we have with Cuirass
>> and that’s tempting to ignore because “hey it’s all functional!”, but
>> then reality gets in the way.
> One further benefit related to this is that if you want to manually
> retry building a derivation, you just submit a new build for that
> derivation.
> The Guix Build Coordinator also has no concept of "Failed (dependency)",
> it never gives up. This avoids the situation where spurious failures
> block other builds.

I think there’s a balance to be found.  Being able to retry is nice, but
“never giving up” is not: on a build farm, you could end up always
rebuilding the same derivation that in the end always fails, and that
can be a huge resource waste.

On berlin we run the daemon with ‘--cache-failures’.  It’s not great
because, again, it prevents further builds altogether.

Which makes me think we could change the daemon to have a threshold: it
would maintain a derivation build failure count (instead of a Boolean)
and would only prevent rebuilds once a failure threshold has been
reached.
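
A minimal sketch of a failure count with a threshold, written in Python for
brevity and not taken from the daemon.  ‘FailureCache’, its methods, and the
default threshold of 3 are hypothetical names and values chosen for
illustration:

```python
class FailureCache:
    """Hypothetical per-derivation failure counter (not the daemon's API)."""

    def __init__(self, threshold=3):
        # Assumed default; the right threshold is a policy decision.
        self.threshold = threshold
        self.failures = {}

    def record_failure(self, drv):
        # Count the failure instead of storing a Boolean "failed" flag.
        self.failures[drv] = self.failures.get(drv, 0) + 1

    def record_success(self, drv):
        # A successful build clears the count for that derivation.
        self.failures.pop(drv, None)

    def may_build(self, drv):
        # Rebuilds are refused only once the threshold has been reached.
        return self.failures.get(drv, 0) < self.threshold
```

With a threshold of 1 this behaves like a Boolean failure cache; a higher
value allows a bounded number of retries before rebuilds are refused.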

>>> Because the build results don't end up in a store (they could, but as
>>> set out above, not being in the store is a feature I think), you can't
>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>> ideas about what to implement to provide some kind of GC approach over a
>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>> ‘guix publish’ has support for that via (guix cache), so if we could
>> share code, that’d be great.
> Guix publish does time based deletion, based on when the files were
> first created, right? If that works for people, that's fine I guess.

Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
lets you implement other policies.
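
A rough sketch of an atime-based expiry policy, in Python rather than Guile
and not using the actual (guix cache) interface.  ‘expired_entries’ and its
parameters are hypothetical names for illustration:

```python
import os
import time


def expired_entries(directory, ttl_seconds, now=None):
    """Return the files under DIRECTORY last accessed more than
    TTL_SECONDS ago, in sorted order."""
    now = time.time() if now is None else now
    expired = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # st_atime is only meaningful if the file system records
            # access times, hence the warning about "noatime".
            if now - os.stat(path).st_atime > ttl_seconds:
                expired.append(path)
    return sorted(expired)
```

A caller would then delete the returned paths; a different policy could be
obtained by swapping the atime test for another predicate.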

> Personally, I'm thinking about GC as in, don't delete nar A if you want
> to keep nar B, and nar B references nar A. It's perfectly possible that
> someone could fetch nar B if you deleted nar A, but it's also possible
> that someone couldn't because of that missing substitute. Maybe I'm
> overthinking this though?

I think you are.  :-) ‘guix publish’ doesn’t do anything this fancy and
it works great.  The reason is that clients typically always ask for
both A and B, thus the atime of A is the same as that of B.
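
For comparison, the reference-aware policy described in the quoted text
amounts to keeping the closure of the nars one wants to retain.  A minimal
sketch in Python, assuming a hypothetical ‘references’ mapping from each item
to the items it refers to (in practice this information would come from the
narinfos); ‘closure’ and ‘deletable’ are illustrative names only:

```python
def closure(roots, references):
    """Return every item reachable from ROOTS through REFERENCES,
    a dict mapping an item to the items it references."""
    seen = set()
    stack = list(roots)
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        stack.extend(references.get(item, ()))
    return seen


def deletable(all_items, roots, references):
    """Items that can be removed without breaking anything kept in ROOTS."""
    return set(all_items) - closure(roots, references)
```

If B references A and B is kept, A is never reported as deletable, whatever
its atime.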

> The Cuirass + guix publish approach does something similar, because
> Cuirass creates GC roots that expire. guix gc wouldn't delete a store
> item if it's needed by something that's protected by a Cuirass created
> GC root.

Cuirass has a TTL on GC roots, which thus defines how long things remain
in the store; ‘publish’ has a TTL on nars, which defines how long nars
remain in its cache.  The two are disconnected in fact.

> Another complexity here that I didn't set out initially, is that there
> are places the Guix Build Coordinator makes decisions based on the
> belief that if its database says a build has succeeded for an output,
> that output will be available. If a build needed an output that had
> been successfully built but then deleted, I think the coordinator
> would get stuck forever trying that build, which would never start
> because of the missing store item. My thinking on this at the
> moment is maybe what you'd want to do is tell the Guix Build Coordinator
> that you've deleted a store item and it's truly missing, but that would
> complicate the setup to some degree.

I think you’d just end up rebuilding it in that case, no?

>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>> admits to being concerned about the reliability of code involving Fibers
>> and/or SQLite (which I can understand given his/our experience, although
>> I’m maybe less pessimistic).  What’s your experience, how do you feel
>> about it?
> The coordinator does use Fibers, plus a lot of different threads for
> different things.

Interesting, why are some things running in threads?

There also seems to be shared state in ‘create-work-queue’; why not use
message passing?

> Regarding reliability, it's hard to say really. Given I set out to build
> something that works across a (unreliable) network, I've built in
> reliability through making sure things retry upon failure among other
> things. I definitely haven't chased any blocked fibers, although there
> could be some of those lurking in the code; I might not have noticed
> because it sorts itself out eventually.

OK.  Most of the issues we see now with offloading and Cuirass are
things you can only experience with a huge store, a large number of
build machines, and a lot of concurrent derivation builds.  Perhaps you
are approaching this scale on your instance actually?

Mathieu experimented with the Coordinator on berlin.  It would be nice
to see how it behaved there.

> One of the problems I did have recently was that some hooks would just
> stop getting processed. Each type of hook has a thread, which just
> checked if there were any events to process every second, and processed
> any if there were. I'm not sure what was wrong, but I changed the code
> to be smarter, be triggered when new events are actually entered into
> the database, and poll every so often just in case. I haven't seen hooks
> get stuck since then, but what I'm trying to convey here is that I'm not
> quite sure how to track down issues that occur in specific threads.
> Another thing to mention here is that implementing support for
> PostgreSQL through Guile Squee is still a thing I have in mind, and that
> might be more appropriate for larger databases. It's still prone to the
> fibers blocking problem, but at least it's harder to cause Segfaults
> with Squee compared to SQLite.


I find it really nice to have metrics built in, but I share Mathieu’s
concern about complexity here.  If we’re already hitting scalability
issues with SQLite, then perhaps that’s a sign that metrics should be
handled separately.

Would it be an option to implement metrics gathering in a separate,
optional process, which would essentially subscribe to the relevant

