Re: Thoughts on building things for substitutes and the Guix Build Coord

guix-devel

From:	Christopher Baines
Subject:	Re: Thoughts on building things for substitutes and the Guix Build Coordinator
Date:	Wed, 18 Nov 2020 07:56:39 +0000
User-agent:	mu4e 1.4.13; emacs 27.1

Ludovic Courtès <ludo@gnu.org> writes:

> Christopher Baines <mail@cbaines.net> skribis:
>
>> Because you aren't copying the store items back in to a single store, or
>> serving substitutes from the store, you don't need to scale the store to
>> serve more substitutes. You've still got a bunch of nars + narinfos to
>> store, but I think that is an easier problem to tackle.
>
> Yes, this is good for the use case of providing substitutes and it would
> certainly help on a big build farm like berlin.
>
> I see a lot could be shared with (guix scripts publish) and (guix
> scripts substitute).  We should extract the relevant bits and move them
> to new modules explicitly meant for more general consumption.  I think
> it’s important to reduce duplication.

Yeah, that would be good.

>> Another feature supported by the Guix Build Coordinator is retries. If a
>> build fails, the Guix Build Coordinator can automatically retry it. In a
>> perfect world, everything would succeed first time, but because the
>> world isn't perfect, there still can be intermittent build
>> failures. Retrying failed builds even once can help reduce the chance
>> that a failure leads to no substitutes for that build as well as any
>> builds that depend on that output.
>
> That’s nice too; it’s one of the practical issues we have with Cuirass
> and that’s tempting to ignore because “hey it’s all functional!”, but
> then reality gets in the way.

One further benefit related to this is that if you want to manually
retry building a derivation, you just submit a new build for that
derivation.

The Guix Build Coordinator also has no concept of "Failed (dependency)";
it never gives up. This avoids the situation where spurious failures
block other builds.
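
To make the retry idea concrete, here is a minimal sketch in Python. It
is not the coordinator's actual code: `run_with_retries`, `build` and
`max_attempts` are hypothetical names chosen for illustration, and the
real coordinator schedules retries through its database rather than in
a loop.

```python
from typing import Callable, Tuple


def run_with_retries(build: Callable[[], bool],
                     max_attempts: int = 2) -> Tuple[bool, int]:
    """Call `build` until it succeeds or `max_attempts` is reached.

    Returns (succeeded, attempts_used).  `build` is a hypothetical
    callable that returns True on success and False on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        if build():
            return True, attempt

    # Every attempt failed; the caller is still free to submit the
    # derivation again later, which is the manual retry case.
    return False, max_attempts
```

With `max_attempts=2`, a build that fails once for an intermittent
reason still ends up producing a substitute.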

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you can't
>> use `guix gc` to get rid of old store entries/substitutes. I have some
>> ideas about what to implement to provide some kind of GC approach over a
>> bunch of nars + narinfos, but I haven't implemented anything yet.
>
> ‘guix publish’ has support for that via (guix cache), so if we could
> share code, that’d be great.

guix publish does time-based deletion, based on when the files were
first created, right? If that works for people, that's fine I guess.

Personally, I'm thinking about GC as in, don't delete nar A if you want
to keep nar B, and nar B references nar A. It's perfectly possible that
someone could fetch nar B if you deleted nar A, but it's also possible
that someone couldn't because of that missing substitute. Maybe I'm
overthinking this though?
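
A rough sketch of that reference-aware approach, in Python: nothing like
this is implemented yet, and `nars_to_keep`, `nars_to_delete` and the
`references` mapping are hypothetical names. The mapping is assumed to
have been built beforehand, for example from the References field of
each narinfo.

```python
from typing import Dict, Iterable, Set


def nars_to_keep(references: Dict[str, Set[str]],
                 roots: Iterable[str]) -> Set[str]:
    """Return the roots plus everything they reference, transitively."""
    keep: Set[str] = set()
    stack = list(roots)
    while stack:
        item = stack.pop()
        if item in keep:
            continue
        keep.add(item)
        # An item with no entry in the mapping is treated as having
        # no references.
        stack.extend(references.get(item, ()))
    return keep


def nars_to_delete(references: Dict[str, Set[str]],
                   roots: Iterable[str]) -> Set[str]:
    """Return the known nars that no root needs."""
    return set(references) - nars_to_keep(references, roots)
```

So if nar B references nar A and B is a root, A is kept as well, and
only nars outside the closure of the roots are candidates for deletion.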

The Cuirass + guix publish approach does something similar, because
Cuirass creates GC roots that expire. guix gc wouldn't delete a store
item if it's needed by something that's protected by a Cuirass-created
GC root.

Another complexity here that I didn't set out initially is that there
are places where the Guix Build Coordinator makes decisions based on the
belief that if its database says a build has succeeded for an output,
that output will be available. If a build needed an output that had
been successfully built but then deleted, I think the coordinator would
get stuck forever trying that build, which would never start because of
the missing store item. My thinking on this at the moment is that maybe
what you'd want to do is tell the Guix Build Coordinator that you've
deleted a store item and it's truly missing, but that would complicate
the setup to some degree.

> One option would be to populate /var/cache/guix/publish and to let ‘guix
> publish’ serve it from there.

That's probably pretty easy to do, although I haven't looked at the
details.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues. For some
>> reason or another, getting backtraces for exceptions rarely works. Most
>> of the time the coordinator tries to print a backtrace, the part of
>> Guile doing that raises an exception. I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious to
>> fix at least for me. Additionally, there are some places where I'm
>> fighting against bits of Guix, things like checking for substitutes
>> without caching, or substituting a derivation without starting to build
>> it.
>
> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
> admits to being concerned about the reliability of code involving Fibers
> and/or SQLite (which I can understand given his/our experience, although
> I’m maybe less pessimistic).  What’s your experience, how do you feel
> about it?

The coordinator does use Fibers, plus a lot of different threads for
different things.

Regarding reliability, it's hard to say really. Given I set out to build
something that works across an (unreliable) network, I've built in
reliability through making sure things retry upon failure, among other
things. I definitely haven't chased any blocked fibers, although there
could be some of those lurking in the code; I might not have noticed
because it sorts itself out eventually.

One of the problems I did have recently was that some hooks would just
stop getting processed. Each type of hook has a thread, which just
checked if there were any events to process every second, and processed
any if there were. I'm not sure what was wrong, but I changed the code
to be smarter: it is now triggered when new events are actually entered
into the database, and polls every so often just in case. I haven't seen hooks
get stuck since then, but what I'm trying to convey here is that I'm not
quite sure how to track down issues that occur in specific threads.
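
For illustration, the "trigger on new events, but still poll every so
often" pattern looks roughly like the following Python sketch. The
coordinator itself is written in Guile, so this is only an analogy:
`HookWorker`, `fetch_events`, `handle` and `poll_interval` are
hypothetical names, and `fetch_events` stands in for reading unprocessed
events from the database.

```python
import threading
from typing import Callable, Iterable


class HookWorker:
    """One thread per hook type: wakes on notify(), or after a timeout."""

    def __init__(self,
                 fetch_events: Callable[[], Iterable[str]],
                 handle: Callable[[str], None],
                 poll_interval: float = 30.0) -> None:
        self._fetch_events = fetch_events
        self._handle = handle
        self._poll_interval = poll_interval
        self._wakeup = threading.Event()
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def notify(self) -> None:
        """Call this right after a new event is written to the database."""
        self._wakeup.set()

    def stop(self) -> None:
        self._stopping = True
        self._wakeup.set()
        self._thread.join()

    def _run(self) -> None:
        while True:
            # Returns early when notify() is called; otherwise the
            # timeout acts as the "just in case" poll.
            self._wakeup.wait(timeout=self._poll_interval)
            # Clear before fetching, so a notify() that arrives during
            # processing causes another pass instead of being lost.
            self._wakeup.clear()
            for event in self._fetch_events():
                self._handle(event)
            if self._stopping:
                break
```

The periodic timeout means a missed notification only delays
processing rather than stalling it, which is the property I was after.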

Another thing to mention here is that implementing support for
PostgreSQL through Guile Squee is still a thing I have in mind, and that
might be more appropriate for larger databases. It's still prone to the
fibers blocking problem, but at least it's harder to cause segfaults
with Squee compared to SQLite.
