April update on / guix-build-coordinator

From: Christopher Baines
Subject: April update on / guix-build-coordinator
Date: Thu, 13 Apr 2023 19:32:54 +0100
I sent out the last update about a month ago in March [1].


## Numbers

The bordeaux build farm currently provides ~2.4 million nars, which take
up ~10TiB to store.

## Storage

There are two machines with all the nars, hatysa and bishan. On the plus
side, I recently installed a new hard drive in hatysa so it now has
plenty of storage for more nars.

However, space on bishan is running out: it now has less than 1TiB of
free space and will probably run out of space within the next 4 to 8

## Problems and bug fixes

In the past, I've seen the coordinator memory usage unexpectedly spike
and I haven't really understood why. This started happening again
recently, and I managed to make some progress on tracking it down. It
seems that the problem happens during calls to (backtrace). I was able
to reproduce this with the error handling for hooks, but I haven't been
able to reproduce it in a more standalone manner yet. I've opened a Guile
bug here [2], and for now, I've been trying to work around this issue by
removing backtrace calls from the coordinator. Obviously this isn't
ideal, but I'd also like to avoid this problem.


Another odd issue that I've been coming up against for a while is a
port encoding issue; I've filed a bug about that too [3]. I've been
working around this one by adding silent error handling to logging in
critical areas, so that things keep working even if the logging raises
an exception.
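
As an illustration of that workaround pattern, here is a minimal sketch in
Python (the coordinator itself is written in Guile, so this is not its actual
code). The names `log_safely` and `write` are hypothetical, chosen only for
this example: the wrapper calls whatever logging procedure it is given and
swallows any exception, so the surrounding critical code carries on.

```python
def log_safely(write, message):
    """Call write(message), swallowing any exception it raises.

    `write` is a hypothetical logging callable. Returns True if the
    message was written and False if logging itself failed.
    """
    try:
        write(message)
        return True
    except Exception:
        # Deliberately silent: a failure while logging (for example an
        # encoding error on the output port) must not abort the caller.
        return False


def ascii_only_writer(message):
    """Stand-in for a port that cannot encode some characters."""
    message.encode("ascii")  # raises UnicodeEncodeError on non-ASCII text


# The critical work continues whether or not the log line could be written.
logged = log_safely(ascii_only_writer, "built caf\u00e9")
```

The trade-off is that a log line can be lost without any trace, which is why
this is only a workaround until the underlying encoding bug is understood.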


I also investigated why there were problems substituting derivations on
the childhurds. This was broken by some timeout-related changes I made
that don't work on the Hurd. This is now fixed, and the error reporting
in that area is improved.

I recently spotted a crash in the build coordinator when building
anthy-9100h [4], which turned out to be due to a bug in Guile with
handling invalid Unicode when using suspendable ports. This is now fixed
upstream [5] and the relevant Guile package in Guix has this patch


I think I made some progress on the write_wait_fd errors I've been
seeing from the coordinator agents. Luckily Ludovic seems to have done
most of the work, so I was able to send a patch for guile-gnutls [6].


## Progress

The Git repositories for the guix-build-coordinator and nar-herder are
now on Savannah [7], which is great as it means other committers can now
easily push to these repositories.


The big new feature I've been working on is support for listening for
events from the coordinator. This only became possible recently, with
support for streaming responses in the guile-fibers web server. While
the build coordinator isn't intended as a web service, it does make use
of HTTP for talking to clients and agents. I've followed the standard
for server-sent events [8] for this new functionality.
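
For readers unfamiliar with the server-sent events wire format, here is a
rough sketch in Python of how a single event is serialised and parsed. This is
not the coordinator's code (which is Guile), and the event name
`build-started` and the JSON payload are made up for illustration. The sketch
handles only the `event` and `data` fields with LF line endings, and ignores
`id` and `retry`.

```python
def format_sse(data, event=None):
    """Serialise one server-sent event as text for a streamed response."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def parse_sse(stream_text):
    """Parse a complete event-stream body into (event, data) tuples."""
    events = []
    event_name, data_lines = "message", []
    for line in stream_text.split("\n"):
        if line == "":
            # Blank line: dispatch the event if it carried any data.
            if data_lines:
                events.append((event_name, "\n".join(data_lines)))
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is not part of the value
            if field == "event":
                event_name = value
            elif field == "data":
                data_lines.append(value)
    return events


# Hypothetical event name and payload, purely for illustration.
body = (
    format_sse('{"build": "abc"}', event="build-started")
    + ": keep-alive\n\n"
    + format_sse("line1\nline2")
)
parsed = parse_sse(body)
```

Events without an explicit `event` field default to the type `message`, and
comment lines starting with a colon are skipped, so they can be used to keep
an otherwise idle connection open.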


The client interface for the coordinator isn't exposed since there's no
authentication mechanism. However, I've also got a prototype for a web
frontend [9] for the build farm up and running. This does expose the
events stream, along with the state information that's needed to make
sense of this. The result so far is this activity page [10], which shows
information about the agents and the builds allocated to them. It's
still a very rough prototype though, and there's more work needed to
make it reliable and include more information to make it more useful.


I've also made some improvements to the build coordinator in terms of
cancelling builds, the combined post publish hook (to enable validating
the availability of referenced nars), parallel hook processing in the
coordinator and prioritising post build actions in the agents (which
helps mitigate congested uploading).

## Next steps

The bishan storage problem is getting ever more pressing, and there's
still a need to come up with a plan, ideally one which reduces the amount
of hardware I'm personally renting.

I still need to do a bit more work on validating nar reference
availability when asking the nar-herder to import nars.

Now that the bordeaux build farm is being used for QA, there will be
some nars that no longer need to be kept (as they correspond to some
derivation that didn't end up on the master branch). It would be good to
start automatically removing these to free up space.

As above, I think a good start has been made on making the build
coordinator behind bordeaux more observable, but there's still lots of
room for improvement with that.

I also have some sysadmin things to do. The Overdrive (ARM) machine
monokuma has some btrfs issue with its drive, so I need to reinstall
Guix on it to get it back working. I also have a RISC-V board that I've
had for ages, and should get connected up to the build farm to start
building things.

If you're interested in working on any of this, do let me know, and let
me know if you have any comments or questions!



