guix-devel

Re: Postmortem of service downtime


From: Jay Sulzberger
Subject: Re: Postmortem of service downtime
Date: Thu, 23 May 2024 17:41:36 +0000


On Thu, 23 May 2024, Ludovic Courtès <ludo@gnu.org> wrote:

From Sunday May 19th to Tuesday May 21st, for about 36h,
bayfront.guix.gnu.org, the machine behind many services, was down:

 https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html

Affected web sites and services included:

 guix.gnu.org
 bordeaux.guix.gnu.org
 logs.guix.gnu.org
 hpc.guix.info
 foundation.guix.info
 packages.guix.gnu.org
 qa.guix.gnu.org

Here’s the series of events that led to this:

 • The machine had not been rebooted for 7 months and needed to be
   rebooted to run a newer version of Shepherd (it was on 0.10.2, which
   had a bug regarding replacements that is fixed in newer versions:
   <https://issues.guix.gnu.org/67839>).

 • The machine did not reboot.  There’s no IPMI (this fully free system
   we acquired some years ago did not support it), so all we have is a
   remote-controlled power controller that allows us to turn it on and
   off.  This had no effect though: the machine didn’t come back.

   Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
   in the data center where bayfront is, are looking into setting up
   serial console access to the machine for us.

 • We (Andreas and myself) scheduled an intervention in the data center
   where it is, in Bordeaux (France), and could only get there on
   Tuesday morning.

 • The machine was failing to boot because of an error in the Shepherd
   config (unbound variable), now fixed:

     https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3

   The underlying problem is that an error in a non-essential service
   would prevent the machine from booting.  This issue is being tracked
   here:

     https://issues.guix.gnu.org/71144

   Such errors can be detected by testing the config in ‘guix system
   vm’, at the cost of extra time for sysadmins.

 • Pulling and reconfiguring the machine was extremely slow.  This is
   in part due to spinning disks, and in part due to the fact that we
   had to pull the right commit that would allow us to not rebuild
   Linux-libre locally (substitutes for the latest upgrade, from
   Monday, were unavailable; also we had to pass
   --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
   default https://bordeaux.guix.gnu.org, which was unavailable).

   A large part of the slowness was due to ‘guix substitute’ reading
   all the 300K+ entries from /var/guix/substitute/cache and deleting
   them, one by one (this took several minutes).  Chris had mentioned
   that performance issue in the past; it’s not much of a problem on
   one’s laptop with an SSD, but it’s clearly a problem here where
   there are more entries than usual.  We should at least drastically
   reduce the TTL of cache entries.

 • qa-frontpage failed to build when we first reconfigured the machine,
   so we commented it out.  This is now fixed:

     https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe

 • guix-packages-website (the server behind packages.guix.gnu.org)
   still refuses to start with an Artanis error:

     https://issues.guix.gnu.org/71138
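
The cache-expiry idea from the ‘guix substitute’ item above could be
sketched roughly as follows.  This is an illustration in Python, not
Guix code: the directory layout, the use of file modification time as
the age of an entry, and the names prune_cache and ttl_seconds are all
assumptions made for the example.

```python
import os
import time


def prune_cache(cache_dir, ttl_seconds, now=None):
    """Delete files under cache_dir last modified more than ttl_seconds ago.

    Returns the number of files removed.  Hypothetical helper: it assumes
    an entry's age is its modification time.
    """
    now = time.time() if now is None else now
    removed = 0
    pending = [cache_dir]  # directories still to visit
    while pending:
        current = pending.pop()
        expired = []
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    mtime = entry.stat(follow_symlinks=False).st_mtime
                    if now - mtime > ttl_seconds:
                        expired.append(entry.path)
        # Unlink only after the directory scan has finished.
        for path in expired:
            os.unlink(path)
            removed += 1
    return removed
```

For example, prune_cache("/tmp/example-cache", 24 * 3600) would drop
entries older than a day from a hypothetical test directory.  With a
shorter TTL, fewer entries pile up between runs, so each pass has less
to read and delete.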

Ludo’, on behalf of the emergency rescue^W^W sysadmin team.


Dear Ludo and Team, thank you for the report!

oo--JS.
