bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#51787: GC takes more than 9 hours on berlin


From: Christopher Baines
Subject: bug#51787: GC takes more than 9 hours on berlin
Date: Thu, 25 Nov 2021 13:24:17 +0000
User-agent: mu4e 1.6.6; emacs 27.2

Ludovic Courtès <ludo@gnu.org> writes:

>   • When discussing together with Chris Baines yesterday on IRC, we
>     found that the sqlite WAL file was 8 GiB.  I later ran:
>
>       PRAGMA wal_checkpoint(TRUNCATE);
>
>     which emptied it immediately.  However, GC time wasn’t particularly
>     different today.

So, as I understand it, the WAL is made up of pages, and checking for
this db, I think they're the current default size of 4096 bytes.

sqlite> PRAGMA page_size;
4096

From looking at the code, the wal_autocheckpoint value is set to 40000:

    /* Increase the auto-checkpoint interval to 40000 pages.  This
       seems enough to ensure that instantiating the NixOS system
       derivation is done in a single fsync(). */
    if (mode == "wal" && sqlite3_exec(db, "pragma wal_autocheckpoint = 40000;", 
0, 0, 0) != SQLITE_OK)
        throwSQLiteError(db, "setting autocheckpoint interval");

https://git.savannah.gnu.org/cgit/guix.git/tree/nix/libstore/local-store.cc#n253

This means you'd expect the WAL to be in the region of 40000*4096 bytes,
or ~160MB. Assuming the autocheckpointing is keeping up... it doesn't
look to be, since the file is now much larger than this.

As described here [1], the automatic checkpoints are PASSIVE ones, which
has the advantage of not interrupting any readers or writers, but can
also do nothing if it's being blocked by readers or writers.

1: https://www.sqlite.org/wal.html#application_initiated_checkpoints

What I've found while writing the Guix Build Coordinator, is that when
the service is busy (usually new builds being submitted, plus lots of
builds happening), the PASSIVE checkpoints aren't sufficient. To
supplement them, there's a regular check that looks at the size of the
WAL file, and runs a TRUNCATE checkpoint, which is a FULL checkpoint
(which blocks new writers), plus truncating the WAL file once it's
finished checkpointing the entire WAL. The truncating is mostly so that
it's easier to monitor the size of the WAL, by checking the size of the
file.

I feel like I need to defend SQLite at this point. Tuning the
configuration of a database to get acceptable performance is the norm, I
had to tune the PostgreSQL configuration for data.guix.gnu.org to
improve the performance. It's easier to get in to trouble with SQLite
because it's a lower level too, and requires you to actually initiate
things like checkpoints or periodic optimisation if you want them to
happen.

Unfortunately, I don't know enough about the internals of the daemon to
say anything specific though.

>   • ‘db.sqlite’ weighs in at 19 GiB (!) so perhaps there’s something to
>     do, like the “VACUUM” thing maybe.  Chris?

Doing a VACCUM might address some fragmentation and improve performance,
it's probably worth trying.

>   • Stracing the session’s guix-daemon process during GC suggests that
>     most of the time goes into I/O from ‘db.sqlite’.  It’s not
>     surprising because that GC phase is basically about browsing the
>     database, but it does seem to take a little too long for each store
>     item.

At least the way I've approached finding and fixing the poor performance
issues in the Guix Build Coordinator is through adding instrumentation,
so just recording the time that calling particular procedures takes, and
then logging if it's longer than some threshold.

Since this issue is about Cuirass, there's also the possibility of
avoiding the problems of a large store, by avoiding having a large
store. That's what bordeaux.guix.gnu.org does, and I thought it was part
of the plan for ci.guix.gnu.org (at least around a year ago)?

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]