[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

File search progress: database review and question on triggers

From: Pierre Neidhardt
Subject: File search progress: database review and question on triggers
Date: Mon, 10 Aug 2020 16:32:08 +0200


After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.

If we want to be able to know which package to install, we need file search to
be able to work for packages that have not yet been installed nor built
on the system.

As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
    name        text not null,
    output      text default "out",
    system      text not null,
    path        text primary key not null, -- store path, e.g. 
    version     text not null,
    guix        text not null  -- The Guix version in which the package
    can be found.

create table if not exists Files (
    subpath     text not null,
    package  text not null,
    primary key (subpath, package), -- Same subpath can occur in multiple 
    foreign key (package) references Packages(path) on delete cascade
--8<---------------cut here---------------end--------------->8---

   I'm not very good with SQL, so thanks in advance for reviewing this
   carefully; let me know if we can do better.

2. A procedure that persists the filepaths of a given package in the database.

3. Size of the database:
   I've persisted all locally-present store items for my current Guix version
   and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.

   But since we can have multiple Guix versions, this means that the
   packages have one entry per store path, so we might end up with more
   entries than that as the number of Guix generations grows.

   The worse case is around (number of guix generations) x ~100 MiB.

   If we compress, it would be at least 10x less, maybe way less.

   To be sustainable, I suggest that when we remove a Guix generation we
   "garbage-collect" the corresponding database entries.


4. Indexing speed:
   The above items took some 20 minutes to complete (on my rather powerful 
   A single store path takes a fraction of a second to index (on an SSD).
   The storage device is the bottleneck here.  Not sure we can do better than
   the following procedure:

--8<---------------cut here---------------start------------->8---
(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.

It's important that the first character be the directory separator because it
gives more expressive power for search.  For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of a
  ;; second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
    (ftw path
         (lambda (filename statinfo flag)
           (when (eq? flag 'regular)
             (set! file-list (cons (string-drop filename (string-length path))
                                   file-list))) #t))
--8<---------------cut here---------------end--------------->8---

   Most of the indexing will be done by the substitute servers however, so this 
   of little concern for the end user.

   Question: Should we include empty directories in the database?  I'm tempted
   to answer no.

5. Search speed: It completes in a fraction of a second and supports
   SQLite patterns.  Example:

--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3        /lib/
--8<---------------cut here---------------end--------------->8---

   Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
   It seems powerful enough in practice.  But maybe we can use regular
   expression in SQLite as well?

Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user performs
   a file search, Guix asks the substitute server for a database update.  Only
   the diff should be sent over the network, not the whole thing since it might
   be very large.

   Question 1: If the substitute server does not have data corresponding to the
   Guix server of the user, shall we send data of the version that's the closest
   to that of the user?
   Locally, if there are not many entries for the current Guix version, but many
   for an older version, shall perform the search on the older version as well?
   Multiple older versions?

   Question 2: How do we send "SQLite diffs"?  I know xdelta for binary diffs,
   but I don't think that what we want to do here.  Or else can we export the 
   SELECT result as text and send it compressed over the network?

I'll send a patch soon, in the meantime let me know about your thoughts!


Pierre Neidhardt

Attachment: signature.asc
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]