guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] git-download: Speed up 'git-predicate'.


From: Christopher Baines
Subject: Re: [PATCH] git-download: Speed up 'git-predicate'.
Date: Thu, 8 Jun 2017 21:43:31 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 07/06/17 13:52, Ludovic Courtès wrote:
> Christopher Baines <address@hidden> skribis:
> 
>> Adjust 'git-predicate' to use data structures that perform better when used
>> with git repositories with a large number of files.
>>
>> Previously when matching either a regular file or directory, 'git-predicate'
>> would search a list with a length equal to the number of files in the
>> repository. As a search operation happens for roughly every file in the
>> repository, this meant that the time taken to use 'git-predicate' to traverse
>> all the files in a repository was roughly exponential with respect to the
>> number of files in the repository.
>>
>> Now, for matching regular files or symlinks, 'git-predicate' uses a vhash
>> using the inode value as the key. This should perform roughly in constant
>> amount of time, instead of linear with respect to the number of files in the
>> repository.
>>
>> For matching directories, 'git-predicate' now uses a tree structure stored in
>> association lists. To check if a directory is in the tree, the tree is
>> traversed from the root. The time complexity of this depends on the shape of
>> the tree, but it should be an improvement on searching through the list of 
>> all
>> files.
> 
> Great, more than welcome it seems.  :-)
> 
> Do you know how much the inode optimization vs. the tree structure
> contributes to the performance.

I'll find out the actual times for without both optimizations, and with
just one of them and get back to you.

>> * guix/git-download.scm (git-predicate): Use different data structures to
>>   speed up 'git-predicate' with a large number of files.
> 
> [...]
> 
>> +  (define (create-directory-tree files)
>> +    (define (directory-lists->tree directory-lists)
>> +      (map (lambda (top-level-dir)
>> +             (cons top-level-dir
>> +                   (directory-lists->tree
>> +                    (filter-map
>> +                     (lambda (directory-list)
>> +                       (if (eq? (length directory-list) 1)
>> +                           #f
>> +                           (cdr directory-list)))
>> +                     ;; Find all the directory lists under this 
>> top-level-dir
>> +                     (filter
>> +                      (lambda (directory-list)
>> +                        (equal? (car directory-list)
>> +                                top-level-dir))
>> +                      directory-lists)))))
>> +           (delete-duplicates
>> +            (map car directory-lists))))
>> +
>> +    (directory-lists->tree
>> +     (filter-map (lambda (path)
>> +                   (let ((split-path (string-split path #\/)))
>> +                     ;; If this is a file in the top of the repository?
>> +                     (if (eq? (length split-path) 1)
>> +                         #f
>> +                         ;; drop-right to remove the filename, as it's
>> +                         ;; just the directory tree that's important
>> +                         (drop-right (string-split path #\/) 1))))
>> +                 files)))
>> +
>> +  (define (directory-in-tree? directory tree)
>> +    (define (directory-list-in-tree? directory-list tree)
>> +      (if (eq? (length directory-list) 1)
>> +          (list? (member (car directory-list)
>> +                         (map car tree)))
>> +          (and=> (find (match-lambda
>> +                         ((top-level-dir . subtree)
>> +                          (equal? top-level-dir
>> +                                  (car directory-list))))
>> +                       tree)
>> +                 (match-lambda
>> +                   ((top-level-dir . subtree)
>> +                    (directory-list-in-tree? (cdr directory-list)
>> +                                             subtree))))))
>> +
>> +    (directory-list-in-tree? (string-split directory #\/)
>> +                             tree))
> 
> Note that ‘length’ and ‘list?’ are O(n).  I think ‘directory-in-tree?’
> should be written using ‘match’, which would avoid that altogether.
> Likewise, the (map car …) call for ‘match’.  :-)

That's interesting about length and list, I'll give using match a go and
send another patch.

> I find the tree implementation hard to grasp.  Perhaps it would help to
> move it outside of the ‘git-predicate’ function and perhaps decompose it
> a bit more?  Thoughts?

Yeah, the create-directory-tree and directory-in-tree? functions are
both general purpose, would you suggest this module, or somewhere else?

>> +         (inodes-vhash   (alist->vhash
>> +                          (map
>> +                           (lambda (file)
>> +                             (let ((stat
>> +                                    (lstat (string-append directory "/" 
>> file))))
>> +                               (cons (stat:ino stat) (stat:dev stat))))
>> +                           files)))
> 
> I would call it ‘inodes’ simply.  Also, we could use ‘list->set’ from
> (guix sets) here.

Ah, cool, I'll have a look at sending a updated patch shortly.


Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]