guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Inverted index to accelerate guix package search


From: Bengt Richter
Subject: Re: Inverted index to accelerate guix package search
Date: Mon, 13 Jan 2020 19:53:06 -0800
User-agent: Mutt/1.12.2 (2019-09-21)

Hi simoun,  Guix,

On +2020-01-13 18:54:18 +0100, zimoun wrote:
> Hi Arun,
> 
> For sure, it is a good direction... but it is not that simple. :-)
> 
> 
> On Mon, 13 Jan 2020 at 16:08, Arun Isaac <address@hidden> wrote:
> 
> > > Those measures don't seem precise enough to draw a good conclusion.
> > > Could you increase the sample size (or maybe just loop?) so that all
> > > times reach over a second or so?
> >
> > Indeed, my bad! Here are better timing results. I have repeated each of
> > the searches a 1000 times, and I'm getting around 90x speedup! :-)
> >
> > Building index
> > clock utime stime cutime cstime gctime
> >  9.76 12.50  0.18   0.00   0.00   6.16
> >
> > Brute force search (repeated 1000 times)
> > clock utime stime cutime cstime gctime
> > 195.95 264.13  0.90   0.00   0.00 142.62
> 
> `fold-packages` traverses all the packages so yes it is "slow".
> 
> 
> > Inverted index search (repeated 1000 times)
> > clock utime stime cutime cstime gctime
> >  2.18  2.48  0.00   0.00   0.00   0.60
> 
> Yes, looking up into an hash table is really really faster. :-)
>

If you want to instrument particular calls, as you know, you can probably
get sub-microsecond resolution (try the following snippet on your platform
to see how long it takes to get a gettimeofday value -- even on this
Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
calls per microsecond).

At that kind of resolution, timing at various milestones in your program will
show interesting regularities and outliers. You may even detect timing subsets
caused by such things as the CPU taking a shortcut when multiplying by zero.

If you time a thousand calls of A in a loop followed by the same for B,
and then do a loop that calls them in succession inside the loop, you will
most likely discover cache or other resource competitions between A and B.

My point is that benchmarking meaningfully is subtle, and it is easy to
be fooled by statistical summary measures of distributions that are really
combinations of separate distributions (that are easily, sometimes beautifully,
seen in scattergrams).


--8<---------------cut here---------------start------------->8---
#!/usr/bin/guile -s
!#
;;;; ttime -- see how fast gettimeofday is
;;;; 2017-08-24 20:44:21 -- bokr AT bokr DOT com

(use-modules (ice-9 pretty-print))
(use-modules (ice-9 format))

(define loop-count 0)
(catch #t (lambda () (set! loop-count (string->number (cadr (command-line)))))
          (lambda ( k . rest )  (begin 
          (display "\nFirst arg should be integer loop count\n")
          (display "Suggested usage from bash: ./ttime 500|uniq -c\n")
          (display "to see calls/useq\n\n")
          (exit))))
(define start-loop (gettimeofday))
(define   end-loop start-loop)

(display
    (let lp ((x loop-count) (tl `()))
        (if (positive? x) 
            (lp (- x 1) (cons (gettimeofday) tl))
            (begin
                (set! end-loop (gettimeofday))
            (format #f "~y" tl)))))
(format #t "start loop: ~a\n" start-loop)
(format #t "  end loop: ~a\n" end-loop)
(format #t "  end disp: ~a\n" (gettimeofday))
(newline)
--8<---------------cut here---------------end--------------->8---
(pls excuse the odd use of catch -- I was playing around,
meaning later to catch other errors like bad count etc., but never
got a round tuit)
So just chmod 755 ttime and then ./ttime 500|uniq -c
as it will suggest if you run it without args.

> 
> 
> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
> 
> 
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree
> 
> 
> Well, the issue is 'scoring' a query. For example, try:
> 
>   guix search youtube | recsel -p name,relevance
> 
> --8<---------------cut here---------------start------------->8---
> name: youtube-dl-gui
> relevance: 15
> 
> name: youtube-viewer
> relevance: 11
> 
> name: youtube-dl
> relevance: 11
> 
> name: mps-youtube
> relevance: 11
> 
> name: emacs-youtube-dl
> relevance: 11
> --8<---------------cut here---------------end--------------->8---
> 
> Why the package youtube-dl-gui is more relevant? Because the function
> that computes the relevance score not enough "smart". The function
> guix/ui.scm(relevance) is too simple: the score is the number of
> ocurences of the pattern (weighted by the field where the pattern
> matches). Therefore, if you compare "guix show" the packages
> "youtube-dl-gui" and "youtube-dl", you will see that the term
> "youtube" (case-insentive) appears 5 times for youtube-dl-gui against
> only 3 times for youtube-dl. Does the difference (5 vs 3) provide more
> information? I do not think so... The issue is: 1. the description is
> not well-written (for the current scoring system) and 2. the scoring
> function is too simple.
> 
> I have tried to described such issues/views here [3] [4] and since I
> have not read so much about recommender systems: state-of-the-art, NLP
> tools in Guile, etc.
> 
> [3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html

>From [3]:
┌────────────────────────────────────────────────────────────────────────┐
│ From my opinion, text mining or search engine strategies seem a better │
│ approach to improve the `guix search` than rigidify the filename tree. │
│ And some data mining to see how the packages cluster (depending on the │
│ metric) should be helpful to first understand how to improve `guix     │
│ search`.                                                               │
│ I do not know if my words make sense.                                  │
└────────────────────────────────────────────────────────────────────────┘
They make sense to me :)

> [4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html
More stuff, skimmed, but mostly LGTM.

I would note that librarians have been helping people find relevant
information for centuries, so IWT there is something to learn from them?
Maybe not a Dewey Decimal System book classification for guix, but I am
not sure it's good to exclude tagging ideas altogether.

Another aspect of search is _who_ is doing the searching. A guix developer
wanting to find implementation examples for bytecode jit is different
from a newbie struggling to form an overview of what makes his app
stutter.

I note that papers and dissertations etc often contain prefatory
paragraphs indicating intended audience, and noting parts that are
only for specialists vs no-previous-knowledge-required.

If synopses and such contained an audience hint in a standard syntax,
might that serve as a basis for on-the-fly derivation of a search filter,
maybe with a --audience guix-dev,specialist:guile:cps option?

For object-oriented oops goops stuff, where you might like to find
a collection of methods that might serve as mixin methods for an
object of your own, how could you search for that?

(think provides/ensures/requires or other type contraints ??)

> 
> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.
> 
> Then, in the function
> "guix/scripts/package.scm(find-packages-by-description)", you could
> try to replace the `fold-packages' by something using the inverted
> index. (Note that the name `find-packages-by-descripton' is odd
> because it uses synopsis, name, etc. too.).
> 
> 
> Hope that helps.
> 
> Chhers,
> simon
> 


HTH in some way :)

-- 
Regards,
Bengt Richter



reply via email to

[Prev in Thread] Current Thread [Next in Thread]