Re: [ELPA] New package: find-dups


From: Robert Weiner
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 22:23:00 -0400

On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <address@hidden> wrote:
> Robert Weiner <address@hidden> writes:
>
> > This seems incredibly complicated.  It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need.  It certainly is not a generic duplicate
> > removal library.  Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed or use hashing as mentioned
> > in the commentary?

> I guess the name is misleading; I'll try to find a better one.

Sounds good.  How about filter-set?  You are filtering a bunch of items to produce a set.  I'm not sure if this is limited to files or more generic.

> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy.  If you calculate hashes of all of those 10000 files, it
> will take hours.

Ok, so you want to filter down a set of hierarchically arranged files.

> It's wiser to do it in steps: first, look at the sizes of all the
> files.  That's a very fast test, and files with equal contents have
> the same size.  You can discard all files with unique sizes.

Yes, but that is just filtering (get the size of each file and filter down to the sets of files that share a size).  Then you chain more filters to filter further:

   (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:

   (filterN ... (filter2 (filter1 list-of-files-to-filter)))
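
A minimal sketch of that chaining in Emacs Lisp (the name
filter-duplicates and this filter protocol are my illustration, not
necessarily the package's actual interface):

   ;; Hypothetical sketch: each filter takes a list of candidates and
   ;; returns the subset that may still contain duplicates; the
   ;; filters are applied left to right.
   (require 'seq)

   (defun filter-duplicates (filters items)
     "Thread ITEMS through each function in FILTERS, in order."
     (seq-reduce (lambda (remaining filter)
                   (funcall filter remaining))
                 filters
                 items))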

> In a second step, we have fewer files left.  We could look at the
> first N bytes of the files.  That's still quite fast.

So you apply your fastest and most effective filters first.

> Left are groups of files with equal sizes and equal heads.  For
> those it's worth calculating a hash sum to see which also have equal
> contents.

Ok.
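
To make the three steps concrete, here is one way such a stage could
look in Emacs Lisp.  This is only my sketch of the idea under
discussion, not find-dups' code, and every my-* name is made up:

   ;; One grouping stage: keep only the items whose KEY-FN value
   ;; occurs more than once in ITEMS.
   (defun my-dup-stage (key-fn items)
     (let ((table (make-hash-table :test #'equal))
           (result '()))
       (dolist (item items)
         (push item (gethash (funcall key-fn item) table)))
       (maphash (lambda (_key group)
                  (when (cdr group)        ; key shared by >1 item
                    (setq result (nconc group result))))
                table)
       result))

   ;; Possible key functions for the file case:
   (defun my-file-size (file)
     (file-attribute-size (file-attributes file)))

   (defun my-file-head (file)              ; first 1024 bytes
     (with-temp-buffer
       (insert-file-contents-literally file nil 0 1024)
       (buffer-string)))

   (defun my-file-hash (file)
     (with-temp-buffer
       (insert-file-contents-literally file)
       (secure-hash 'sha256 (current-buffer))))

   ;; Cheapest test first, most expensive last:
   ;; (my-dup-stage #'my-file-hash
   ;;   (my-dup-stage #'my-file-head
   ;;     (my-dup-stage #'my-file-size files)))
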
> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.

But as the prior message author noted, you don't need lists of lists to do that.  We want you to simplify things so they are most generally useful and easier to understand.
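
For instance (again just my sketch, reusing my-dup-stage and seq from
above), a flat list plus a list of key functions seems to be enough to
get that abstraction:

   ;; Hypothetical: narrow ITEMS down to probable duplicates by
   ;; grouping on each key function in turn; the elements can be
   ;; files, strings, or anything else.
   (defun my-find-dups (key-fns items)
     (seq-reduce (lambda (remaining key-fn)
                   (my-dup-stage key-fn remaining))
                 key-fns
                 items))

   ;; E.g. (my-find-dups
   ;;        (list #'my-file-size #'my-file-head #'my-file-hash)
   ;;        files)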

> and `find-dups' executes the algorithm with the steps as specified.
> You just need to specify a number of tests but don't need to write
> out the code yourself.

I don't quite see what code is not being written, except the sequencing of the filter applications, which is your code.

> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works?  I had hoped the example in
> the header was a good explanation...

The example is a good one to use but, as was noted, it is only one use case.  Keep at it and you'll see it will become something much nicer.

Bob
