Re: Feature request: testline(tl) (RFC)
From: V.Krishn
Subject: Re: Feature request: testline(tl) (RFC)
Date: Wed, 10 Dec 2014 05:38:08 +0530
User-agent: KMail/1.13.7 (Linux/3.9.6-64; KDE/4.8.4; x86_64; ; )
> On 09/12/14 22:20, V.Krishn wrote:
> > Hi,
> >
> > I was reading about Bloom filters
> > and came upon this example:
> >
> > http://troydhanson.github.io/misc/bloom.html
> > ------
> > The bf test program
> >
> > The program bf.c implements a Bloom filter. It can be used like,
> >
> > ./bf -n 16 members.txt test.txt
> >
> > Where the lines of members.txt are the true set members and the lines of
> > test.txt will be tested for membership. Varying n shows how the error
> > rate increases with smaller values of n.
> > ------
> >
> > Source: https://github.com/troydhanson/misc
> > code:
> > https://raw.githubusercontent.com/troydhanson/misc/master/compression/bloom/bf.c
> >
> > REQUEST:
> > Wondering if a simple implementation to test lines could be added to
> > coreutils.
> > Features:
> > 1. report if some lines are missing (option to print)
> > 2. option to print found lines
> > 3. option to print missing lines
> > 4. ....more logic possible...
> >
> > -------------
> > Presently, I can achieve the same with a simple shell script that calls
> > grep on each line, or by using `comm`.
> > But I believe a method using a Bloom filter should be faster and would
> > result in a unique and useful tool.
> >
> > Please ignore this, or point me to it, if a similar util already exists.
>
> Maybe we should keep the existing interfaces of grep, uniq, comm etc.
> and use a bloom filter _internally_ if appropriate.
>
Such internal use should be an explicit option in tools like grep, uniq,
comm, etc., and not enabled by default.
e.g. a hypothetical option: comm --use-bloom <bloom options>
Reason: using hashes has its own pros and cons, and users should not be
surprised by it being the default.
--
Regards.
V.Krishn