From: Assaf Gordon
Subject: bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed)
Date: Tue, 10 Jul 2018 12:21:14 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

tag 32099 notabug
severity 32099 wishlist
close 32099
stop

Hello Kingsley,

On 09/07/18 10:29 PM, Kingsley G. Morse Jr. wrote:
> And I like your benchmarks.
>
> Mine are humbly attached.
>
> They compare sorting to hashing.
>
> At the moment, hashing seems to me to be about 70%
> faster.
>
> And to scale better.
>
> I'd still like uniq to be faster and free from
> sort.

Thank you for taking the time to create and provide these,
including the scripts to reproduce them.

I've marked this item as "closed / wishlist" - but only because it
is technically not a bug report. Discussion is welcome to continue
by replying to this thread.

Being "faster" is important, but it is not the only criterion (as
mentioned in the examples in the previous email).

Your measurements are a good start towards stronger support for a
hash-based uniq implementation (a working patch, by the way, would be
even stronger).

I would suggest several improvements to the measurements, to ensure
relevant information is conveyed:

1.
Specify exactly which versions of awk/sort you are using
(whether they come from your distribution or were compiled from
source; if compiled, which compiler and optimization flags).
Also note which CPU is being used, how much memory the computer has, etc.
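
For example, something like this captures most of that information
(a sketch, assuming a GNU/Linux system):

  sort --version | head -n1
  awk --version | head -n1      # gawk; mawk uses "awk -W version"
  lscpu | grep 'Model name'     # CPU model
  free -h                       # total memory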

2.
I would suggest adding "sort | uniq" to the charts, because that is the
real baseline you are trying to improve.
I would also consider adding "datamash rmdup" to the charts, because its
hash-based implementation could serve as a ballpark indication of how
a naive hash implementation would perform in uniq.
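
For instance (a sketch; the exact invocations depend on your benchmark
script, and I'm assuming the test lines contain no field separators,
so field 1 is effectively the whole line):

  sort input | uniq > /dev/null          # the baseline to beat
  datamash rmdup 1 < input > /dev/null   # hash-based de-duplication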

3.
Your input is very small (up to 90K lines of 10 characters each) - that
does not seem to stress-test any significant part of the algorithms
(and for a typical user, waiting 0.05 seconds vs 0.15 seconds does not
matter).

I would recommend much larger input files, with much longer lines -
trying to simulate various real-world scenarios.

Here's a project from someone who tried to compare various hashing
functions: https://github.com/jhunt/hash/
Under the "corpus" directory you'll find a few demo data files.
Even the largest one (qnames) is only 84MB - not very big.

Perhaps consider duplicating the content a few times, then testing on that:
  for i in $(seq 1 5); do cat qnames; done > input

To get even closer to real-world input, try several input files
with different ratios of duplicated lines (your script creates completely
random data, which is not very similar to real-world input files);
see the sketch at the end of this point for one way to generate such inputs.

For example, what happens when many lines share similar sub-strings?
Depending on the hashing implementation, this could lead to lower
performance.
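
As a sketch of generating large inputs with different duplication
ratios (assuming GNU shuf and a word list at /usr/share/dict/words):

  # ~10M lines drawn with repetition from a small pool: many duplicates
  shuf -r -n 10000000 /usr/share/dict/words > input-many-dups
  # ~10M mostly-unique random lines
  awk 'BEGIN { srand(); for (i = 0; i < 10000000; i++) print rand() }' \
      > input-few-dups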

4.
It would be beneficial to see what the memory requirements and limits
of a hash-based implementation are. Currently, because 'sort' is not
limited by available RAM, there is effectively no memory limit (though
falling back to disk I/O might make everything slow).
For example, is it possible to hash an 800MB file on a machine with
only 1GB of RAM? Without additional code, hashing files larger than
available memory is simply not possible - a very big disadvantage.
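
As a rough sketch (assuming GNU time is installed as /usr/bin/time,
which reports peak resident set size with "%M"):

  /usr/bin/time -f 'peak RSS: %M KB' sort -u input > /dev/null
  /usr/bin/time -f 'peak RSS: %M KB' awk '!seen[$0]++' input > /dev/null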

5.
To compare "sort" fairly against other options,
I would add measurements using "--parallel=1" and later "--parallel=N"
(N being the number of CPU cores you have).
Does that improve performance?

The "awk" (or datamash) programs do not have a built-in memory limit,
while sort does. It would be useful to specify a large amount of
memory for sort (using "-S") to ensure it does not resort to disk I/O.
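
For example (a sketch; adjust the "-S" buffer size to your machine):

  sort --parallel=1 -S 4G input | uniq > /dev/null
  sort --parallel=$(nproc) -S 4G input | uniq > /dev/null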

6.
The reported time in your script is the elapsed time (aka "wall time").
A more relevant measure would be "user time" + "kernel time"
(options %U and %S in "time -f") - the wall time can be affected by
factors that don't directly relate to hashing/sorting performance.
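
For example, with GNU time (a sketch):

  /usr/bin/time -f 'user: %U s  sys: %S s  wall: %e s' sort input > /dev/null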

7.
In your script you are providing file-based input; it might be useful
to clear the kernel's page cache before each run, to ensure cached data
does not affect the results (especially when dealing with large files).
See: https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
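
A sketch of what the linked article describes (requires root):

  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null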




To conclude:
There is definitely merit in trying to optimize uniq's performance,
but it must be weighed against other factors (e.g. the ability to 'uniq'
a file larger than available memory, and the ease of achieving the same
effect with existing programs).

IMHO it is not a top priority, so I don't want to give the impression
that if an amazing speed improvement is presented, uniq will be swiftly
rewritten.

But of course, a working patch and strong pro arguments would go a long
way towards incorporating this feature.

regards,
 - assaf







