[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: efficient version of 'sort | uniq -c | sort -n'?
From: |
Paul Eggert |
Subject: |
Re: efficient version of 'sort | uniq -c | sort -n'? |
Date: |
Mon, 21 May 2007 14:42:20 -0700 |
User-agent: |
Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux) |
Matthew Woehlke <address@hidden> writes:
> Is there an efficient implementation of 'sort | uniq -c | sort -n'? I
> have a 4 GB core file I want to run 'strings' on, and the above is
> really slow.
See Jon Bently, Don Knuth, Doug McIlroy, "Programming pearls: a
literate program", CACM 29, 6 (June 1986), 471-483
<http://doi.acm.org/10.1145/5948.315654>. Source code is included. I
think Knuth's solution will run rings around all the solutions
proposed so far, if you tune it right (see below).
> Is there a way already in coreutils to do this? If not, would there
> be any interest in adding such a method?
I dunno, it sounds pretty specialized. Though there may be some
interest in a combination "sort | uniq -c", I wouldn't think there'd
be any interest in combining all three.
Philip Rowlands <address@hidden> writes:
> A fundamentally more efficient approach would be something like:
>
> perl -lne '$bucket{$_}++; END { foreach $key (keys %bucket) { print
> "$bucket{$key} $key" } }' | \
> sort -n
Hmm, well, I tried that, and the Perl approach was tons slower on my
box (Debian stable, 2.4 GHz P4, 1 GB RAM). The problem was that perl
grabbed too much memory and started thrashing. The naive 'sort'
pipeline was much faster since it did not thrash.
Assuming there are a lot of duplicates, if you want to speed up the
naive 'sort' implementation you might try this:
strings | sort -S 50% | uniq -c | sort -n
On my host this causes the first 'sort' to take about 150% of the CPU
time of the 'strings' invocation. The 'uniq' and second 'sort' take
about 100% each. So the total CPU time is about 4.5x the time it
takes to run 'strings'. Your mileage may vary of course.