Re: efficient version of 'sort | uniq -c

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: efficient version of 'sort | uniq -c | sort -n'?

From:	Paul Eggert
Subject:	Re: efficient version of 'sort \| uniq -c \| sort -n'?
Date:	Mon, 21 May 2007 14:42:20 -0700
User-agent:	Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Matthew Woehlke <address@hidden> writes:

> Is there an efficient implementation of 'sort | uniq -c | sort -n'? I
> have a 4 GB core file I want to run 'strings' on, and the above is
> really slow.

See Jon Bently, Don Knuth, Doug McIlroy, "Programming pearls: a
literate program", CACM 29, 6 (June 1986), 471-483
<http://doi.acm.org/10.1145/5948.315654>.  Source code is included.  I
think Knuth's solution will run rings around all the solutions
proposed so far, if you tune it right (see below).

> Is there a way already in coreutils to do this? If not, would there
> be any interest in adding such a method?

I dunno, it sounds pretty specialized.  Though there may be some
interest in a combination "sort | uniq -c", I wouldn't think there'd
be any interest in combining all three.

Philip Rowlands <address@hidden> writes:

> A fundamentally more efficient approach would be something like:
>
> perl -lne '$bucket{$_}++; END { foreach $key (keys %bucket) { print 
> "$bucket{$key} $key" } }' | \
>   sort -n

Hmm, well, I tried that, and the Perl approach was tons slower on my
box (Debian stable, 2.4 GHz P4, 1 GB RAM).  The problem was that perl
grabbed too much memory and started thrashing.  The naive 'sort'
pipeline was much faster since it did not thrash.

Assuming there are a lot of duplicates, if you want to speed up the
naive 'sort' implementation you might try this:

strings | sort -S 50% | uniq -c | sort -n

On my host this causes the first 'sort' to take about 150% of the CPU
time of the 'strings' invocation.  The 'uniq' and second 'sort' take
about 100% each.  So the total CPU time is about 4.5x the time it
takes to run 'strings'.  Your mileage may vary of course.

[Prev in Thread]

Current Thread

[Next in Thread]

efficient version of 'sort | uniq -c | sort -n'?, Matthew Woehlke, 2007/05/21
- Re: efficient version of 'sort | uniq -c | sort -n'?, James Youngman, 2007/05/21
  - Re: efficient version of 'sort | uniq -c | sort -n'?, Matthew Woehlke, 2007/05/21
    - Re: efficient version of 'sort | uniq -c | sort -n'?, Philip Rowlands, 2007/05/21
    - Re: efficient version of 'sort | uniq -c | sort -n'?, Matthew Woehlke, 2007/05/21
- Re: efficient version of 'sort | uniq -c | sort -n'?, Paul Eggert <=
  - Re: efficient version of 'sort | uniq -c | sort -n'?, Matthew Woehlke, 2007/05/21

Prev by Date: Re: efficient version of 'sort | uniq -c | sort -n'?
Next by Date: Re: efficient version of 'sort | uniq -c | sort -n'?
Previous by thread: Re: efficient version of 'sort | uniq -c | sort -n'?
Next by thread: Re: efficient version of 'sort | uniq -c | sort -n'?
Index(es):
- Date
- Thread