
Re: bug#15077: Clarification

From: Assaf Gordon
Subject: Re: bug#15077: Clarification
Date: Mon, 12 Aug 2013 21:02:32 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130630 Icedove/17.0.7

(CC'ing the list so that others could comment)

Hello Federico,

On 08/12/2013 06:50 PM, CDR wrote:
How do I get the latest version, even beta, of join, sort, etc.?

I would not recommend using "beta" or "development" versions of GNU coreutils 
for production code, just to be on the safe side.
The stable releases are available as source code here:
With more details here:

One thing that I suggest is to change sort, comm and join to use more
than one core. I had to use a commercial version of sort because the
"regular" version takes forever to sort a 15G file. The commercial
version is called nsort; it uses all the cores in the machine, and
you may also add a flag to give the program a huge memory block. It
works about ten times faster than the "regular" sort.

Starting with version 8.6, sort can use multiple cores to improve sorting speed (see
the "--parallel" parameter).
Sort also supports the "--buffer-size" parameter to explicitly specify how much
memory to use.
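
As a rough sketch of how the two parameters combine (the file names, thread count and buffer size below are made up for illustration, and the LC_ALL=C setting is an extra assumption on my part that only fits if plain byte-order sorting is acceptable for your data):

```shell
# Tiny stand-in for the real input; a 15G file would be used the same way.
printf '%s\n' pear apple orange banana > input.txt

# --parallel=4      use up to 4 sort threads (illustrative value)
# --buffer-size=64M main-memory buffer size (illustrative; use far more for big files)
# LC_ALL=C          byte-wise comparison, which skips locale-aware collation
LC_ALL=C sort --parallel=4 --buffer-size=64M --output=sorted.txt input.txt
```

Suitable values depend on the machine, so treat the numbers as starting points to experiment with rather than recommendations.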

I'm not familiar with "nsort" and cannot comment on nsort vs. GNU sort speeds.
I believe that on modern hardware, sorting 15G should take a few minutes at most, not
"forever" - but that depends on many factors (e.g. cores, memory, disk).

"join" operates on sorted input and, as such, requires very little CPU, so
I do not think much can be gained from making "join" multi-threaded.
I believe the same applies to "comm".
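
To illustrate the point about sorted input, here is a minimal sketch (file names are hypothetical, and using the C locale is my assumption, not a requirement): the sorting is done once up front, after which "comm" only makes a single linear pass over both files.

```shell
# Two tiny stand-ins for the daily files.
printf '%s\n' a c b > day1.txt
printf '%s\n' b d c > day2.txt

# Sort and compare under the same locale so the orderings agree.
export LC_ALL=C
sort -o day1.sorted day1.txt
sort -o day2.sorted day2.txt

# -13 suppresses columns 1 and 3, leaving lines found only in the second file.
comm -13 day1.sorted day2.sorted > only_day2.txt
# -12 suppresses columns 1 and 2, leaving lines common to both files.
comm -12 day1.sorted day2.sorted > common.txt
```

If the comparison is still slow with pre-sorted files, the time is more likely going into the sort step or disk I/O than into "comm" itself.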

I am using "comm" a lot for a business problem that involves comparing
daily files that have 550 MM records. I find it extremely slow. Do
you have any suggestions?

Others could perhaps comment on ways to improve performance when using GNU 

I'd assume it very much depends on the technical details of what you're comparing -
perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk
speed, algorithm).

