Taking advantage of L1 and L2 cache in sort

From: Pádraig Brady
Subject: Taking advantage of L1 and L2 cache in sort
Date: Tue, 02 Mar 2010 10:20:34 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20100111 Thunderbird/3.0.1

Currently when sorting we take advantage of the RAM vs disk
speed gap by using a large memory buffer sized relative to total RAM.
However we don't take advantage of the cache layers of the
memory hierarchy, which are increasingly important in modern
systems given that CPU speeds have grown much faster than RAM speeds.

So let's test with a smaller buffer size to see whether we can
take advantage of cache locality in the comparisons
(the system and test setup details are included below):

Using the standard large buffer as a base measurement:
  time sort < file > /dev/null
    31.6s (26.7 user)
Reducing this to 10M:
  time sort -S10M < file > /dev/null
    27.3s (22.1 user)
Now with a smaller buffer there will be many temp files
created on disk, so to alleviate this overhead somewhat,
let's put them on a ramdisk (mount -t tmpfs /ram /ram):
  time TMPDIR=/ram sort -S10M < file > /dev/null
    25.0s (22.1 user)
Now reducing the buffer to 1M to better match
my cache size mentioned below:
  time TMPDIR=/ram sort -S1M < file > /dev/null
    23.8s (21.1 user)
Now with a smaller buffer, less data is read up front
by `sort`, so increasing the readahead done in the
background should help (posix_fadvise(...SEQUENTIAL)):
  time TMPDIR=/ram sort -S1M < file > /dev/null
    22.8s (21.1 user)
Note that posix_fadvise(...WILLNEED) on the whole file runs
synchronously, on my Linux system at least. We don't want
that, as we'd lose the processing we could do in parallel with
the kernel readahead, which is confirmed with:
  time TMPDIR=/ram sort -S1M < file > /dev/null
    25.4s (21.1 user)

So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
That's with no code changes (well, apart from posix_fadvise(...SEQUENTIAL),
which is being added anyway as it's been seen to speed up reading
from faster flash devices that benefit from the larger readahead).
Code changes to manage the buffers internally, rather than
manually using the ram disk, should increase the gains further.

In summary, it's a win*4: faster, less CPU, much less RAM, and
more granular access to system resources, so we can process more in
parallel with the readahead of the data and contend less with other processes.


p.s. I haven't actually analysed the code or cache characteristics;
I'm just reasoning about it, and the tests at least match the theory.

System details:

$ grep -E "(model name|cache)" /proc/cpuinfo
model name  : Intel(R) Pentium(R) M processor 1.70GHz
cache size  : 2048 KB
$ echo "$(($(grep MemTot /proc/meminfo |
             tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
# hdparm -tT /dev/sda #mechanical disk
 Timing cached reads:   1470 MB in  2.00 seconds = 734.98 MB/sec
 Timing buffered disk reads:   92 MB in  3.04 seconds =  30.27 MB/sec

Test setup:

$ shuf -i 1-100000000 -n 10000000 > file #88MB
$ export LANG=C
# echo 1 > /proc/sys/vm/drop_caches #before each test
