Re: Taking advantage of L1 and L2 cache in sort

From: Joey Degges
Subject: Re: Taking advantage of L1 and L2 cache in sort
Date: Tue, 2 Mar 2010 10:50:57 -0800

2010/3/2 Pádraig Brady <address@hidden>

> Currently when sorting we take advantage of the RAM vs disk
> speed bump by using a large mem buffer dependent on the size of RAM.
> However we don't take advantage of the cache layer in the
> memory hierarchy which has an increasing importance in modern
> systems given the disparity between CPU and RAM speed increases.
> So let's test with a smaller buffer size to see whether we
> can take advantage of cache locality in the comparisons
> (system and test setup details are included below).
> Using the standard large buffer as a base measurement:
>  time sort < file > /dev/null
>    31.6s (26.7 user)
> Reducing this to 10M:
>  time sort -S10M < file > /dev/null
>    27.3s (22.1 user)
> Now with a smaller buffer there will be many temp files
> created on disk, so to alleviate this overhead somewhat,
> let's put them on a ramdisk (mount -t tmpfs /ram /ram):
>  time TMPDIR=/ram sort -S10M < file > /dev/null
>    25.0s (22.1 user)
> Now reducing the buffer to 1M to better match
> my cache size mentioned below:
>  time TMPDIR=/ram sort -S1M < file > /dev/null
>    23.8s (21.1 user)
> Now with a smaller buffer, less data will be read
> up front by `sort` and so increasing the readahead
> done in the background should help (posix_fadvise(...SEQUENTIAL)):
>  time TMPDIR=/ram sort -S1M < file > /dev/null
>    22.8s (21.1 user)
> Note posix_fadvise(...WILLNEED) on the whole file runs
> synchronously on my Linux system at least. We don't want
> that, as we lose some processing time in parallel with
> the kernel readahead, which is confirmed with:
>  time TMPDIR=/ram sort -S1M < file > /dev/null
>    25.4s (21.1 user)
> So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> That's with no code changes (well, apart from posix_fadvise(...SEQUENTIAL),
> which is being added anyway as it's seen to speed up reading
> from faster flash devices that benefit from the larger readahead).
> Code changes to manage the buffers internally, rather than
> manually using the ram disk, should increase the gains further.
> In summary it's a win*4: faster, less CPU, much less RAM. Also,
> with more granular access to system resources we can process more
> in parallel with readahead of the data, and we contend less with
> other processes.
> cheers,
> Pádraig.
> p.s. I haven't actually analysed the code or cache characteristics.
> I'm just reasoning about it, and the tests match the theory at least.
> System details:
> $ grep -E "(model name|cache)" /proc/cpuinfo
> model name  : Intel(R) Pentium(R) M processor 1.70GHz
> cache size  : 2048 KB
> $ echo "$(($(grep MemTot /proc/meminfo |
>             tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> 2GB
> # hdparm -tT /dev/sda #mechanical disk
>  Timing cached reads:   1470 MB in  2.00 seconds = 734.98 MB/sec
>  Timing buffered disk reads:   92 MB in  3.04 seconds =  30.27 MB/sec
> Test setup:
> $ shuf -i 1-100000000 -n 10000000 > file #88MB
> $ export LANG=C
> # echo 1 > /proc/sys/vm/drop_caches #before each test
Fantastic results; they make a great case for SEQUENTIAL.

I've been able to reproduce these results on my quad-core system (no
more WILLNEED complaints :)).
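
On the "manage the buffers internally" idea: the gain should come from sorting cache-sized runs first and only then merging them. A toy sketch of that shape, using integers instead of lines and a `run` parameter standing in for a cache-sized buffer (nothing here is taken from sort itself):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort a[0..n) by first qsort-ing runs of `run` elements (which would
   be sized to fit in L1/L2 cache), then merging the runs bottom-up.
   Each comparison-heavy qsort then touches only a cache-resident
   region; the merges are sequential passes.  Minimal sketch only. */
static void chunked_sort(int *a, size_t n, size_t run)
{
    for (size_t i = 0; i < n; i += run) {
        size_t len = (n - i < run) ? n - i : run;
        qsort(a + i, len, sizeof *a, cmp_int);
    }

    int *aux = malloc(n * sizeof *a);
    if (!aux) abort();
    for (size_t w = run; w < n; w *= 2) {          /* double run width */
        for (size_t i = 0; i < n; i += 2 * w) {    /* merge run pairs  */
            size_t mid = (i + w < n) ? i + w : n;
            size_t hi  = (i + 2 * w < n) ? i + 2 * w : n;
            size_t l = i, r = mid, o = i;
            while (l < mid && r < hi)
                aux[o++] = (a[l] <= a[r]) ? a[l++] : a[r++];
            while (l < mid) aux[o++] = a[l++];
            while (r < hi)  aux[o++] = a[r++];
            memcpy(a + i, aux + i, (hi - i) * sizeof *a);
        }
    }
    free(aux);
}
```

With `run` tuned to the cache size this is the same locality effect the -S1M experiment gets externally via temp files, just without the temp-file traffic.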

