From: Joey Degges
Subject: Re: Taking advantage of L1 and L2 cache in sort
Date: Tue, 2 Mar 2010 10:50:57 -0800
2010/3/2 Pádraig Brady <address@hidden>
> Currently when sorting we take advantage of the RAM vs disk
> speed bump by using a large mem buffer dependent on the size of RAM.
> However we don't take advantage of the cache layer in the
> memory hierarchy which has an increasing importance in modern
> systems given the disparity between CPU and RAM speed increases.
>
> So let's test with a smaller buffer size to see whether we
> can take advantage of cache locality in the comparisons
> (the system and test setup details are included below):
>
> Using the standard large buffer as a base measurement:
> time sort < file > /dev/null
> 31.6s (26.7 user)
> Reducing this to 10M:
> time sort -S10M < file > /dev/null
> 27.3s (22.1 user)
> Now with a smaller buffer there will be many temp files
> created on disk, so to alleviate this overhead somewhat,
> let's put them on a ramdisk (mount -t tmpfs /ram /ram):
> time TMPDIR=/ram sort -S10M < file > /dev/null
> 25.0s (22.1 user)
> Now reducing the buffer to 1M to better match
> my cache size mentioned below:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 23.8s (21.1 user)
> Now with a smaller buffer, less data will be read
> up front by `sort` and so increasing the readahead
> done in the background should help (posix_fadvise(...SEQUENTIAL)):
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 22.8s (21.1 user)
> Note posix_fadvise(...WILLNEED) on the whole file runs
> synchronously, on my Linux system at least. We don't want
> that, as we lose some processing time in parallel with
> the kernel readahead, which is confirmed with:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 25.4s (21.1 user)
>
> So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> That's with no code changes (well, apart from posix_fadvise(...SEQUENTIAL),
> which is being added anyway as it's seen to speed up reads
> from faster flash devices that benefit from the larger readahead).
> Code changes to manage the buffers internally, rather than
> manually using the ram disk, should increase the gains further.
>
> In summary, a four-way win: faster, less CPU, much less RAM.
> There is also more granular access to system resources, so we can
> process more in parallel with readahead of the data, and we
> contend less with other processes.
>
> cheers,
> Pádraig.
>
> p.s. I haven't actually analysed the code or cache characteristics.
> I'm just thinking about it and testing matches the theory at least.
>
> System details:
>
> $ grep -E "(model name|cache)" /proc/cpuinfo
> model name : Intel(R) Pentium(R) M processor 1.70GHz
> cache size : 2048 KB
> $ echo "$(($(grep MemTot /proc/meminfo |
> tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> 2GB
> # hdparm -tT /dev/sda #mechanical disk
> Timing cached reads: 1470 MB in 2.00 seconds = 734.98 MB/sec
> Timing buffered disk reads: 92 MB in 3.04 seconds = 30.27 MB/sec
>
>
> Test setup:
>
> $ shuf -i 1-100000000 -n 10000000 > file #88MB
> $ export LANG=C
> # echo 1 > /proc/sys/vm/drop_caches #before each test
>
>
Fantastic results; they make a great case for SEQUENTIAL.
I've been able to reproduce these results on my quad system (no more
WILLNEED complaints :).
Thanks,
Joey