|Subject:||Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests|
|Date:||Thu, 10 Sep 2009 19:36:11 -0600|
Find attached the cleaned-up benchmark data for both the 2xXeon 5130 and 2xNocona machines.
I've also done new research, which now covers the impact of cache size, single-threaded vs. multithreaded binary, and number of threads. The main result graph is attached; the data is in the same spreadsheet as the two other benchmarks (OpenOffice 3.1 format), in 3 worksheet tabs.
The experiment is based on the same 5 seven-point FIBS matches used for the previous benchmarks. Two binaries were compiled, one with multithreading (GNUBGMT) and one without (GNUBGST). Both were compiled with gcc 126.96.36.199 on Debian 5.0.2, heavily optimized for core2 CPUs. SSE and SSE2 are used; the code base is gnubg.org CVS as of 2 August 2009. The hardware is a Supermicro 2xXeon 5130 machine with 6GB DDR2-5300 memory. The machine was completely idle during testing.
The 5 matches were analyzed 4 times each, resulting in a total of 20 match evaluations at 2ply/no pruning/cubeful. All caches were cleaned before each analysis. Cache size was varied from 2^1 to 2^27 bytes, resulting in 27 runs for each graph.
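As a rough illustration of the experiment grid described above, here is a minimal Python sketch that enumerates the cache sizes and the match evaluations per run. The match file names are hypothetical placeholders, not the actual files used.

```python
from itertools import product

# Cache sizes swept in the experiment: 2^1 .. 2^27, i.e. 27 settings.
CACHE_SIZES = [2 ** k for k in range(1, 28)]

# Hypothetical placeholder names for the 5 seven-point FIBS matches.
MATCHES = [f"match{i}.sgf" for i in range(1, 6)]
REPEATS = 4  # each match is analyzed 4 times


def evaluations_per_run():
    """Return the (match, repeat) pairs analyzed at one cache size."""
    return list(product(MATCHES, range(REPEATS)))


print(len(CACHE_SIZES), len(evaluations_per_run()))  # prints "27 20"
```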
* Graphs "Threads=1,2,3,4,5" were done with the MT binary and the respective settings for cache and threads, 20 matches
* Graph "No Threading" was done with GNUBGST, 20 matches
* Graph "4xNo threading á 1/4 work" was done by running 4 instances of GNUBGST in parallel, each with 5 matches to analyze
* Graph "4xThreads=1 á 1/4 work" was done by running 4 instances of GNUBGMT in parallel, each set to use one thread and with 5 matches to analyze
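The "4x ... á 1/4 work" setup can be sketched roughly as follows: split the 20 evaluations into 4 equal chunks, start one process per chunk, and take the wall-clock time until the last one exits. This is only an assumed harness, not the script actually used; the command below is a do-nothing stand-in for one analysis process.

```python
import subprocess
import sys
import time


def split_evenly(jobs, n_workers):
    """Deal jobs round-robin into n_workers lists of near-equal length."""
    return [jobs[i::n_workers] for i in range(n_workers)]


def run_in_parallel(commands):
    """Start all commands at once; return wall-clock seconds until the last exits."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


jobs = list(range(20))          # 20 match evaluations
chunks = split_evenly(jobs, 4)  # 4 instances, 5 evaluations each
# Stand-in command: a real run would start one analysis process per chunk.
commands = [[sys.executable, "-c", "pass"] for _ in chunks]
elapsed = run_in_parallel(commands)
print([len(c) for c in chunks], f"{elapsed:.3f}s")
```

Measuring until the slowest instance finishes keeps the result comparable to a single process that handles all 20 evaluations.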
- The "spontaneous speedup" spikes seen especially for Threads=2 are odd. I did several runs and they didn't disappear, but showed up at different frequencies and cache size positions. I consider them bugs in the Unix time command.
- Data for Threads=6,7,8 was also collected but is not plotted because, as expected, performance decreased with a growing number of threads. The graph for Threads=5 shows that sufficiently; no need to clutter the diagram with more.
- The "4xThreads..." and "4x No Threading" runs aborted with out of memory for cachesize=2^26 and 2^27 (no surprise), thus no data for them.
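One way to cross-check spikes like the ones noted above is to time each configuration several times with a monotonic clock and compare the median against the minimum and maximum. The sketch below is an assumption about how that could be done, not the procedure used for the attached data; the command is a stand-in.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run cmd `repeats` times; return all wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Median plus min/max, so a single outlier run stands out."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in command; replace with the actual analysis invocation.
print(summarize(time_command([sys.executable, "-c", "pass"], repeats=3)))
```

If a spike shows up in the minimum but not in the median, it is more likely a one-off measurement artifact than a real effect of the cache size.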
I would very much like to hear some comments from you, Jonathan (the author of the threading code). Happy with what you see? Well, I think you did a good job :)
From: Jonathan Kinsey [mailto:address@hidden]
Sent: Tuesday, August 04, 2009 9:41 AM
Cc: address@hidden; address@hidden
Subject: Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests
It's not clear if you were using the hyper-threaded machine, as this might
explain the jump from 1 to 2 cores and the smaller jump to 3 and 4 cores.
If you were using machine "B", try running the test again for 1,2,3,4 threads on
machine "A". Make sure the cache size is set to maximum.
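To tell a hyper-threaded box from one with only physical cores, one option on x86 Linux is to compare the number of logical CPUs with the number of distinct (physical id, core id) pairs in /proc/cpuinfo. The following is a minimal sketch under that assumption; the sample text is made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            # A core is identified by its chip plus its core number on that chip.
            cores.add((physical_id, value))
    return logical, len(cores)


# Made-up sample: one chip, one core, two hyper-threads.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0
"""
print(count_cpus(SAMPLE))  # prints "(2, 1)"
```

On a real machine the text would come from open("/proc/cpuinfo").read(). More logical CPUs than physical cores indicates hyper-threading.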
Ingo Macherius wrote:
> Christian, I've conducted your suggested experiment (batch eval of saved matches) and can confirm your answer. Calibrate is not a suitable metric to evaluate threading behaviour for gnubg.
> The batch experiment analyzed five 7pt matches 4 times each, with full cache cleaning. The time was taken with the Unix "time" command. The results are much more like what one would expect:
> - Speed peaked when the number of threads equaled the number of cores
> - Adding more threads than cores slowed down the evaluation (albeit by only a tiny bit)
> - The slowdown grew with the number of threads
> The odd finding is that there still are some anomalies, which are:
> - Going from 1 to 2 threads more than doubles the evaluation speed
> - Adding more threads has very little effect, i.e. the gain is not linear in # cores
> - 2, 3 and 4 threads result in speeds very close to each other, much closer than expected
> I've attached a ZIP which contains the original OpenOffice 3.1 spreadsheet and a PDF version of the graphs with the experiment details.
> Thx a lot for your guidance!
>> -----Original Message-----
>> From: Christian Anthon [mailto:address@hidden]
>> Sent: Monday, August 03, 2009 12:29 PM
>> To: Ingo Macherius
>> Cc: address@hidden
>> Subject: Re: [Bug-gnubg] Benchmarks on server class machines
>> and resulting change requests
>> The calibrate function sucks big time. The threaded calibrate
>> function sucks even more. I'm tempted to call it useless. I
>> believe that you are observing the following: There is some
>> overhead involved in displaying and updating the calibration,
>> and as you are increasing the number of threads more and more
>> time is allocated to evaluation and less and less to
>> overhead. If you really want to test the speed of the
>> threading then you should analyse a match or perform a rollout.
>> The original calibration was meant to calibrate certain
>> timing functions against the speed of your computer, so
>> overhead didn't really matter. That is, the function measures
>> the speed of your computer, not the speed of gnubg.
>> On Sun, Aug 2, 2009 at 5:06 PM, Ingo Macherius wrote:
>>> I have benchmarked gnubg on two server machines, with particular focus
>>> on multithreading. Both machines are headless and run Debian 5.x
>>> Lenny, Kernel 2.6.26-2-amd64 #1 SMP x86_64 GNU/Linux. The hardware is:
>>> box_A: 2xXeon 5130 @ 2GHz (4 physical cores in 2 chips)
>>> box_B: 2xXeon Nocona @ 3GHz (2 physical cores plus 2 HT "cores" in 2
>>> I found two issues with current gnubg (latest CVS version as of August
>>> 1st 2009, compiled with gcc 188.8.131.52 with -march=native and sse2
>>> 1) The "calibrate" command output is off by a factor of 1000, i.e.
>>> reports eval/s values 1000 times too high. This holds for the figure
>>> reported in the official Debian binary installed via apt-get.
>>> 2) The limit of 16 threads is too low, I found that to utilize the CPU
>>> power to 100% 8 threads per core are needed. Interestingly this holds
>>> for the virtual HT cores as well.
>>> @1: Please check the timer code, the problem seems to be in
>>> Obviously the #ifdef part for Windows is fine, but all other machines
>>> use a faulty version of the timer. I can't really suggest a solution,
>>> but here is some background info from wikipedia:
>>> http://en.wikipedia.org/wiki/Rdtsc
>>> I would help to fix this one by testing on the aforementioned machines
>>> under 64 bit Linux.
>>> @2: I've tested with a custom gnubg binary with the bug at @1 fixed
>>> the hard way, by a hard-coded division by 1000, and the thread limit
>>> raised to 256. While calibrate was running I monitored CPU utilization
>>> using the mpstat command.
>>> box_A peaks at about 202K eval/s with 8 threads per core (32 total),
>>> from where on the number is static until it starts decreasing again
>>> when you use hundreds of threads. Between 1 and 3 threads I see the
>>> expected gain of almost 100% per thread added. Using 4 threads is
>>> lowering the throughput as compared to 3 threads. Between 5 and 32
>>> threads I see rising throughput which first is linear, and becomes
>>> asymptotic as we get closer to 32 threads. Below 32 threads, mpstat
>>> reports significant idle times for each CPU, at 32 I see each is using
>>> 100% of the cycles.
>>> A very similar behavior is visible on box_B, despite the fact 2 of its
>>> "cores" are virtual HT cores.
>>> Extrapolating the results suggests gnubg should increase the limit for
>>> the number of max. threads to 64, maybe even 128 or 256. Rationale:
>>> recent server hardware with dual quadcores has 8 cores, which should
>>> be fully utilizable only with 64 threads. The suggested 128
>>> anticipates future improvements. As there seems to be little to no
>>> cost with higher values for max. threads, this seems like a cheap way
>>> to speed up gnubg on server class machines and quad cores at little to
>>> no cost.
>>> Bug-gnubg mailing list
>>> address@hidden http://lists.gnu.org/mailman/listinfo/bug-gnubg