Re: Octave 3.6.0 on Windows XP plot fails.

From: Przemek Klosowski
Subject: Re: Octave 3.6.0 on Windows XP plot fails.
Date: Wed, 29 Feb 2012 11:54:57 -0500
User-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20120131 Thunderbird/10.0

On 02/29/2012 10:01 AM, Michael Goffioul wrote:
On Wed, Feb 29, 2012 at 2:28 PM, Martin Helm <address@hidden> wrote:
On 29.02.2012 at 14:45, Xianyi Zhang wrote:
The matrix multiplication cannot obtain the performance from [...]

Why not? Is this a limitation of the mingw compiler, the windows
environment or the BLAS library in question?

No, I think it's because of the principle of hyperthreading. HT does
not mean you magically have 4 independent cores out of 2. You still
have only 2 physical cores, but some parts of each core are duplicated
such that they can appear as 4 instead of 2 at the OS level. However,
the processing unit is not duplicated: so within a single physical
core, each logical CPU will have to wait its turn on the processing
unit.

Hyperthreading, a.k.a. SMT (simultaneous multithreading), provides two sets of architectural registers but only one set of execution units and one load/store path to main memory. Its main benefit is masking memory latency: the core runs the thread whose data is already loaded into CPU registers while the other thread is stalled waiting for data from DRAM.

Register-based instructions retire at roughly one per clock, while memory latency (the load/store unit in the CPU sending the address to the memory interface, virtual-to-physical translation via the TLBs, cache lookups, and finally the DRAM access and the data's trip back) is measured in tens of nanoseconds. So while one thread waits on a main-memory request, the other can run 50 or so instructions if its data is already in registers. In the best case, by the time the second thread needs DRAM data, the first has finished loading its own, and the two alternate, covering up each other's DRAM latency.
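That back-of-the-envelope arithmetic can be made explicit. The 1 ns instruction cost and 50 ns DRAM latency below are illustrative assumptions in the spirit of the paragraph above, not measurements of any particular CPU:

```python
# Illustrative latency-hiding arithmetic (assumed figures, not measured):
# if a register instruction retires in ~1 cycle at ~1 GHz (~1 ns), and a
# main-memory access costs ~50 ns, then one DRAM stall leaves room for
# roughly 50 instructions of useful work on the sibling hyperthread.
instruction_cost_ns = 1.0    # assumed cost of one register-based instruction
dram_latency_ns = 50.0       # assumed round-trip main-memory latency

instructions_hidden = dram_latency_ns / instruction_cost_ns
print(instructions_hidden)   # ~50 instructions hideable behind one stall
```

This is the best case; if both threads stall on memory at the same time, as in a memory-bound kernel, there is nothing to overlap.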

Unfortunately, naive matrix multiplication is memory-intensive (load two numbers, multiply, accumulate), so there is little opportunity for long register-only stretches of computation. Hyperthreading does show some limited benefit in practice, but the general recommendation is not to count on it for this kind of workload.
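The memory-bound nature is visible in the schoolbook triple loop, sketched here in Python for readability (real BLAS kernels are written in C and assembly):

```python
def naive_matmul(A, B):
    """Schoolbook matrix multiply on lists of lists.

    The inner loop does two loads (A[i][k] and B[k][j]) per multiply-add:
    roughly one flop per memory access, so the CPU spends most of its time
    waiting on memory rather than computing.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += A[i][k] * B[k][j]  # load, load, multiply, accumulate
            C[i][j] = acc
    return C

print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```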

There's a silver lining though: blocking the arrays into consecutive cache-sized chunks, pre-loading them, and using vector operations such as SSE does work, and that is exactly what ATLAS and GotoBLAS do.
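A minimal sketch of the blocking idea, again in Python for clarity (the tile size of 2 is arbitrary for the demonstration; a real BLAS tunes the tile to fit the L1/L2 caches and vectorizes the inner kernel with SSE/AVX):

```python
def blocked_matmul(A, B, tile=2):
    """Cache-blocking sketch for square matrices (lists of lists).

    Working on tile x tile sub-blocks means each loaded element of A and B
    is reused ~tile times before eviction, raising the flops-per-load ratio
    that makes the naive triple loop memory-bound. ATLAS and GotoBLAS apply
    the same idea with cache-tuned tile sizes and SIMD inner kernels.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Multiply one pair of tiles; their operands stay cache-resident.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

In pure Python the blocking buys nothing (interpreter overhead dominates), but in compiled code this restructuring, plus SIMD, is where BLAS libraries get their speed without needing hyperthreading at all.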
