From: Marcus G. Daniels
Subject: Re: [Swarm-Modelling] ABMs on Graphical Processor Units
Date: Fri, 28 Dec 2007 17:47:08 -0700
User-agent: Thunderbird 2.0.0.9 (X11/20071115)
Russell Standish wrote:
> All you need to do is link statically, rather than dynamically. This happens by default when you use MPICH, for instance. Then you are just loading up the parts that you use. But seriously, how much local memory do you get on a Cell local store? If it is not enough to store a few megabytes of dynamic libraries, it will not be enough to do any serious ABM simulation, which tends to need 100s of MB.

There's a lot of machinery in OpenMPI that gets pulled in no matter what, owing in part to multiple abstraction layers, including a component model. Perhaps MPICH would be easier to strip down, but even with static linkage it was clear to me it wasn't going to fit in < 64k (plus, say, another 64k for heap), which is basically what you'd want in order to keep it resident on the local store. (Keep in mind you want some local store left over to do real work, and there is only 256kb per SPU.) It is possible, using the latest GCC, to build a library for the Cell SPU into overlays and have callers automatically tickle the overlay they need. When a different overlay is needed, it gets pulled over via DMA. (Not much different in cost from evicting something from L2 cache, but nonetheless a cost only experienced programmers even recognize.)
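For readers unfamiliar with overlays: the mechanism is the standard GNU ld OVERLAY command, where several code segments share one address range in the local store and only one is resident at a time. A minimal sketch of a manual overlay linker script (the section names, load address, and object files here are illustrative, not from the post; newer SPU toolchains can also generate overlays automatically, which is what gives the "callers automatically tickle the overlay they need" behavior):

```
/* Hypothetical GNU ld script fragment: two overlay segments that
   occupy the same local-store address range, swapped in via DMA
   when a function in the non-resident segment is called. */
SECTIONS
{
  OVERLAY 0x3000 : AT (0x3000)
  {
    .ovl.comm  { mpi_comm.o(.text)  }  /* segment 1 */
    .ovl.coll  { mpi_coll.o(.text)  }  /* segment 2, same LS range */
  }
}
```

The trade-off Marcus describes falls out directly: each cross-overlay call that misses the resident segment costs a DMA transfer, much like an L2 eviction.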
The problem of keeping any serial processor busy is one of keeping calculations close to their memory (or other blocking operations, like I/O). The reality is that if we don't do that, or fail to tolerate latency with built-in parallelism, then we're wasting compute cycles anyway. DDR will never be as fast as a register. And we can't just wave our hands and make all problems inherently parallel. I suppose one could wish that SPUs each had a 24MB local store, like a high-end Itanium's cache. By my calculations that would be about 12 billion transistors.
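A quick sanity check of that figure (my arithmetic, not from the post; it assumes 8 SPUs per Cell and a classic 6-transistor SRAM cell, ignoring decoders, sense amps, and redundancy):

```python
# Back-of-envelope: transistor cost of giving each of 8 SPUs a 24 MB local store.
SPUS = 8
LOCAL_STORE_BYTES = 24 * 1024 * 1024   # hypothetical 24 MB per SPU
BITS_PER_BYTE = 8
TRANSISTORS_PER_BIT = 6                # 6T SRAM cell

total = SPUS * LOCAL_STORE_BYTES * BITS_PER_BYTE * TRANSISTORS_PER_BIT
print(f"{total:,} transistors")        # 9,663,676,416 -- roughly 10 billion
```

Adding the peripheral circuitry a real SRAM array needs pushes that toward the "about 12 billion" estimate above, versus roughly a quarter of a billion for the actual 256kb stores.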
As a data point, Sony's distributed [protein] Folding@home PS/3 network hit a petaflop a few months ago. They started from the standard Gromacs codebase and kept reworking and optimizing it. It soon overshadowed the PC Folding@home network: http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
Anyway, my point is not to push the Cell, but to say that GPUs, Cell processors, vector units, and microprocessors all have trade-offs. None of them gives you parallelism where it can't be proven from the code or doesn't exist as an obvious part of the algorithm.
Marcus