On Jan 12, 2011, at 2:56 PM, Moeller wrote:
> On 12.01.2011 14:25, Michael Dickens wrote:
>> the CPU). I think that if a GPU can be used, it will be most effective in things like filterbanks, or when searching for packets (via their unique sync sequence, so matched filtering), or very large FIR filters -- places where a LOT of computations and data must be processed and can be parallelized easily.
> Is there an efficient parallel FIR implementation for CUDA? You need only few operations on
> a large set of data. So, isn't this too much for the stream-processor local-memory?
> If GPU global memory has to be used, this would lead to a slower concurrent access.
> And then there is still the transfer time from/to the computer RAM.
> It would be great to have a fast filter, but is it really faster than an optimized SSE CPU FIR?
> I had the feeling, that the ratio of computing operations vs. number of samples has to be
> high for a significant GPU vs. CPU speedup.
> I'm curious about how much speedup you can achieve for FIR filters
> (let's say large/sharp filters of 1024 taps).
The "very large FIR filters" was a thought, as an example of an operation that might benefit from a GPU at least when using OpenCL (or CUDA). I haven't done testing yet to know if a GPU can do better than a CPU using vector instructions ... but I'm getting there. If/when I do get there, I'll post my results & thoughts.
Your comment about global versus local memory certainly does seem true from reading the OpenCL specs. Most modern GPUs have 3 levels of memory: global (for the whole GPU, across all cores), core (across all kernel execution units), and kernel -- in order of decreasing size, increasing access speed, and increasing time needed to move data in and out. I've been playing around with global memory only so far, but I'll look into the other levels as well to see what they can provide & the trade-offs required.
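To make the local-versus-global memory question concrete, here is a minimal sketch in plain Python (not GPU code, and not something that has been benchmarked) of how a long FIR filter can be split into tiles. Each tile stages its own input samples plus a "halo" of the preceding len(taps) - 1 samples into a small buffer, which stands in for the faster per-core memory. The names fir_direct, fir_tiled, and the tile parameter are made up for illustration and do not come from GNU Radio, OpenCL, or CUDA.

```python
def fir_direct(x, taps):
    """Reference direct-form FIR: y[n] = sum_k taps[k] * x[n - k], zero initial history."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y


def fir_tiled(x, taps, tile=256):
    """Same filter, computed one tile of outputs at a time.

    Assumption for illustration: 'local' plays the role of fast per-core
    memory, filled once per tile from the large input array x.
    """
    halo = len(taps) - 1  # earlier samples needed by the first output of a tile
    y = []
    for start in range(0, len(x), tile):
        stop = min(start + tile, len(x))
        lo = start - halo
        # Stage the tile plus its halo; samples before x[0] are treated as zero.
        local = [x[i] if i >= 0 else 0.0 for i in range(lo, stop)]
        for n in range(start, stop):
            base = n - lo  # position of x[n] inside the staged buffer
            acc = 0.0
            for k, h in enumerate(taps):
                acc += h * local[base - k]
            y.append(acc)
    return y
```

With 1024 taps and a tile of 256 outputs, each tile stages 256 + 1023 = 1279 samples and then performs 256 * 1024 = 262,144 multiply-accumulates on them, so the ratio of computation to data moved grows with both the tap count and the tile size. Whether a buffer of that size fits in the fast memory of a given device, and whether the result beats an SSE-optimized CPU filter, still has to be measured.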
Good & interesting discussion! - MLD