Re: [Discuss-gnuradio] GNURadio and CUDA reprised

From: Steven Clark
Subject: Re: [Discuss-gnuradio] GNURadio and CUDA reprised
Date: Wed, 12 Jan 2011 15:39:59 -0500

On Wed, Jan 12, 2011 at 3:22 PM, Michael Dickens <address@hidden> wrote:
On Jan 12, 2011, at 2:56 PM, Moeller wrote:
> On 12.01.2011 14:25, Michael Dickens wrote:
>> the CPU).  I think that if a GPU can be used, it will be most effective in things like filterbanks, or when searching for packets (via their unique sync sequence, so matched filtering), or very large FIR filters -- places where a LOT of computations and data must be processed and can be parallelized easily.
> Is there an efficient parallel FIR implementation for CUDA? You need only a few operations on
> a large set of data. So isn't that too much data for the stream processor's local memory?
> If GPU global memory has to be used, concurrent access would be slower.
> And then there is still the transfer time from/to the computer's RAM.
> It would be great to have a fast filter, but is it really faster than an optimized SSE CPU FIR?
> I had the feeling that the ratio of computing operations to number of samples has to be
> high for a significant GPU vs. CPU speedup.
> I'm curious about how much speedup you can achieve for FIR filters
> (let's say large/sharp filters of 1024 taps).

The "very large FIR filters" item was just a thought: an example of an operation that might benefit from a GPU, at least when using OpenCL (or CUDA).  I haven't done any testing yet to know whether a GPU can do better than a CPU using vector instructions ... but I'm getting there.  If/when I do get there, I'll post my results & thoughts.

Your comment about global versus local memory certainly does seem true from reading the OpenCL specs.  Most modern GPUs have 3 levels of memory: global (for the whole GPU, across all cores), core (shared by the execution units within one core; what OpenCL calls "local" memory), and kernel (per work-item; OpenCL "private" memory) -- in order of decreasing size, increasing access speed, and increasing cost to stage data into and out of them.  I've been playing around with global memory only so far, but I'll look into the other levels as well to see what they can provide & the trade-offs required.

Good & interesting discussion! - MLD

Since FFTs & IFFTs are so speedy on GPUs (CUFFT is quite good now), a good way to do large FIR filters is to filter in the frequency domain: FFT -> pointwise multiply -> IFFT. The cost then depends on the FFT size rather than growing with every extra tap, so you can have arbitrarily sharp filters.
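
To make the FFT -> pointwise multiply -> IFFT idea concrete, here is a minimal CPU-side sketch in plain Python (standard library only). It is not GPU code and does not use CUFFT; the function names fft, fir_direct and fir_fft are made up for illustration, and the tiny recursive FFT only stands in for whatever FFT library you would really call. The one detail that matters is the zero-padding, so that the circular convolution the FFT gives you matches the linear convolution a FIR filter computes.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled). len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fir_direct(samples, taps):
    """Time-domain FIR (full linear convolution): one multiply per sample per tap."""
    out = [0.0] * (len(samples) + len(taps) - 1)
    for i, s in enumerate(samples):
        for j, h in enumerate(taps):
            out[i + j] += s * h
    return out

def fir_fft(samples, taps):
    """Same convolution done as FFT -> pointwise multiply -> IFFT."""
    out_len = len(samples) + len(taps) - 1
    n = 1
    while n < out_len:  # pad to a power of two >= output length
        n *= 2
    X = fft(list(samples) + [0.0] * (n - len(samples)))
    H = fft(list(taps) + [0.0] * (n - len(taps)))
    Y = [a * b for a, b in zip(X, H)]   # pointwise multiply
    y = fft(Y, inverse=True)            # unscaled inverse transform
    return [v.real / n for v in y[:out_len]]

if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0, 5.0]
    taps = [0.5, 0.25, 0.25]
    print(fir_direct(samples, taps))
    print(fir_fft(samples, taps))
```

Both functions should agree to within floating-point rounding. For a continuous stream you would process it in blocks and stitch the results together with overlap-add or overlap-save; that part is left out here.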