Re: [Discuss-gnuradio] Bidirectional communication between attached blocks
Mon, 20 Apr 2015 16:21:32 +0200
Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
If I may recommend something: have a look at VOLK.
It's the optimization library that ships with GNU Radio.
If you could implement some of these algorithms in CUDA, then every
block currently using VOLK (which is the majority of the
arithmetically challenging blocks at the moment) could automatically
make use of your accelerations, without having to change anything!
Also, VOLK comes with volk_profile, which it uses to test the
different implementations that work on your hardware, looking for
the fastest one. That would be the ultimate benchmark for your
kernels, as it directly compares the efficiency of the "general C"
and CPU-SIMD implementations to your CUDA kernels.
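To illustrate the idea: a VOLK kernel such as volk_32f_x2_multiply_32f ships several implementations ("protokernels") of the same operation -- generic C, SSE, AVX, and potentially a CUDA one -- and volk_profile picks the fastest on the machine at hand. A minimal sketch of what the "generic" protokernel computes, in plain C++ with no VOLK dependency (just the semantics any faster implementation must reproduce):

```cpp
#include <cstddef>
#include <vector>

// Semantics of a generic VOLK protokernel (modeled on
// volk_32f_x2_multiply_32f): element-wise product of two float vectors.
// A SIMD or CUDA protokernel must produce exactly this result, only faster.
void multiply_32f_generic(float* out, const float* a, const float* b,
                          std::size_t num_points) {
    for (std::size_t i = 0; i < num_points; ++i)
        out[i] = a[i] * b[i];
}
```

Since volk_profile times every available implementation of each kernel on the actual hardware and records the winner, a correct CUDA protokernel would be selected automatically wherever it wins -- without changing any GNU Radio block.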
Furthermore, gr-theano is worth a visit, because it actually uses
CUDA to accelerate channel models. The point here is that GPUs,
with their high memcpy latency (and CPU cost), aren't practical for
all problems. If I just want to add a small number of samples, doing
it on the CPU might simply pay off better; gr-theano, for example,
offers an FFT, which is one of the algorithms that typically works
on large vectors, where crossing the CPU/GPU boundary might be worth it.
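That boundary-crossing trade-off can be captured in a back-of-the-envelope cost model (all numbers illustrative assumptions, not measurements): offloading only pays off when the per-sample time saved on the GPU exceeds the fixed launch overhead plus the two host-device copies.

```cpp
#include <cstddef>

// Toy cost model: is it worth offloading an n-sample kernel to the GPU?
// Every parameter here is an illustrative assumption, not a measured value.
bool worth_offloading(std::size_t n,
                      double cpu_ns_per_sample,    // CPU cost per sample
                      double gpu_ns_per_sample,    // GPU cost per sample
                      double transfer_ns_per_byte, // PCIe cost per byte
                      double launch_overhead_ns,   // fixed kernel-launch cost
                      std::size_t bytes_per_sample) {
    double cpu = n * cpu_ns_per_sample;
    double gpu = n * gpu_ns_per_sample + launch_overhead_ns
               + 2.0 * n * bytes_per_sample * transfer_ns_per_byte; // in + out
    return gpu < cpu;
}
```

With such a model, adding a handful of samples stays on the CPU (the copies and launch dominate), while a large FFT amortizes the transfers over enough work to win.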
For my thesis, I'm trying to bring various parts of GNU Radio
over to the GPU. My idea is to rewrite already existing blocks with
CUDA, ideally without breaking compatibility with the current
implementation of GNU Radio, so that a normal user can use
these blocks without problems.
For the moment, I've become more familiar with GNU Radio, built
a CUDA FM receiver, and started to port some blocks to CUDA.
Minimizing host-device memcpy is mandatory.
My current approach is: each block loads its code and
communicates with its neighbours using async transfers, streams, and
so on (so I need to pass addresses of memory locations, locks, ...).
My next step will be: at the beginning, each block will send
down its device code and parameters; the block at the end of
the chain will then perform a dynamic compilation (CUDA 7). If I
have additional time, I'll also use warp parallelism (reducing ...).
Thanks in any case,
On Mon, 20 Apr 2015 at 12:48 Marcus Müller <address@hidden> wrote:
I just realized: things might be much easier than that.
What you do sounds like a job for a hierarchical block;
if you're not used to that concept: it's just a
"subflowgraph", represented as a block with in- and output ports.
If you put both your blocks inside, you'll always have
them together. And: in the constructor of your
hierarchical block, you can, for example, first construct
your CUDA block, and then give your "downstream" block
the pointer to it in its constructor.
To the user, this will look like one block, though there
are two (or more) inside.
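The constructor pattern Marcus describes can be sketched in plain C++ (no GNU Radio dependency; the class names here are made up for illustration): the hierarchical block constructs the CUDA block first, then hands its shared pointer to the downstream block.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for GNU Radio blocks, showing only the wiring.
struct cuda_block {
    // In a real block this would expose GPU state (device buffers, streams).
};

struct downstream_block {
    // The downstream block receives a pointer to its upstream peer at
    // construction time, so the two can share device state directly.
    explicit downstream_block(std::shared_ptr<cuda_block> up)
        : upstream(std::move(up)) {}
    std::shared_ptr<cuda_block> upstream;
};

// The "hierarchical block": to the user it is one unit, but it owns both.
struct hier_cuda_block {
    hier_cuda_block()
        : cuda(std::make_shared<cuda_block>()),
          down(std::make_shared<downstream_block>(cuda)) {}
    std::shared_ptr<cuda_block> cuda;
    std::shared_ptr<downstream_block> down;
};
```

In actual GNU Radio code the outer class would derive from gr::hier_block2 and connect() the two inner blocks to its own ports; the shared_ptr here plays the role of GNU Radio's sptr idiom.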
On 04/20/2015 12:29 PM, marco Ribero wrote:
Thank you very much. Your solution is much cleaner.
Have a good day,
On Mon, 20 Apr 2015 at 09:29 Marcus Müller <address@hidden> wrote:
What you describe as an ID already exists: every
block has a function alias(), giving it a
string "name", which can be used for lookup.
You will need to wrap your alias in a
pmt::intern to get it into a stream tag;
use that with block_lookup, and cast the
result to your_block_type::sptr.
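The alias() / block_lookup mechanism boils down to a registry of base-class pointers keyed by a string name, followed by a downcast. A plain-C++ analogy (a std::map standing in for GNU Radio's block registry, with dynamic_pointer_cast playing the role of the cast to your_block_type::sptr):

```cpp
#include <map>
#include <memory>
#include <string>

struct basic_block {                 // stand-in for gr::basic_block
    virtual ~basic_block() = default;
};
struct my_cuda_block : basic_block { // stand-in for your_block_type
    int device_id = 0;
};

// Stand-in for the global block registry, keyed by each block's alias.
std::map<std::string, std::shared_ptr<basic_block>> registry;

// Look a block up by its alias string and downcast to the concrete type;
// returns nullptr if the alias is unknown or the type does not match.
std::shared_ptr<my_cuda_block> lookup_cuda(const std::string& alias) {
    auto it = registry.find(alias);
    if (it == registry.end()) return nullptr;
    return std::dynamic_pointer_cast<my_cuda_block>(it->second);
}
```

In GNU Radio itself the alias travels inside a stream tag as a PMT symbol, which is why the pmt::intern wrapping step Marcus mentions is needed before the lookup.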