Re: [Discuss-gnuradio] GnuRadio and CUDA

From: Inderaj Bains
Subject: Re: [Discuss-gnuradio] GnuRadio and CUDA
Date: Mon, 24 Nov 2008 08:47:54 -0800

Hi Martin

1) You seem to be using atan on the host. Did you try writing one for
the device?

2) It seems you have each block implemented separately. Did you try to
put several of them together so that the data does not have to travel to
the card multiple times?

3) I don't quite understand the compilation process for the cuda code.
Can you give more detail on this? I have an empty cuda block at
the end of the pipeline (details follow).


Details of compile and runtime failure

I have the gr_how_to_write_a block calling the cuda function (in
another .cu file) that does malloc/copy/free. I am using this rule to
compile it:

        $(top_srcdir)/cudalt.py $@ $(NVCC) -c $(NVCCFLAGS) $<

==== RUN FAILURE ==============================================

address@hidden lib]# ../python/F_fm_cuda.py
Traceback (most recent call last):
  File "../python/F_fm_cuda.py", line 32, in <module> import howto
  File "/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib/howto.py",
line 6, in <module>
     import _howto
ImportError: /usr/local/lib/python2.5/site-packages/gnuradio/_howto.so:
undefined symbol: cudaFree

==== ENV ===================================================
address@hidden lib]# export | grep -i cuda
declare -x LD_LIBRARY_PATH=":/usr/local/cuda/lib:/usr/local/cuda/lib"
declare -x 

address@hidden lib]# export | grep -i python
declare -x 
address@hidden lib]#

====MAKE TRACE=============================================
address@hidden gr-howto-write-a-block-3.1.3]# make
make  all-recursive
make[1]: Entering directory
Making all in config
make[2]: Entering directory
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory
Making all in src
make[2]: Entering directory
Making all in lib
make[3]: Entering directory
make  all-am
make[4]: Entering directory
/bin/sh ../../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H
-I. -I../..  -DOMNITHREAD_POSIX=1 -pthread
-I/usr/local/include/gnuradio -I/usr/local/include
-I/usr/include/python2.5    -g -O2 -Wall -Woverloaded-virtual -pthread
-MT howto_square_ff.lo -MD -MP -MF .deps/howto_square_ff.Tpo -c -o
howto_square_ff.lo howto_square_ff.cc
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../..
-DOMNITHREAD_POSIX=1 -pthread -I/usr/local/include/gnuradio
-I/usr/local/include -I/usr/include/python2.5 -g -O2 -Wall
-Woverloaded-virtual -pthread -MT howto_square_ff.lo -MD -MP -MF
.deps/howto_square_ff.Tpo -c howto_square_ff.cc  -fPIC -DPIC -o
mv -f .deps/howto_square_ff.Tpo .deps/howto_square_ff.Plo
../../cudalt.py cuda_block.lo "nvcc" -c "-D_DEBUG -g -v -keep
-use_fast_math -I. -IUDASDK/common/inc" cuda_block.cu
#$ _SPACE_=
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ TOP=/usr/local/cuda/bin/..
#$ INCLUDES="-I/usr/local/cuda/bin/../include"
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../lib" -lcudart
#$ gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -fPIC
-I"." -I"UDASDK/common/inc" -D"_DEBUG" -include "cuda_runtime.h" -m32
-malign-double -g -o "cuda_block.cpp1.ii" "cuda_block.cu"
#$ cudafe --m32 --gnu_version=40102
--diag_error=host_device_limited_call -tused  --gen_c_file_name
"cuda_block.cudafe1.c" --stub_file_name "cuda_block.cudafe1.stub.c"
--stub_header_file_name "cuda_block.cudafe1.stub.h"
--gen_device_file_name "cuda_block.cudafe1.gpu" --include_file_name
cuda_block.fatbin.c "cuda_block.cpp1.ii"
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -fPIC
-I"." -I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
"cuda_block.cpp2.i" "cuda_block.cudafe1.gpu"
#$ cudafe --m32 --gnu_version=40102 --c  --gen_c_file_name
"cuda_block.cudafe2.c" --stub_file_name "cuda_block.cudafe2.stub.c"
--stub_header_file_name "cuda_block.cudafe2.stub.h"
--gen_device_file_name "cuda_block.cudafe2.gpu" --include_file_name
cuda_block.fatbin.c "cuda_block.cpp2.i"
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDABE__
-D__USE_FAST_MATH__  -fPIC -I"." -I"UDASDK/common/inc" -D"_DEBUG" -m32
-malign-double -g -o "cuda_block.cpp3.i" "cuda_block.cudafe2.gpu"
#$ filehash --skip-cpp-directives -s " " "cuda_block.cpp3.i" > "cuda_block.hash"
#$ nvopencc  -TARG:sm_10  -m32 "cuda_block.cpp3.i"  -o "cuda_block.ptx"
#$ ptxas --key="6b4cfc7a7afd183d"  -arch=sm_10  "cuda_block.ptx"  -o
#$ fatbin --key="6b4cfc7a7afd183d" --source-name="cuda_block.cu"
--usage-mode=" " --embedded-fatbin="cuda_block.fatbin.c"
#$ cudafe++ --m32 --gnu_version=40102
--diag_error=host_device_limited_call --dep_name  --gen_c_file_name
"cuda_block.cudafe1.cpp" --stub_file_name "cuda_block.cudafe1.stub.c"
--stub_header_file_name "cuda_block.cudafe1.stub.h"
#$ gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -fPIC -I"."
-I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
"cuda_block.cu.cpp" "cuda_block.cudafe1.cpp"
#$ gcc -D__CUDA_ARCH__=100 -c -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -fPIC -I"."
-I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
".libs/cuda_block.o" "cuda_block.cu.cpp"
/bin/sh ../../libtool --tag=CXX   --mode=link g++  -g -O2 -Wall
-Woverloaded-virtual -pthread  -module -avoid-version  -o _howto.la
-rpath /usr/local/lib/python2.5/site-packages/gnuradio howto.lo
howto_square_ff.lo howto_square2_ff.lo cuda_block.lo  -lstdc++
                -L/usr/local/lib -lgnuradio-core -lgromnithread
-lfftw3f -lm
libtool: link: rm -fr  .libs/_howto.la .libs/_howto.lai .libs/_howto.so
libtool: link: g++ -shared -nostdlib
/usr/lib/gcc/i386-redhat-linux/4.1.2/crtbeginS.o  .libs/howto.o
.libs/howto_square_ff.o .libs/howto_square2_ff.o .libs/cuda_block.o
-Wl,-rpath -Wl,/usr/local/lib -Wl,-rpath -Wl,/usr/local/lib
-L/usr/local/lib /usr/local/lib/libgnuradio-core.so
/usr/local/lib/libgromnithread.so -lrt /usr/local/lib/libfftw3f.so
-L/usr/lib/gcc/i386-redhat-linux/4.1.2/../../.. -lstdc++ -lm -lc
-lgcc_s /usr/lib/gcc/i386-redhat-linux/4.1.2/crtendS.o
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../crtn.o  -pthread
-pthread -Wl,-soname -Wl,_howto.so -o .libs/_howto.so
libtool: link: ( cd ".libs" && rm -f "_howto.la" && ln -s
"../_howto.la" "_howto.la" )
make[4]: Leaving directory
make[3]: Leaving directory
Making all in python
make[3]: Entering directory
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory
make[3]: Entering directory
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory
make[2]: Leaving directory
make[2]: Entering directory
make[2]: Leaving directory
make[1]: Leaving directory
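
A note on the trace above: the nvcc verbose output lists
LIBRARIES= "-L/usr/local/cuda/bin/../lib" -lcudart, but nvcc is only run
with -c, so that setting never reaches the final link. cudaFree lives in
the CUDA runtime library (libcudart), and the libtool link line for
_howto.la does not seem to name it. The small Python check below is only
a sketch of how to confirm that from a captured link line; the helper
name linked_libraries is made up for illustration, and LINK_LINE is
copied from the libtool --mode=link command in the trace.

```python
import shlex


def linked_libraries(link_command):
    """Return the set of library names passed as -lNAME in a link command."""
    libs = set()
    for token in shlex.split(link_command):
        # "-lfoo" names a library; a bare "-l" carries no name.
        if token.startswith("-l") and len(token) > 2:
            libs.add(token[2:])
    return libs


# Flags copied from the libtool --mode=link line in the make trace.
LINK_LINE = (
    "g++ -g -O2 -Wall -Woverloaded-virtual -pthread -module -avoid-version "
    "-o _howto.la -rpath /usr/local/lib/python2.5/site-packages/gnuradio "
    "howto.lo howto_square_ff.lo howto_square2_ff.lo cuda_block.lo -lstdc++ "
    "-L/usr/local/lib -lgnuradio-core -lgromnithread -lfftw3f -lm"
)

print("cudart linked:", "cudart" in linked_libraries(LINK_LINE))
# prints: cudart linked: False
```

If that is indeed the cause, adding something like
-L/usr/local/cuda/lib -lcudart to the libraries linked into _howto.la
would presumably resolve the symbol, though I have not verified this on
this setup.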

On Sun, Nov 16, 2008 at 2:46 PM, Martin DvH
<address@hidden> wrote:
> On Fri, 2008-11-14 at 16:42 -0800, Bob Keyes wrote:
>> I've just been given an Nvidia Quadro 5600 and I am thinking of using it for
>> DSP. Has anyone experimented with USRP & gnuradio & cuda?
> I have been working on this for quite some time now.
> I did a glsl implementation a few years back but it didn't perform that
> well and had some severe limitations.
> So I started over this year and have reimplemented a major part of
> GnuRadio using CUDA.
> It is a one to one implementation.
> (every gr_something block is replaced with a cuda_something block)
> My work-in-progress code is at:
> http://gnuradio.org/trac/browser/gnuradio/branches/developers/nldudok1/gpgpu-wip
> Make sure you read
> http://gnuradio.org/trac/browser/gnuradio/branches/developers/nldudok1/gpgpu-wip/README.cuda
> Caleb Phillips made a wiki about my code, you can find it at:
> http://www.smallwhitecube.com/php/dokuwiki/doku.php?id=howto:gnuradio-with-cuda
> The majority of the gnuradio-core code is an unmodified gnuradio checkout
> from a few months back.
> There are some important changes in gnuradio_core/src/lib/runtime
> to support CUDA device memory as an emulated circular buffer.
> I also implemented a gr.check_compare block which expects two input
> streams and checks if they are outputting the same data.
> I use this to check if my cuda blocks do exactly the same as the gr
> blocks.
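
To make the idea of such a comparison concrete, here is a minimal Python
sketch. It is not the gr.check_compare implementation, just an
illustration of comparing a reference stream against a candidate stream;
the function name first_mismatch and the optional tolerance parameter are
assumptions added for the example.

```python
def first_mismatch(reference, candidate, tolerance=0.0):
    """Return the index of the first differing sample, or None if none differ.

    With tolerance=0.0 the samples must be exactly equal. A length
    difference counts as a mismatch at the end of the shorter stream.
    """
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if abs(ref - cand) > tolerance:
            return index
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


cpu_output = [1.0, 2.0, 3.0]
gpu_output = [1.0, 2.0, 3.5]
print(first_mismatch(cpu_output, gpu_output))  # prints: 2
```

A non-zero tolerance may be worth having when the device code is built
with reduced-precision math options, since bit-exact agreement with the
host is then not guaranteed.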
> All the rest of the CUDA code is in gr_cuda.
> gr_cuda has to be configured and built separately.
> gr_cuda is where the cuda reimplementations of some gnuradio blocks
> are.
> Then there are also a few new blocks cuda_to_host and host_to_cuda which
> copy memory from and to the GPU device memory.
> All python scripts to test and use the code are in /testbed.
> The code in testbed is changing on a day-by-day basis.
> There are several issues to be well aware of when doing SDR on a GPU.
> -overhead
>        -call overhead
>        -copying data from and to the GPU
>        You need to do a lot of work on the GPU in one call to have any
> benefit.
> -circular buffers
>        -GPU memory can't be mmapped into a circular buffer
>                -solution 1: use copying to emulate a circular buffer
>                -solution 2: keep track of all the processing and make
>                 your own intelligent scheduler which does not need a
>                 circular buffer.
> -threads: with CUDA you can't access GPU device memory from different
> host-threads. So make sure you create, use and destroy all device memory
> from the same thread. (The standard GnuRadio scheduler does not do it
> like this)
> -debugging: Debugging is hard and works quite differently from normal
> debugging.
> -parallel: The GPU is good at doing calculations in parallel which are
> not dependent on each other. For this reason a FIR will perform well,
> while an IIR will perform badly. An IIR can only use one processing block
> of the GPU, instead of 128.
> It can still be beneficial to do the IIR on the GPU when all your other
> blocks are running on the GPU because you don't have to copy all samples
> to the CPU, do the IIR on the CPU and copy everything back to the GPU.
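
The FIR/IIR difference can be seen in a few lines of host-side Python.
This is only a sketch of the data dependencies, not GPU code and not the
gr_cuda implementation; the function names and the one-pole IIR form
y[n] = alpha*x[n] + (1-alpha)*y[n-1] are chosen for illustration.

```python
def fir_output(taps, x, n):
    """One FIR output sample. It reads only inputs, never other outputs."""
    return sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)


def fir(taps, x):
    """Every index n could be handed to a different thread in any order."""
    return [fir_output(taps, x, n) for n in range(len(x))]


def iir_one_pole(alpha, x):
    """Each output needs the previous output, so the loop is sequential."""
    y = []
    previous = 0.0
    for sample in x:
        previous = alpha * sample + (1.0 - alpha) * previous
        y.append(previous)
    return y


samples = [1.0, 1.0, 1.0]
print(fir([0.5, 0.5], samples))    # prints: [0.5, 1.0, 1.0]
print(iir_one_pole(0.5, samples))  # prints: [0.5, 0.75, 0.875]
```

fir_output can be evaluated for any n without having computed any other
output, which is what lets the work spread across the GPU. The IIR loop
carries previous from one iteration to the next, so it has to run in
order.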
> All that said, I do have a complete WFM receiver which is running
> completely on the GPU.
> (using FIR and/or FFT filters, quadrature_demod, fm-deemph)
> The FFT filters use the cuda provided FFT.
> It shouldn't be too hard to use the FFT for other things
> (just look at the code of gr_cuda/src/lib/cuda_fft_*)
> At the moment the complete wfm receiver is not running faster than on
> the CPU with my 9600GT card, mainly because of the call overhead (too
> few work items done per call) and the extra copying done to emulate
> circular buffers.
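
One way to picture "copying to emulate a circular buffer" is a
double-length buffer whose two halves are kept identical, so a
contiguous read never has to wrap. The Python sketch below is an
assumption about the general technique, not a description of how the
gpgpu-wip code does it; the class name MirroredBuffer and its methods are
hypothetical.

```python
class MirroredBuffer:
    """Double-length buffer whose halves are kept identical by copying."""

    def __init__(self, size):
        self.size = size
        self.data = [0] * (2 * size)
        self.write_index = 0

    def write(self, items):
        """Store items, writing each one twice (the extra copy)."""
        if len(items) > self.size:
            raise ValueError("cannot write more items than the buffer holds")
        for item in items:
            index = self.write_index
            self.data[index] = item
            self.data[index + self.size] = item  # mirror into second half
            self.write_index = (index + 1) % self.size

    def read(self, start, count):
        """Return count contiguous items from logical index start."""
        if count > self.size:
            raise ValueError("cannot read more items than the buffer holds")
        start %= self.size
        # Thanks to the mirror, this slice never needs wrap-around logic.
        return self.data[start:start + count]


buf = MirroredBuffer(4)
buf.write([1, 2, 3, 4])
buf.write([5, 6])          # overwrites logical positions 0 and 1
print(buf.read(2, 4))      # prints: [3, 4, 5, 6]
```

The price is that every item is written twice, which is the kind of
extra copying that adds to the per-call cost described above.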
> I can increase the amount of work done per call by using
> output_multiple. But with the current scheduling code the flow-graph can
> hang. This needs work.
> So the performance will change in the future.
> First I want to make sure everything is working as expected.
> If I benchmark a single block with a big output_multiple then I do see
> performance increases.
> Greetings,
> Martin

