
From: Martin DvH
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
Date: Wed, 01 Jul 2009 00:13:56 +0200

On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote:
> Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
> exist 5 different kernel functions to do the multiplication, where
> the default one,
> 
>         __global__ void
>         cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
>                                       const int noutput_items, const float konst)
> 
> is completely blank. But regardless, nobody calls these kernel
> functions.
cuda_multiply_const_ff was never finished, or I accidentally removed
some code here.
Either way, this is clearly a bug.
The empty kernel should call one of the implemented kernels.
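
For what it's worth, a minimal sketch of what the missing kernel body
could do (just a plain grid-stride multiply; the other kernel variants
in the file presumably do something more clever):

        __global__ void
        cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                                      const int noutput_items, const float konst)
        {
          // One thread per output sample, striding over the whole grid so
          // any noutput_items is covered regardless of the launch geometry.
          for (int i = blockIdx.x * blockDim.x + threadIdx.x;
               i < noutput_items;
               i += blockDim.x * gridDim.x)
            g_odata[i] = g_idata[i] * konst;
        }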


> Then, in the same file (which is called by cuda_multiply_const_ff.cc),
> there is this function
> 
>         int
>         get_cuda_multiply_const_ff_kernel_params( cuda_multiply_const_ff_kernel_params *params )
>         {
>           int result = 0;
>           //const unsigned int max_num_threads_per_block = MAX_NUM_THREADS_ALL;  //can use the maximum number of threads if wanted
>           //unsigned int max_num_blocks = MAX_NUM_BLOCKS_ALL;
> 
>           unsigned int num_blocks = 4096;            // = gridDim.x;   //NUM_CUDABLOCKS
>           unsigned int num_threads_per_block = 512;  // = blockDim.x;  //NUM_THREADS;
>           unsigned int num_outputs_per_block = num_threads_per_block;
> 
>           const unsigned int num_outputs_per_grid = num_outputs_per_block*num_blocks;  //(blockDim.x)*gridDim.x
> 
>           size_t dynamic_shared_mem_size = 0;  //256*sizeof(float); //0; //num_threads_per_block*sizeof(gr_complex);
>           dim3  griddim( num_blocks, 1, 1);
>           dim3  threaddim( num_threads_per_block, 1, 1);
> 
>           params->griddim = griddim;
>           params->threaddim = threaddim;
>           params->dynamic_shared_mem_size = dynamic_shared_mem_size;
>           params->num_outputs_padded = num_outputs_per_grid;
>           params->num_inputs_padded = num_outputs_per_grid;
>           params->num_inputs = 0;   //num_outputs_per_grid; //num_outputs;
>           params->num_outputs = 0;  //num_outputs_per_grid; //num_outputs;
> 
>           //Now you can do the kernel invocation like this:
>           //cuda_multiply_const_ff_filter_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>
>           //  (g_idata, g_odata, params->num_outputs_padded*X, konst);
>           return result;
>         }
> 
> The kernel invocation is completely commented out! The result is
> initialized to 0 at the top and returned at the end. All the work in
> between to specify and allocate thread and block sizes does not seem to
> matter. Not sure why this code exists this way; did someone make an
> edit, or did Martin specifically comment out the kernel invocation?
Yes, the kernel invocation is deliberately commented out.
The commented-out kernel invocation is only there as documentation on
how to do the actual kernel invocation.

There is still a small typo in that comment:
it should refer to cuda_multiply_const_ff_kernel and not
cuda_multiply_const_ff_filter_kernel.
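
With the typo fixed, that example line would read (X is just the
placeholder from the original comment):

        cuda_multiply_const_ff_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>
          (g_idata, g_odata, params->num_outputs_padded*X, konst);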



This method's name is get_cuda_multiply_const_ff_kernel_params, and it
does just that: determine the params needed for the kernel invocation.

The actual kernel invocation is done in:
cuda_multiply_const_ff_work_device


Both get_cuda_multiply_const_ff_kernel_params and
cuda_multiply_const_ff_work_device are called from normal C++ code in
cuda_multiply_const_ff_kernel.cu
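
As a rough sketch only (the real signature of
cuda_multiply_const_ff_work_device in the tree may differ, and the
error handling here is purely illustrative), the host-side sequence is
something like:

        // Hypothetical host-side wrapper -- the actual function in
        // cuda_multiply_const_ff_kernel.cu may look different.
        int
        cuda_multiply_const_ff_work_device(const float *g_idata, float *g_odata,
                                           int noutput_items, float konst)
        {
          cuda_multiply_const_ff_kernel_params params;
          get_cuda_multiply_const_ff_kernel_params(&params);   // grid/block dims etc.

          cuda_multiply_const_ff_kernel<<< params.griddim, params.threaddim,
                                           params.dynamic_shared_mem_size >>>
            (g_idata, g_odata, noutput_items, konst);

          // Block until the kernel has finished (CUDA 2.x era API).
          return (cudaThreadSynchronize() == cudaSuccess) ? 0 : -1;
        }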

Better examples to look at and benchmark would be
cuda_quadrature_demod_cuda_cf and
cuda_fir_filter_fff.

These are both used in
testbed/wfm/cuda_wfm_rcv.py,
which calls the complete cuda implementation of a WFM receiver in
cuda_wfm_rcv_cuda.py.

You will notice a cuda_multiply_const_ff block is instantiated as
volume_control but not used in the final connect, because it didn't work.
Now I know this is because of the bug you found.


> Is this supposed to be this way? I don't see how this can be a proper
> benchmark if it seems that we just test allocating threads and blocks
> on the device and memory access times, but really don't do any
> computation.
That is indeed not a good benchmark.
cuda_multiply_const_ff should be fixed first.

The fir-filter is, however, the thing you really want to benchmark
(which is also in there).
The more taps the fir-filter has, the more computations it will have to
make.
Note that the cuda implementation is at the moment limited in the number
of taps it can support at a certain decimation factor.
If I remember well, decimationfactor*numberoftaps should stay below 2048
(or was it 512); at a decimation of 8 that would mean at most 256 taps,
if the limit is indeed 2048.

Ways of improving the results include having gnuradio do calculations in
bigger chunks.
This can be done in several ways.
One way is to have a block at the end of the processing chain which has
a large output_multiple requirement and needs 1 input for every output.
In other words: connect a block to the end of the processing chain which
has set_output_multiple(large_number) in its constructor.

You can use gr_test for this.

There are other ways, which include changing the scheduler and the
buffer instantiation code.


I hope this helps,

Martin

> I am probably way off here, it doesn't make any sense... someone
> please clarify!
> 




