From: Yu-Hua Yang
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
Date: Tue, 30 Jun 2009 02:52:19 -0400

Upon a closer look at cuda_multiply_const_ff_kernel.cu, there are 5 different kernel functions to do the multiplication, of which the default one,

 __global__ void
cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,const int noutput_items,const float konst)

is completely empty. Regardless, nothing calls any of these kernel functions. Then, in the same file (which is used by cuda_multiply_const_ff.cc), there is this function:

get_cuda_multiply_const_ff_kernel_params ( cuda_multiply_const_ff_kernel_params *params )
  int result=0;
  //const unsigned int max_num_threads_per_block  = MAX_NUM_THREADS_ALL;   //can use the maximum number of threads if wanted
  //unsigned int max_num_blocks         = MAX_NUM_BLOCKS_ALL;

  unsigned int num_blocks=4096 ;// = gridDim.x;                                 //NUM_CUDABLOCKS          
  unsigned int num_threads_per_block=512;//  = blockDim.x;                     //NUM_THREADS;
  unsigned int num_outputs_per_block=num_threads_per_block;

  const unsigned int num_outputs_per_grid= num_outputs_per_block*num_blocks;  //(blockDim.x)*gridDim.x
  size_t dynamic_shared_mem_size = 0;//256*sizeof(float);//0;//num_threads_per_block*sizeof(gr_complex);
  dim3  griddim( num_blocks, 1, 1);
  dim3  threaddim( num_threads_per_block, 1, 1);


  //Now you can do the kernel invocation like this:
  //cuda_multiply_const_ff_filter_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>(g_idata, g_odata,  params->num_outputs_padded*X,konst);
  return result;

The kernel invocation is completely commented out! The result is initialized to 0 at the top and simply returned. All the work in between to specify and allocate the thread and block sizes does not seem to matter. I am not sure why the code is written this way: did someone make an edit, or did Martin deliberately comment out the kernel invocation? Is it supposed to be this way? I don't see how this can be a proper benchmark if we only test allocating threads and blocks on the device and memory access times, without actually doing any computation.
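
To make the numbers above concrete, here is a minimal host-side C++ sketch of the launch arithmetic in the quoted function, plus a plain CPU reference for what a multiply-by-constant kernel would be expected to output. The names LaunchGeometry, make_geometry and multiply_const_reference are hypothetical and made up for illustration; they are not part of the GNU Radio CUDA code, and the sketch assumes one output item per thread, as the quoted num_outputs_per_block line suggests.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the launch geometry set up in the quoted
// function. This is NOT the real cuda_multiply_const_ff_kernel_params.
struct LaunchGeometry {
    unsigned int num_blocks;
    unsigned int num_threads_per_block;
    unsigned int num_outputs_per_grid;
};

// Mirrors the quoted arithmetic, assuming one output item per thread:
// num_outputs_per_grid = num_outputs_per_block * num_blocks.
LaunchGeometry make_geometry(unsigned int num_blocks,
                             unsigned int num_threads_per_block) {
    return LaunchGeometry{num_blocks, num_threads_per_block,
                          num_threads_per_block * num_blocks};
}

// CPU reference: out[i] = in[i] * konst. A kernel that really did the
// multiplication could be checked against this element by element.
std::vector<float> multiply_const_reference(const std::vector<float>& in,
                                            float konst) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * konst;
    }
    return out;
}
```

With the values in the quoted code, make_geometry(4096, 512) gives 4096 * 512 = 2097152 outputs per grid. If the benchmark's device output never matches something like multiply_const_reference, that would support the suspicion that no multiplication is being performed.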
I am probably way off here, but this doesn't make any sense to me. Someone please clarify!

