discuss-gnuradio
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)


From: Dennis Glatting
Subject: Re: [Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)
Date: Fri, 24 Jul 2015 13:06:09 -0700

On Fri, 2015-07-24 at 10:51 -0400, Tom Rondeau wrote:
> Dennis,
> 
> 
> This is great. I'd like to see how to get some of this wrapped back as
> a patch for GNU Radio. However, a few things. We can't (yet) support C
> ++11 standards. In 3.7, we allow older versions of compilers and
> dependencies that don't have to support that, so neither can we
> enforce it. I'm planning on doing something about this for our 3.8
> release. One thing we can think about doing is having cmake discover
> if the compiler can us C++11 and add that as a flag, though none of
> our code can explicitly depend on this.
> 
> 
> Second, we try not to use C++ templates, mostly because VOLK is very
> type specific. I'm really surprised here that it made such a
> difference. However, Nathan West and I are looking at this now
> thinking that using a container of this kind might not be the write
> thing to do, anyways. Plus, Nathan is thinking about how to use VOLK
> here :)
> 
> 
> If you can put together a patch that gives us a bit of a boost here,
> that'd be great. But as you say, it doesn't look like this algorithm
> as it is will ever be fantastically fast. It was definitely meant more
> for hardware than this case.
> 

I am currently working at reintegrating the components (e.g., short
form) and interfaces then emerging the code back into GNURadio 3.7.8
(e.g., stripping c++11, object* on input to work(), etc). I'll send
patches.

I manage to improve the speed a little. std::deque is now gone and
std::vector is used. Average time is now .000166 (long form) in my
complex number test.

Although I haven't looked at the assembly code I seem to recall
std::vector has consistent access times and only allocates memory on
additions (I seem to recall std::vector over allocates but that may have
been implementation specific). I'm treating the vectors (delay lines) as
circular buffers. Therefore, only object copy overhead is incurred which
should be near zero for small data structures like gr_complex and float.

I tried to compile my code without -std=c++11 and it looks like the
non-c++11 changes will be minimal. For example, I used "noexcept" and
"=delete" in a couple of places and in the test section I use
std::unique_ptr.

I enclosed my code as it exists at the moment. The two key templates are
the first two followed by the near-old code with a templated version of
the moving average and non-templated dc_blockers, then test cases. I did
a fair amount of screwing around, some for screwing around's sake.
Looking at the code I feel it got simpler.

The compile line is simple:

c++ -O3 -Wsign-compare -Wall -Wno-uninitialized -fvisibility=hidden
-finline -std=c++11 main_dc.cc

>From a robustness perspective it would be great if GNURadio used
templates in places. I've seen some replicated code and that's often
problematic. That said, simple templates are best. The less simple, the
more troublesome IMO.

Also from a robustness perspective it would be great if someday GNURadio
supported c++11 because I don't see a lot of default or copy constructor
code (or operator=). It would be more robust if those
constructors/methods were created with "=delete." 

Thanks for the input and help.



> 
> Tom
> 
> 
> 
> 
> On Thu, Jul 23, 2015 at 10:20 PM, Dennis Glatting <address@hidden>
> wrote:
>         Was able to do better by rewriting the moving average (below).
>         Didn't
>         touch the blocker class itself other than to templatize it.
>         
>         
>         address@hidden:~/dc_test$ c++ -O3 -Wsign-compare -Wall
>         -Wno-uninitialized -fvisibility=hidden -finline -std=c++11
>         main_test.cc
>         address@hidden:~/dc_test$ size a.out
>            text    data     bss     dec     hex filename
>           28835     864     408   30107    759b a.out
>         
>                         original        opt+c11
>         
>         template:       ----------      0.00019348
>         deque:          0.000701038     0.0002474
>         queue:          0.00069784      0.000248843
>         list:           0.00194583      0.00212156
>         
>         
>         For my own amusement I ran these tests on a FreeBSD machine
>         that is
>         slower than my Ubuntu machine. This machine has a different
>         processor
>         (AMD 8350) at 4GHz.
>         
>         Compile lines:
>         
>         address@hidden g++49 -O3 -Wsign-compare -Wall -Wno-uninitialized
>         -fvisibility=hidden -finline -std=c++11 main_test.cc
>         
>         address@hidden g++5 -O3 -Wsign-compare -Wall -Wno-uninitialized
>         -fvisibility=hidden -finline -std=c++11 main_test.cc
>         
>         address@hidden clang++ -O3 -Wsign-compare -Wall
>         -Wno-uninitialized
>         -fvisibility=hidden -finline -std=c++11 -I /usr/local/include
>         main_test.cc
>         
>         address@hidden clang++-devel -O3 -Wsign-compare -Wall
>         -Wno-uninitialized
>         -fvisibility=hidden -finline -std=c++11 -I /usr/local/include
>         main_test.cc
>         
>         
>                   g++49         g++5        clang++(3.4.1) clang
>         ++(3.7.0)
>         
>         template: 0.000226272  0.00022474   0.000287082    0.00023594
>         deque:    0.000297232  0.000357486  0.000693899    0.000632054
>         queue:    0.000297098  0.000357484  0.000690884    0.000627877
>         list:     0.00457583   0.00418757   0.00411387     0.0040383
>         
>         
>         (I also compiled against gcc6 snapshot 20150719 but it's
>         similar in
>         performance to gcc5.)
>         
>         It's interesting the performance for deque/queue is much worse
>         under
>         clang and the templated moving average is fairly steady across
>         all
>         compilers.
>         
>         I think what this says is the following. First, we can improve
>         the
>         performance of dc_block but against the current algorithm it's
>         never
>         going to be awesome. Second, the performance of the template
>         library
>         containers is fairly good and consistent and the choice of
>         std::deque
>         was a fairly good one. Third, different compilers and
>         different template
>         libraries at different optimization levels is going to impact
>         performance (duh), perhaps negatively.
>         
>         One of the things GNURadio is missing is some indication of
>         block/library performance. For example, maybe -O2 is better
>         than -O3 and
>         maybe -std=c++11 worse than the default on a given platform.
>         However I
>         have little idea how we'd incorporate those metrics (seems
>         kind of nasty
>         when you think about the mechanics).
>         
>         
>         
>         
>         template<typename T>
>         class moving_average_t {
>         
>         private:
>         
>           // Relationships of the following variables:
>           //
>           //  d_len   = d_delay_line.size().
>           //  d_index = (d_index % d_len)
>           //
>         
>           // The length of the moving average queue and the current
>         index into
>           // it.
>           //
>         
>           size_t d_len, d_index;
>         
>           // The moving average delay line.
>           //
>           // The delay line is an array of type T, length d_len, and
>         where
>           // d_index is the oldest entry.
>           //
>         
>           std::vector<T> d_delay_line;
>         
>           T d_out, d_out_d2;
>         
>           // Rather than returning a constructed sample from filter(),
>         which
>           // of course can have construction/destruction overhead,
>         store the
>           // sample here and return a const reference.
>           //
>         
>           T d_temp;
>         
>         public:
>         
>                    moving_average_t( void ) = delete;
>                    moving_average_t( size_t len );
>           virtual ~moving_average_t( void );
>         
>           moving_average_t( const moving_average_t& t ) = delete;
>           moving_average_t& operator=( const moving_average_t& t ) =
>         delete;
>         
>           // Set the length of the delay line, reinitializing content.
>           //
>         
>           void set_len( size_t len );
>         
>           // Add a sample to the delay line and return the oldest and
>         averaged
>           // sample (removed from the delay line).
>           //
>         
>           const T& filter( const T& t );
>         
>           // The oldest, unadjusted (i.e., averaged) signal in the
>         delay line.
>           //
>         
>           const T& delayed_sig( void ) const;
>         
>         };
>         
>         template<typename T>
>         moving_average_t<T>::moving_average_t( size_t len )
>           : d_len( 0 ), d_index( 0 ),
>             d_out( 0 ), d_out_d2( 0 ),
>             d_temp( 0 ) {
>         
>           set_len( len );
>         }
>         
>         template<typename T>
>         moving_average_t<T>::~moving_average_t( void ) {}
>         
>         template<typename T>
>         void
>         moving_average_t<T>::set_len( size_t len ) {
>         
>           d_len   = len;
>           d_index = 0;
>           d_temp  = 0;
>         
>           d_out = d_out_d2 = 0;
>         
>           d_delay_line.clear();
>           for( size_t i = 0; i < d_len - 1; ++i )
>             d_delay_line.push_back( T(0));
>         
>         }
>         
>         template<typename T>
>         inline const T&
>         moving_average_t<T>::delayed_sig( void ) const {
>         
>           return d_out;
>         }
>         
>         template<typename T>
>         inline const T&
>         moving_average_t<T>::filter( const T& t ) {
>         
>           T d_out_d1 ( d_out );
>         
>           // Cache the oldest signal used in the average.
>           //
>         
>           d_out = d_delay_line[d_index];
>         
>           // Stuff the passed sample into the delay line and bump the
>         index.
>           //
>         
>           d_delay_line[d_index] = t;
>           d_index = (( d_index + 1 ) % ( d_len - 1 ));
>         
>           // Do the math.
>           //
>         
>           d_out_d2 = (t - d_out_d1 + d_out_d2 );
>           d_temp   = d_out_d2 / (float)(d_len);
>         
>           return d_temp;
>         }
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         
>         On Thu, 2015-07-23 at 12:48 -0700, Dennis Glatting wrote:
>         > I copied out the dc_block_cc block from 3.7.8 and ran some
>         performance
>         > tests against it, which I've summarized in a table below.
>         >
>         > I had to make some modifications to the original code, such
>         as:
>         >
>         >  * I removed the make wrapper.
>         >  * I tested against different containers.
>         >  * Different containers have different access/management
>         methods
>         >    which meant some changes to code body (I tried to be
>         consistent).
>         >  * On input I passed a std::vector to work() rather than
>         complex*.
>         >    Although this changes the flavor of work() I figure it's
>         relative.
>         >  * I only used long_form and deleted the short_form code. I
>         used
>         >    the key part of the original code.
>         >
>         > The three containers are the original std::deque then
>         std::queue and
>         > std::list. The results are interesting. I probably should
>         have looked at
>         > other containers such as std::vector but that might require
>         recoding.
>         >
>         > I also compiled with and without -std=c++11 because when i
>         looked at
>         > container source I saw a bunch of #ifdefs for >= c++0x.
>         >
>         > These are some of the problems with the original dc_block:
>         >
>         > * Passing by value rather than by reference.
>         > * No inlines.
>         > * const needed where const should be.
>         >
>         > So in a second copy of dc_block I did those things. I found
>         a case
>         > (filter()) where it returns by value and I left that one
>         alone.
>         >
>         > The table below summarizes the results. "Old" means my
>         reasonable(?)
>         > facsimile of the original dc_block. "+c11" means I added
>         -std=c++11 to
>         > the compile line. "Opt" is my optimized copy of the code
>         where I added
>         > references, inlines, etc. "Special" is "opt" but with
>         different compile
>         > options. All of the output is included at the end of this
>         message.
>         >
>         > The numbers you'll see for old/c++1/etc is the amount of
>         time it took to
>         > process /one/ sample. In "old+deque" for example (the first
>         item), it
>         > took 701us to process a sample. One of the surprising
>         numbers is that
>         > std::list sucks. Also, when looking at the assembly language
>         for
>         > filter() (copy below) I see reallocs(). That's not
>         surprising and
>         > probably badness. (BTW, "CPLX" is: "typedef
>         std::complex<float> CPLX;".)
>         >
>         > inline const CPLX
>         > moving_averager_c_list::filter( const CPLX& x ) {
>         >
>         >   d_out_d1 = d_out;
>         >   d_delay_line.push_back(x);
>         >   d_out = d_delay_line.front();
>         >   d_delay_line.pop_front();
>         >
>         >   CPLX y = x - d_out_d1 + d_out_d2;
>         >   d_out_d2 = y;
>         >
>         >   return (y / (float)(d_length));
>         > }
>         >
>         > The "size" numbers in the table are the text segment size
>         returned using
>         > "size a.out". The "block size" is simply a
>         sizeof(d_delay_line), which
>         > is really sizeof(std:deque<CPLX>) for example.
>         >
>         > One other note. I compiled "special" with -Ofast and it
>         failed content
>         > integrity check. Probably a bad option to use. :)
>         >
>         >
>         > My os:       Ubuntu 15.04.
>         > My compiler: gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13)
>         > My system:   AMD FX(tm)-9590 Eight-Core Processor @ 4.7GHz
>         >
>         > I'm happy to send copies of the test code (two files) for
>         review if
>         > someone wants to put them on the web. The three main code
>         blocks are
>         > pretty simple:
>         >
>         >   { dc_blocker_cc_deque dc( NUM_ELEM );
>         >
>         >     std::cout << "deque:" << std::endl;
>         >
>         >     t_start = gr::high_res_timer_now();
>         >     for( int i = 0; i < NUM_LOOPS; ++i )
>         >       for( int j = 0; j < NUM_COMPLEX; ++j )
>         >       dc.work( data, dc_deque );
>         >     timing( t_start, gr::high_res_timer_now(),
>         NUM_LOOPS*NUM_COMPLEX );
>         >
>         >   }
>         >
>         >
>         > #define NUM_LOOPS   5
>         > #define NUM_COMPLEX 10000
>         > #define NUM_ELEM    32
>         >
>         >
>         > Here's the summary table:
>         >
>         >
>         >         old          old+c11      opt         opt+c11
>         special
>         >
>         > deque:  0.000701038  0.000705963  0.000235234  0.00023607
>         0.000234233
>         > queue:  0.00069784   0.000705617  0.00023619   0.00023222
>         0.000237184
>         > list:   0.00194583   0.00243208   0.00191296   0.00193926
>         0.00194809
>         >
>         > text
>         > size:   26502        28902        21712         29574
>         23112
>         >
>         > text
>         > orig:   33821        26502
>         >
>         >
>         >         block size:
>         >
>         > deque:  80
>         > queue:  80
>         > list:   16
>         >
>         >
>         >
>         >
>         >
>         >
>         > Original facsimile (not c++11):
>         >
>         > address@hidden:~/dc_test$ c++ -O3  main.cc
>         > address@hidden:~/dc_test$ size a.out
>         >    text          data     bss     dec     hex filename
>         >   28902           856     280   30038    7556 a.out
>         >
>         > address@hidden:~/dc_test$ ./a.out
>         > Building complex number data...
>         > Done.
>         > GNURadio hi-res clock tps: 1000000000
>         > GNURadio sizeof(gr_complex): 8
>         > GNURadio sizeof(CPLX): 8
>         >
>         > dc_blocker_cc_deque: delay_line size=80
>         > deque:
>         > Done: total_t: 35051914970, sec_t: 35.0519, t/ea:
>         0.000701038
>         >
>         > dc_blocker_cc_queue: delay_line size=80
>         > queue:
>         > Done: total_t: 34892023951, sec_t: 34.892, t/ea: 0.00069784
>         >
>         > dc_blocker_cc_list: delay_line size=16
>         > list:
>         > Done: total_t: 97291349192, sec_t: 97.2913, t/ea: 0.00194583
>         >
>         >
>         >
>         >
>         > Original facsimile (c++11):
>         >
>         > address@hidden:~/dc_test$ c++ -O3 -std=c++11 main.cc
>         > address@hidden:~/dc_test$ size a.out
>         >    text          data     bss     dec     hex filename
>         >   21712           848     280   22840    5938 a.out
>         >
>         > address@hidden:~/dc_test$ ./a.out
>         > Building complex number data...
>         > Done.
>         > GNURadio hi-res clock tps: 1000000000
>         > GNURadio sizeof(gr_complex): 8
>         > GNURadio sizeof(CPLX): 8
>         >
>         > dc_blocker_cc_deque: delay_line size=80
>         > deque:
>         > Done: total_t: 35298153446, sec_t: 35.2982, t/ea:
>         0.000705963
>         >
>         > dc_blocker_cc_queue: delay_line size=80
>         > queue:
>         > Done: total_t: 35280849767, sec_t: 35.2808, t/ea:
>         0.000705617
>         >
>         > dc_blocker_cc_list: delay_line size=16
>         > list:
>         > Done: total_t: 121603777765, sec_t: 121.604, t/ea:
>         0.00243208
>         >
>         >
>         >
>         > Optimized code (not c++11):
>         >
>         > address@hidden:~/dc_test$ c++ -O3 -finline main_opt.cc
>         > address@hidden:~/dc_test$ size a.out
>         >    text          data     bss     dec     hex filename
>         >   29574           856     280   30710    77f6 a.out
>         >
>         > address@hidden:~/dc_test$ ./a.out
>         > Building complex number data...
>         > Done.
>         > GNURadio hi-res clock tps: 1000000000
>         > GNURadio sizeof(gr_complex): 8
>         > GNURadio sizeof(CPLX): 8
>         >
>         > dc_blocker_cc_deque: delay_line size=80
>         > deque:
>         > Done: total_t: 11761720007, sec_t: 11.7617, t/ea:
>         0.000235234
>         >
>         > dc_blocker_cc_queue: delay_line size=80
>         > queue:
>         > Done: total_t: 11809516472, sec_t: 11.8095, t/ea: 0.00023619
>         >
>         > dc_blocker_cc_list: delay_line size=16
>         > list:
>         > Done: total_t: 95647805916, sec_t: 95.6478, t/ea: 0.00191296
>         >
>         >
>         > Optimized code (c++11):
>         >
>         > address@hidden:~/dc_test$ c++ -O3 -finline -std=c++11
>         main_opt.cc
>         > address@hidden:~/dc_test$ size a.out
>         >    text          data     bss     dec     hex filename
>         >   23080           848     280   24208    5e90 a.out
>         >
>         >
>         > address@hidden:~/dc_test$ ./a.out
>         > Building complex number data...
>         > Done.
>         > GNURadio hi-res clock tps: 1000000000
>         > GNURadio sizeof(gr_complex): 8
>         > GNURadio sizeof(CPLX): 8
>         >
>         > dc_blocker_cc_deque: delay_line size=80
>         > deque:
>         > Done: total_t: 11803504003, sec_t: 11.8035, t/ea: 0.00023607
>         >
>         > dc_blocker_cc_queue: delay_line size=80
>         > queue:
>         > Done: total_t: 11610977298, sec_t: 11.611, t/ea: 0.00023222
>         >
>         > dc_blocker_cc_list: delay_line size=16
>         > list:
>         > Done: total_t: 96962902014, sec_t: 96.9629, t/ea: 0.00193926
>         >
>         >
>         > special (opt+c++11):
>         >
>         > address@hidden:~/dc_test$ c++ -Ofast -Wsign-compare
>         -Wall
>         > -Wno-uninitialized -fvisibility=hidden -finline -std=c++11
>         main_opt.cc
>         > address@hidden:~/dc_test$ size a.out
>         >    text          data     bss     dec     hex filename
>         >   23112           856     280   24248    5eb8 a.out
>         >
>         >
>         > address@hidden:~/dc_test$ ./a.out
>         > Building complex number data...
>         > Done.
>         > GNURadio hi-res clock tps: 1000000000
>         > GNURadio sizeof(gr_complex): 8
>         > GNURadio sizeof(CPLX): 8
>         >
>         > dc_blocker_cc_deque: delay_line size=80
>         > deque:
>         > Done: total_t: 11711630308, sec_t: 11.7116, t/ea:
>         0.000234233
>         >
>         > dc_blocker_cc_queue: delay_line size=80
>         > queue:
>         > Done: total_t: 11859205796, sec_t: 11.8592, t/ea:
>         0.000237184
>         >
>         > dc_blocker_cc_list: delay_line size=16
>         > list:
>         > Done: total_t: 97404287524, sec_t: 97.4043, t/ea: 0.00194809
>         >
>         > Data error i=0

Attachment: main_dc.cc
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]