
From: Marcus Müller
Subject: Re: [Discuss-gnuradio] Using volk kernels on basic operations of gr_complex, in my own custom blocks.
Date: Mon, 7 Mar 2016 22:45:47 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0

Hi Gonzalo,

> I installed perf top but I am not sure how to use it. I will investigate it.

Assuming you have built GNU Radio and your application with debugging symbols (for example, by configuring with "cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .."), try something like:

sudo sysctl kernel/perf_event_paranoid=-1
perf record -a {your program}
perf report

Best regards,
Marcus

On 03/07/2016 10:30 PM, Gonzalo Arcos wrote:
Thanks for your answer.

I installed perf top but I am not sure how to use it. I will investigate it. However, does the program need to be compiled in debug mode for the performance counters to have effect?

As a side question: has anyone managed to profile a GNU Radio application with valgrind or oprofile? I am very interested in getting this to work, since when I tried profiling with those tools and then opening KCachegrind, the displayed graph did not contain information about each block, let alone functions inside blocks. It has been several months since I tried this, but I remember that the result was something like 99.9% of the time spent in the start() function of the block, and I could not get any more information than that, which of course was not helpful at all.

2016-02-29 6:28 GMT-03:00 West, Nathan <address@hidden>:
It won't give you time spent, but 'perf top' is a nice tool that gives function-level performance counters for all running code. It comes with linux-tools and uses performance counters built into the kernel. There are also a couple of other perf subtools you can explore.

Regarding your full buffers, I think that's a result of GNU Radio's scheduler. If you have a flowgraph with A->B and B takes a very long time to process all of its samples, then A will always have full output buffers since it operates much faster. It's not necessarily bad or cause for concern, but performance improvements should focus on B.


On Sun, Feb 28, 2016 at 10:48 PM, Gonzalo Arcos <address@hidden> wrote:
Thanks to all of you for your very informative answers.

Douglas, I feel good now because you have described perfectly all the things I did / thought of to improve the performance :). I also agree that merging blocks should be a last resort. I have used the performance monitor and managed to improve the performance of the most expensive blocks. What I could not achieve, though, is profiling the program with a mainstream profiler like valgrind or oprofile, or some other profiler for Python. I remember that when visualizing the data, all the time was spent in the start() of the top block, and I could not get information pertaining to each block's general work, let alone functions executed within the block. After discovering the performance monitor, I used it in conjunction with calls to clock() to determine the time spent in each function within each block, to get a rough measurement. But if it is possible to get this information automatically, I am very interested in learning how to do it. Could you help me?

There is also another interesting aspect of improving performance, which is blocks being blocked because their output buffer is full. I've tried playing around a bit with the min and max output buffer sizes, but the performance did not seem to be affected.
After using the performance monitor to analyze the average buffer fullness, I see that most of them are relatively full; however, I do not know if they are full enough to make an upstream block wait to push data into the buffer.

2016-02-28 19:39 GMT-03:00 Douglas Geiger <address@hidden>:
The phenomenon Sylvain is pointing at is basically that as compilers improve, you should expect the 'optimized' proto-kernels to no longer show as dramatic an improvement compared with the generic ones. As to your question of 'is it worth it', that comes down to a couple of things: for example, how much of an improvement do you require to be 'worth it' (i.e., how much is your time worth, and/or how much of a performance improvement does your application require)? Similarly, is it worth it to you to get cross-platform improvements (which is one of the features of VOLK)? Or, perhaps, is it worth it to you just to learn how to use VOLK?

A couple of thoughts here: in general, when I have a flowgraph that is not meeting my performance requirements, my first step is to do some coarse profiling (e.g. via gr-perf-monitorx) to determine if there is a single block that is my primary performance bottleneck. If so, that is the block I will concentrate on for optimizations (both via VOLK and/or any algorithmic improvements, e.g. can I turn any run-time calculations into a look-up table computed either at compile time or within the constructor?).
If there is not a clear bottleneck, then next I look a little deeper using perf/oprofile to see which functions my flowgraph is spending a lot of time in: can I, e.g., create a faster version of some primitive calculation that all my blocks use a lot, and therefore get a speed-up across many blocks, which should translate into a faster over-all application?

Finally, if I still need more improvements, I would look at collecting many blocks together into a single, larger block. This is generally less desirable, since you now have a (more) application-specific block and it becomes harder to re-use in later projects, but if you have performance requirements that drive you there, then it absolutely is an option. At this point you likely have multiple operations being done to your incoming samples, and it becomes easy to collect all of those into a single larger VOLK call (and from there, create a SIMD-ized proto-kernel that targets your particular platform). So, while re-usability of code drives you away from this scenario, it offers the greatest potential for performance improvements, and thus is where many applications with high performance requirements tend to gravitate.

Ideally you can strike a balance between the two: have widely re-usable blocks, but with a set of operations inside them where you can take advantage of, e.g., SIMD-ized function calls to make them high-performance. Try to craft the block to be widely re-usable for a certain class of things (e.g. look at how the OFDM blocks are set up to be easily re-configurable for the many ways an OFDM waveform can be crafted). In the long run, having more knobs to turn to customize your existing code base to deal with whatever new scenario you are looking at 1/2/10 years from now is always better than a brittle solution that solves today's problem but is difficult to re-use to deal with tomorrow's.

Hope that was helpful. If you are interested in learning more about how to use VOLK - certainly have a look at libvolk.org - the documentation is (I think) fairly good at introducing the concepts and intent, as well as how the API looks/works. And certainly don't be shy about asking more questions here.

 Good luck,

On Sun, Feb 28, 2016 at 1:58 AM, Sylvain Munaut <address@hidden> wrote:
> Just wanted to ask the more experienced users if you think this idea is
> worth a shot, or the performance improvement will be marginal.

Performance improvement is vastly dependent on the operation you're doing.

You can get an idea of the improvement by comparing the volk_profile
output for the generic kernel (coded in pure C) and the SSE/AVX ones.

For instance, on my laptop, for some very simple ones (like float add), the generic kernel is barely slower than SIMD, most likely because the operation is so simple that even the compiler itself was able to SIMD-ize it.
But for other things (like complex multiply), the SIMD version is 10x faster ...



Discuss-gnuradio mailing list

Doug Geiger
