The phenomenon Sylvain is pointing at is basically that, as compilers improve, you should expect the 'optimized' proto-kernels to show a less dramatic improvement over the generic ones. As to your question of 'is it worth it' - that comes down to a couple of things: for example, how much of an improvement do you require for it to be 'worth it' (i.e., how much is your time worth, and/or how much of a performance improvement does your application require)? Similarly, is it worth it to you to get cross-platform improvements (which is one of the features of VOLK)? Or, perhaps, is it worth it to you just to learn how to use VOLK?
A couple of thoughts here: in general, when I have a flowgraph that is not meeting my performance requirements, my first step is to do some coarse profiling (e.g. via gr-perf-monitorx) to determine if there is a single block that is my primary performance bottleneck. If so, that is the block I will concentrate on for optimizations (both via VOLK and/or any algorithmic improvements - e.g. can I turn any run-time calculations into a look-up table computed either at compile time or within the block's constructor?).
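To make the look-up-table idea concrete, here is a rough sketch (the class name and table size are made up for illustration, not taken from any particular GNU Radio block):

// Hypothetical example: move per-sample trig calls out of the hot path and
// into a table that is filled once in the constructor.
#include <cmath>
#include <complex>
#include <vector>

class rotator_lut {
public:
    explicit rotator_lut(unsigned table_size = 4096) : d_table(table_size)
    {
        // One-time cost: pre-compute e^{j*2*pi*k/N} for every table entry,
        // instead of calling cos()/sin() for every incoming sample.
        const float two_pi = 6.283185307179586f;
        for (unsigned k = 0; k < table_size; k++) {
            const float phase = two_pi * k / table_size;
            d_table[k] = std::complex<float>(std::cos(phase), std::sin(phase));
        }
    }

    // Per-sample cost is now one table lookup and one complex multiply.
    std::complex<float> rotate(std::complex<float> in, unsigned phase_idx) const
    {
        return in * d_table[phase_idx % d_table.size()];
    }

private:
    std::vector<std::complex<float>> d_table;
};

The point is simply that the expensive calculation moves into one-time setup, leaving only cheap work in the per-sample path.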
If there is not a clear bottleneck, then I next look a little deeper using perf/oprofile to see which functions my flowgraph is spending a lot of time in: can I, e.g., create a faster version of some primitive calculation that all my blocks use heavily, and thereby get a speed-up across many blocks, which should translate into a faster overall application?
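As an example of that kind of shared primitive, here is a rough sketch comparing a hand-rolled magnitude loop against the equivalent VOLK dispatcher call (the helper names are hypothetical; volk_32fc_magnitude_32f itself is a real kernel):

#include <cmath>
#include <volk/volk.h>

// Hand-rolled version that several blocks might each be carrying around.
static void magnitude_scalar(float* out, const lv_32fc_t* in, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        out[i] = std::sqrt(lv_creal(in[i]) * lv_creal(in[i]) +
                           lv_cimag(in[i]) * lv_cimag(in[i]));
    }
}

// Same primitive routed through the VOLK dispatcher: one call, and it runs
// whichever proto-kernel (generic, SSE, AVX, NEON, ...) is best on this host.
static void magnitude_volk(float* out, const lv_32fc_t* in, unsigned int n)
{
    volk_32fc_magnitude_32f(out, in, n);
}

Because the dispatcher picks whichever proto-kernel volk_profile found to be fastest on the machine, every block that routes through a helper like this benefits at once.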
Finally, if I still need more improvement, I would look at collecting many blocks together into a single, larger block. This is generally less desirable, since you now have a (more) application-specific block that is harder to re-use in later projects, but if your performance requirements drive you there, then it absolutely is an option. At this point you likely have multiple operations being applied to your incoming samples, and it becomes easy to collect all of those into a single larger VOLK call (and from there, to create a SIMD-ized proto-kernel that targets your particular platform). So, while re-usability of code drives you away from this scenario, it offers the greatest potential for performance improvement, and thus is where many applications with high performance requirements tend to gravitate.

Ideally you can strike a balance between the two: i.e., have widely re-usable blocks, but with a set of operations inside them where you can take advantage of, e.g., SIMD-ized function calls to make them high-performance. If you can craft the block to be widely re-usable for a certain class of things, so much the better (e.g., look at how the OFDM blocks are set up to be easily re-configurable for the many ways an OFDM waveform can be crafted). In the long run, having more knobs to turn to customize your existing code base for whatever new scenario you are looking at 1/2/10 years from now is always better than a brittle solution that solves today's problem but is difficult to re-use to deal with tomorrow's.
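To illustrate the "collect several operations into one pass with VOLK calls" idea from above, here is a rough sketch of what the inner loop of such a combined block might look like (the function and parameter names are invented; the two VOLK kernels are real):

#include <volk/volk.h>

// What used to be two separate blocks (mix with a local oscillator, then
// apply a complex gain) collapsed into one pass over the samples.
static void mix_and_scale(lv_32fc_t* out,
                          const lv_32fc_t* in,
                          const lv_32fc_t* lo,  // pre-computed LO samples
                          lv_32fc_t gain,
                          unsigned int n)
{
    // out = in * lo (element-wise complex multiply)
    volk_32fc_x2_multiply_32fc(out, in, lo, n);
    // out = out * gain (multiply by a complex scalar, done in place)
    volk_32fc_s32fc_multiply_32fc(out, out, gain, n);
}

The second call runs in place on the output buffer, which is generally fine for element-wise kernels like these; if in doubt, a separate volk_malloc'd scratch buffer is the conservative choice.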
Hope that was helpful. If you are interested in learning more about how to use VOLK - certainly have a look at
libvolk.org - the documentation is (I think) fairly good at introducing the concepts and intent, as well as how the API looks/works. And certainly don't be shy about asking more questions here.
Good luck,
Doug