|Subject:||[Discuss-gnuradio] Pipelined processing with the Thread-Per-Block scheduler?|
|Date:||Tue, 9 Nov 2010 16:34:42 +1100|
I ran a simple experiment in GRC to test the performance of the Thread-Per-Block (TPB) scheduler. After searching here, I could not find anything definitive that explains the behaviour I observed, so I would appreciate your thoughts on the matter:
Three flow graphs were created in separate GRC documents. No graph uses throttling. Tests were run on a dual-core Linux machine using a 3.3git release.
1) One graph: a high-rate signal source connected to a resampler, which is in turn connected to a null sink.
2) Two identical disconnected sub-graphs: each contains a high-rate signal source connected to a resampler, which is in turn connected to a null sink (i.e. as above, just twice).
3) One graph: one high-rate signal source whose output is connected to the input of two separate resamplers, each of which is connected to its own null sink.
‘High-rate’ means a few Msps, and the resamplers output data at a similar rate (e.g. 8MHz, decim/interp=4:3).
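
To make the three topologies concrete, here is a minimal sketch that models them as plain edge lists and counts the blocks and the disconnected sub-graphs in each. This is not GNU Radio code: the block names and the count_subgraphs helper are made up for illustration. Since TPB gives each block its own thread, the block count is also the thread count.

```python
# Hypothetical model of the three test graphs (not the GNU Radio API).
from collections import defaultdict

def count_subgraphs(edges):
    """Return (number of blocks, number of disconnected sub-graphs)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]  # depth-first walk of one connected component
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return len(adj), components

# (1) source -> resampler -> null sink
GRAPH_1 = [("src", "resamp"), ("resamp", "sink")]
# (2) the same chain twice, not connected to each other
GRAPH_2 = [("src_a", "resamp_a"), ("resamp_a", "sink_a"),
           ("src_b", "resamp_b"), ("resamp_b", "sink_b")]
# (3) one source fanned out to two resamplers, each with its own sink
GRAPH_3 = [("src", "resamp_a"), ("src", "resamp_b"),
           ("resamp_a", "sink_a"), ("resamp_b", "sink_b")]

if __name__ == "__main__":
    for name, graph in (("1", GRAPH_1), ("2", GRAPH_2), ("3", GRAPH_3)):
        blocks, subgraphs = count_subgraphs(graph)
        print("graph (%s): %d blocks/threads, %d sub-graph(s)"
              % (name, blocks, subgraphs))
```

Only (2) contains more than one sub-graph, even though (1) and (3) also have several blocks and therefore several threads.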
Thanks to the TPB scheduler, (2) uses 100% CPU (max load on both cores) as the sub-graphs are disconnected.
However, when running (1) and (3), only 50% utilisation is observed. I also placed ‘Copy’ and ‘Kludge Copy’ blocks before the resampler inputs in (3), but this did not increase performance (which makes sense given the assumed flow model below).
I am not aware of the intricacies of the asynchronous flow model used, or the TPB scheduler (I only skimmed the source), but I wonder why (1) and (3) do not use more than 50% CPU?
Please excuse any gaps in my understanding, but my thoughts are as follows:
Asynchronous producer/consumer and push/pull graphs are obviously quite complicated to get right in all circumstances (I pulled my hair out designing one), and there are a number of ways data can be passed between blocks. Needless to say, GR generally does an excellent job of this.
In the particular scenario of (1) and (3), though, is the performance bottleneck the manner in which data is passed around, and how and when each block’s production/consumption state and thread state are changed? I’m not sure whether a push or pull model is used when there is no clock or throttle, but does the signal source block because it must wait until its own internal production buffer has been consumed by the resampler? If so, does the currently running thread simply switch back and forth between the signal source and the resampler?
This (in my mind) rests on the assumption that the buffer (memory region) passed to the resampler’s general_work actually lives inside the signal source block, and that there is no direct control over how much of that buffer is consumed in one iteration of the connected block’s general_work (in this case the resampler’s), other than indirectly via forecast in the connected block. Or is that not the case?
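
To illustrate the blocking I have in mind, here is a toy model with one thread per block and a small bounded buffer between the source and the resampler. It is only a sketch of my assumption, not how GNU Radio actually implements its buffers: the function names, the queue-based buffer and the "keep every second sample" stand-in for resampling are all made up. Note also that CPython's global interpreter lock prevents true parallel execution, so this models the blocking behaviour only, not CPU utilisation.

```python
# Toy thread-per-block pipeline (hypothetical; not GNU Radio's scheduler).
import queue
import threading

DONE = object()  # sentinel marking end of stream

def source(out_q, n_chunks, stalls):
    """Produce chunks; record every time the downstream buffer was full."""
    for i in range(n_chunks):
        chunk = tuple(range(i * 4, i * 4 + 4))
        try:
            out_q.put_nowait(chunk)
        except queue.Full:
            stalls.append(i)   # buffer full: the source has to wait
            out_q.put(chunk)   # blocks until the consumer frees a slot
    out_q.put(DONE)

def resampler(in_q, results):
    """Consume chunks; keeping every second sample stands in for real work."""
    while True:
        chunk = in_q.get()
        if chunk is DONE:
            break
        results.append(chunk[::2])

def run_pipeline(n_chunks=100, buffer_slots=2):
    q = queue.Queue(maxsize=buffer_slots)  # assumed fixed-size buffer
    stalls, results = [], []
    threads = [
        threading.Thread(target=source, args=(q, n_chunks, stalls)),
        threading.Thread(target=resampler, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, stalls

if __name__ == "__main__":
    results, stalls = run_pipeline()
    print("chunks processed:", len(results))
    print("times the source found the buffer full:", len(stalls))
```

The number of stalls varies from run to run. If the real scheduler behaves like this model with a very small buffer, the two threads would mostly take turns rather than overlap.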
This (empirical and thought) experiment should be framed with a view to pipelining. Ideally, as the graph is not throttled, the threads should seldom block and utilisation for (1) should be close to 100%, and (3) should be slightly less on dual-core (because in the best case only the signal source and one resampler can run at any one time). This would rely on produced data living either on-the-wire (connection) between blocks, or in the input stage of a connected block. Of course this comes with restrictions and overheads (I’m not sure what the base-class block does with regard to managing the data buffers passed to/from general_work). For (3), the data (memory block) produced by the signal source would be read-only, and could therefore be processed simultaneously by the two resampler blocks on separate cores, thus achieving greater throughput.
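
The read-only fan-out I am describing for (3) could look roughly like the sketch below: the source hands the same immutable chunk to two readers, each with its own bounded queue, so nothing is copied. Again this is a hypothetical model with made-up names, not GNU Radio's design, and under CPython the two readers would not actually run on separate cores.

```python
# Toy fan-out with shared read-only chunks (hypothetical model).
import queue
import threading

DONE = object()  # sentinel marking end of stream

def source(out_queues, n_chunks):
    for i in range(n_chunks):
        chunk = bytes([i % 256]) * 8   # immutable, so safe to share
        for q in out_queues:
            q.put(chunk)               # same object to every reader, no copy
    for q in out_queues:
        q.put(DONE)

def reader(in_q, seen):
    """Stand-in for a resampler: just records the chunks it was given."""
    while True:
        chunk = in_q.get()
        if chunk is DONE:
            break
        seen.append(chunk)

def run_fanout(n_chunks=50, buffer_slots=4):
    queues = [queue.Queue(maxsize=buffer_slots) for _ in range(2)]
    seen = [[], []]
    threads = [threading.Thread(target=source, args=(queues, n_chunks))]
    threads += [threading.Thread(target=reader, args=(q, s))
                for q, s in zip(queues, seen)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    a, b = run_fanout()
    print("chunks per reader:", len(a), len(b))
    print("shared without copying:", all(x is y for x, y in zip(a, b)))
```

One consequence of this arrangement is that the source is paced by the slower of the two readers, since it blocks as soon as either queue is full.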
Is a major architectural change required to realise this? Or if it has already been considered, are the overheads potentially so large that it would degrade performance?
Thanks for your thoughts,