From: Balint Seeber
Subject: [Discuss-gnuradio] Pipelined processing with the Thread-Per-Block scheduler?
Date: Tue, 9 Nov 2010 16:34:42 +1100
Dear all,

I conducted a simple experiment (using GRC) to test the TPB scheduler's performance, and after searching the archives here I cannot find any definitive information that would explain the observed behaviour. I kindly request your thoughts on the matter.

Three flow graphs were created in separate GRC documents. No graph uses throttling. Tests were run on a dual-core Linux machine using a 3.3git release.

1) One graph: a high-rate signal source connected to a resampler, which is in turn connected to a null sink.

2) Two identical disconnected sub-graphs: each contains a high-rate signal source connected to a resampler, which is in turn connected to a null sink (i.e. as above, just twice).

3) One graph: one high-rate signal source whose output is connected to the input of two separate resamplers, each of which is connected to its own null sink.

'High-rate' means a few Msps, and the resamplers output data at a similar rate (e.g. 8 MHz, decim/interp = 4:3).
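For reference, here is roughly what graph (3) looks like when expressed directly in Python rather than GRC. This is only a sketch against the 3.3-era gr/blks2 Python API; the sample rate, waveform and resampler parameters are placeholders rather than my exact values.

#!/usr/bin/env python
# Sketch of flow graph (3): one signal source fanned out to two
# resamplers, each feeding its own null sink. No throttle anywhere.
from gnuradio import gr, blks2

class fanout_graph(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        samp_rate = 6e6  # 'a few Msps' (placeholder)
        src = gr.sig_source_c(samp_rate, gr.GR_COS_WAVE, 100e3, 1.0)
        resamp_a = blks2.rational_resampler_ccc(interpolation=3, decimation=4)
        resamp_b = blks2.rational_resampler_ccc(interpolation=3, decimation=4)
        sink_a = gr.null_sink(gr.sizeof_gr_complex)
        sink_b = gr.null_sink(gr.sizeof_gr_complex)
        # Fan-out: both resamplers are fed from the same source output.
        self.connect(src, resamp_a, sink_a)
        self.connect(src, resamp_b, sink_b)

if __name__ == '__main__':
    fanout_graph().run()  # runs as fast as the scheduler allows; Ctrl-C to stop

Graphs (1) and (2) are the same with one resampler/sink chain removed, or with the whole chain (source included) duplicated, respectively.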
Thanks to the TPB scheduler, (2) uses 100% CPU (max load on both cores), as the sub-graphs are disconnected. However, when running (1) and (3), only 50% utilisation is observed. I also placed 'Copy' and 'Kludge Copy' blocks before the resampler inputs in (3), but this did not increase performance (which makes sense given the flow model assumed below).
I am not aware of the intricacies of the asynchronous flow model used, or of the TPB scheduler (I have only skimmed the source), but I wonder why (1) and (3) do not use more than 50% CPU. Please excuse any gaps in my understanding; my thoughts are as follows.

Asynchronous producer/consumer and push/pull graphs are obviously quite complicated to get right in all circumstances (I pulled my hair out designing one), and there are a number of ways data can be passed between blocks; needless to say, GR generally does an excellent job of this. In the particular scenario of (1) and (3), though, is the performance bottleneck the manner in which that data is passed around, and how/when the blocks' production/consumption state and thread state are changed? I am not sure whether a push or pull model is used in the absence of a clock or throttle, but does the signal source block because it must wait until its own internal production buffer is consumed by the resampler, so that the currently running thread switches back and forth between the signal source and the resampler? This (in my mind) rests on the assumption that the buffer (memory region) passed to the resampler's general_work actually lives inside the signal source block, and that there is no direct control over how much of that buffer is consumed in one iteration of the connected block's (in this case the resampler's) general_work, aside from indirectly via forecast in the connected block. Or is that not the case?
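To pin down the contract I am assuming between producer and consumer, here is a toy 3-outputs-per-4-inputs block: forecast() only requests input, and general_work() itself reports (via consume()) how much was actually eaten. The Python class and signatures are illustrative, in the shape of gr_block's interface, and are not an exact GR API (and the Python forecast signature has varied between releases).

import numpy as np
from gnuradio import gr

class ratio_block(gr.basic_block):
    """Toy 3-out-per-4-in block; illustrative only, not a real GR block."""
    def __init__(self):
        gr.basic_block.__init__(self, name="ratio_block",
                                in_sig=[np.complex64],
                                out_sig=[np.complex64])

    def forecast(self, noutput_items, ninput_items_required):
        # Only a *request* to the scheduler: for noutput_items outputs we
        # would like about 4/3 as many inputs. The upstream (writer-side)
        # buffer may still offer less than this.
        for i in range(len(ninput_items_required)):
            ninput_items_required[i] = (noutput_items * 4 + 2) // 3

    def general_work(self, input_items, output_items):
        n_in = len(input_items[0])
        n_out = min(len(output_items[0]), (n_in * 3) // 4)
        output_items[0][:n_out] = input_items[0][:n_out]  # stand-in for DSP
        # The downstream block itself decides how much input it consumed;
        # the producer only learns of this after the fact.
        self.consume(0, (n_out * 4) // 3)
        return n_out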
This (empirical and thought) experiment should be framed with a view to pipelining. Ideally, since the graph is not throttled, the threads should seldom block: utilisation for (1) should be close to 100%, and for (3) slightly less on a dual-core machine (because in the best case only the signal source and one resampler can run at any one time). This would rely on produced data living either 'on the wire' (the connection) between blocks, or in the input stage of a connected block; of course this comes with restrictions and overheads (I am not sure what the base-class block does with regard to managing the data buffers passed to/from general_work). For (3), the data (memory block) produced by the signal source would be read-only, and could therefore be processed simultaneously by the two resampler blocks on separate cores, achieving greater throughput.
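What I have in mind is essentially a single-writer/multi-reader buffer: each downstream block keeps its own read pointer, the writer may only reclaim space that every reader has released, and readers never mutate the data, so two readers can work over the same region concurrently. A toy (and decidedly non-GR) illustration:

import threading

class ring_buffer(object):
    """Toy single-writer/multi-reader circular buffer (illustrative only)."""

    def __init__(self, size, nreaders):
        self.buf = [None] * size
        self.size = size
        self.wr = 0               # total items ever written
        self.rd = [0] * nreaders  # total items read, per reader
        self.cond = threading.Condition()

    def _space(self):
        # Writer may only overwrite items that *every* reader has released.
        return self.size - (self.wr - min(self.rd))

    def write(self, items):
        with self.cond:
            while self._space() < len(items):
                self.cond.wait()
            for x in items:
                self.buf[self.wr % self.size] = x
                self.wr += 1
            self.cond.notify_all()

    def read(self, reader, nitems):
        # Readers only ever *read* the shared region, so any number of
        # them can fetch the same items and process them on other cores.
        with self.cond:
            while self.wr - self.rd[reader] < nitems:
                self.cond.wait()
            start = self.rd[reader]
            items = [self.buf[i % self.size]
                     for i in range(start, start + nitems)]
            self.rd[reader] += nitems  # release space behind this reader
            self.cond.notify_all()
            return items

If buffers worked this way, the fan-out in (3) would let both resamplers chew on the source's output simultaneously instead of serialising behind it.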
Is a major architectural change required to realise this? Or, if it has already been considered, are the overheads potentially so large that it would degrade performance?

Thanks for your thoughts,
Balint