Marcus, thank you very much for your detailed answer.
Your suggestion to use your_block::detail()->input(0)->buffer()->link seems very interesting, so I need to evaluate whether to use this mechanism or just a plain memory region allocated with malloc.
I'd prefer to make life as simple as possible for the user. I'll publish my blocks in a repository, hoping that some users will find them useful.
Your proposal of a hierarchical approach is interesting, but in my case it could be more problematic: the user's flowgraph can be a DAG or a general graph, and having each involved block search for the next block could add some complexity, considering that I only need to open a connection between adjacent blocks.
To summarize what I'm doing: for my thesis, I'm working on a partial port to CUDA, so I'm re-implementing blocks in CUDA. Each block allocates a circular buffer of device pages and passes these pointers, along with other information, to the next block (using a tag that carries a pointer to host memory). Now I want to establish an initial handshake, because the downstream block may have preferences about the incoming data (e.g. min/max size of each page, min_multiple for the input of an FFT, ...). These preferences are usually not mandatory, because the first step of a block is often to copy data from global to shared memory, which allows some degree of freedom. Each block receives data through device memory, processes all of it, and returns a single dummy output whose only purpose is to wake up the other blocks. At the end of the chain, the data is copied back into host memory.
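
To make the handshake idea more concrete, here is a minimal sketch of how two adjacent blocks could agree on a page size. It is only an illustration under my own assumptions: the names PagePrefs, negotiate_page_size, min_items, max_items and multiple are hypothetical and are not part of GNU Radio or CUDA. The upstream block proposes a size and the downstream block's preferences adjust it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical: preferences a downstream block advertises in the handshake.
struct PagePrefs {
    std::size_t min_items;  // smallest acceptable page size, in items
    std::size_t max_items;  // largest acceptable page size, in items
    std::size_t multiple;   // page size must be a multiple of this (e.g. FFT length)
};

// Hypothetical helper: pick a page size close to the upstream proposal that
// satisfies the downstream preferences. Returns 0 if no size fits, in which
// case the upstream block could simply keep its own size, since the
// preferences are usually not mandatory.
std::size_t negotiate_page_size(std::size_t proposed, const PagePrefs& prefs)
{
    assert(prefs.multiple > 0);
    if (prefs.min_items > prefs.max_items) {
        return 0;  // inconsistent preferences
    }

    // Cap the proposal at the downstream maximum.
    std::size_t capped = proposed < prefs.max_items ? proposed : prefs.max_items;

    // Round down to the required multiple.
    std::size_t size = (capped / prefs.multiple) * prefs.multiple;

    // If that fell below the minimum, take the smallest multiple >= minimum.
    if (size < prefs.min_items) {
        size = ((prefs.min_items + prefs.multiple - 1) / prefs.multiple) * prefs.multiple;
    }

    // Reject results that are empty or outside the accepted range.
    if (size == 0 || size < prefs.min_items || size > prefs.max_items) {
        return 0;
    }
    return size;
}
```

For example, with an FFT block that wants pages of 1024 to 4096 items in multiples of 1024, an upstream proposal of 5000 items would be reduced to 4096. The agreed size would then be used when the upstream block allocates its circular buffer of device pages.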