It really does need to happen one sample at a time, at least assuming I keep the algorithm I'm using now. I'm pretty much using the method Sylvain suggests. The rotation is one operation of many inside a block, i.e. many rotations happen per call to work(), but only one rotation per input sample, because each rotation depends on what happened with the previous samples.
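To make the dependency concrete, here is a minimal sketch of that kind of loop. It is not my actual block: the function name, the BPSK-style decision and the loop_gain value are all made up for illustration. The point is only that the rotation applied to sample n uses a phase estimate that was updated from sample n-1, so the loop body can't be handed to a vector routine as-is.

```python
import cmath

def rotate_per_sample(samples, loop_gain=0.05):
    """Hypothetical first-order phase-tracking loop (illustration only)."""
    phase = 0.0          # running phase estimate, carried between samples
    out = []
    for x in samples:
        # Rotate this sample by the current estimate.
        y = x * cmath.exp(-1j * phase)
        # Assumed BPSK-style hard decision on the rotated sample.
        decision = 1.0 if y.real >= 0 else -1.0
        # Residual phase relative to the decided symbol.
        error = cmath.phase(y * decision)
        # Feedback: the next rotation depends on this update.
        phase += loop_gain * error
        out.append(y)
    return out
```

Feeding it a few hundred identical samples with a fixed phase offset should show the output phase settling toward zero, one sample at a time.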
I'm still processing what Tom and Doug are suggesting otherwise. The modulation is generally produced by something that isn't GNU Radio. When we recreate the modulation in software, we definitely use a form that is easily done in vector form. I haven't quite wrapped my head around how to do the same on the receiving end while still achieving optimum detection... Got any good papers?
Converting everything to phase might be a half-way reasonable approach.
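A rough sketch of what I mean by that, with the same caveats: the names, the BPSK-style fold and the gain are assumptions, not what my block does. The conversion to phase has no feedback, so that part could be done over the whole buffer in one go; the per-sample loop then only does a subtract and a wrap instead of a complex multiply. The catch is that the amplitude is thrown away, which is probably part of why this may not give optimum detection.

```python
import cmath
import math

def wrap(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def rotate_in_phase_domain(samples, loop_gain=0.05):
    """Hypothetical phase-only version of a first-order tracking loop."""
    # No feedback here, so this step is a candidate for vectorizing.
    phases = [cmath.phase(x) for x in samples]
    estimate = 0.0       # running phase estimate, carried between samples
    out = []
    for p in phases:
        # Rotation is now just a subtraction plus a wrap.
        r = wrap(p - estimate)
        # Assumed BPSK-style fold onto the nearest of 0 or pi.
        error = r if abs(r) <= math.pi / 2.0 else wrap(r - math.pi)
        estimate += loop_gain * error
        out.append(r)
    return out
```

The output here is a list of corrected phases rather than complex samples, so anything downstream would have to work on phase as well.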
In the near term, I only need to make this ~22% faster. It's possible this would just work on a faster processor.