For a realtime application where performance matters, what is the general view on which approach performs better: spawning threads on the fly (pthread_create), or preallocating a group of threads and waking them with signalling (condition variables/locking)?

Background: a 120 Hz graphics update with heavy data processing, but too much data and programming complexity to use GPU/OpenCL. There is a fixed number of parallel jobs (6). Within one cycle, some pieces of work can be forked off, while others remain stage-based.
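To make the second option concrete, here is a minimal sketch of what I mean by preallocated threads with signalling. It is only an illustration, not code from the real application: the names `pool_start`, `pool_run_cycle`, `pool_stop` and `demo_job` are made up, and it assumes all 6 jobs run the same function once per cycle and that the pool is started and stopped only once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_JOBS 6                       /* fixed number of parallel jobs */

typedef void (*job_fn)(int job_index, void *arg);

static pthread_t       workers[NUM_JOBS];
static pthread_mutex_t lock       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  work_done  = PTHREAD_COND_INITIALIZER;
static unsigned long   cycle = 0;        /* bumped once per update cycle */
static int             pending = 0;      /* workers not yet finished this cycle */
static bool            shutting_down = false;
static job_fn          current_job;
static void           *current_arg;

static void *worker_main(void *p)
{
    int index = (int)(intptr_t)p;
    unsigned long seen = 0;              /* last cycle this worker handled */

    for (;;) {
        pthread_mutex_lock(&lock);
        /* Loop guards against spurious wakeups. */
        while (cycle == seen && !shutting_down)
            pthread_cond_wait(&work_ready, &lock);
        if (shutting_down) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        seen = cycle;
        job_fn fn  = current_job;
        void  *arg = current_arg;
        pthread_mutex_unlock(&lock);

        fn(index, arg);                  /* the actual work, done unlocked */

        pthread_mutex_lock(&lock);
        if (--pending == 0)              /* last one out wakes the caller */
            pthread_cond_signal(&work_done);
        pthread_mutex_unlock(&lock);
    }
}

/* Create the workers once, up front. Sketch only: aborts on failure. */
void pool_start(void)
{
    for (int i = 0; i < NUM_JOBS; i++) {
        int rc = pthread_create(&workers[i], NULL, worker_main,
                                (void *)(intptr_t)i);
        assert(rc == 0);
        (void)rc;
    }
}

/* Run fn on every worker and block until all of them have finished. */
void pool_run_cycle(job_fn fn, void *arg)
{
    pthread_mutex_lock(&lock);
    current_job = fn;
    current_arg = arg;
    pending = NUM_JOBS;
    cycle++;
    pthread_cond_broadcast(&work_ready);
    while (pending > 0)
        pthread_cond_wait(&work_done, &lock);
    pthread_mutex_unlock(&lock);
}

/* Call only after the last pool_run_cycle has returned. */
void pool_stop(void)
{
    pthread_mutex_lock(&lock);
    shutting_down = true;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NUM_JOBS; i++)
        pthread_join(workers[i], NULL);
}

/* Hypothetical job: each worker adds to its own slot of an int array. */
void demo_job(int job_index, void *arg)
{
    int *out = arg;
    out[job_index] += job_index + 1;
}
```

The per-frame code would call `pool_run_cycle(demo_job, out)` once per update, with `pool_start()` and `pool_stop()` at program start and exit. The on-the-fly alternative would replace `pool_run_cycle` with six `pthread_create` calls followed by six `pthread_join` calls every cycle.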

Any views would be good to digest.