Start N processes, one per core (for example, 8), nominally aiming for 100% busy or less. If the system is already 25% busy, it may be worth leaving that load intact, for example by starting with 6.
After a moment, review total idle time. For, say, 30% idle, try 30 * N / 70 more; for N = 8 that is 3 more, making the system nominally 96.25% busy. Or get greedy with 4 more, for a nominal 105%.
After another moment, if the target is not approximately reached, cap the parallelism at the amount the idle time indicates, maybe plus 1, rounded up; for instance, for 20% idle, 8 * 80 / 70 = 10. If there is essentially no increase in CPU use, it may be better to lower the cap, perhaps below 8. If the target is reached, these steps can be repeated as processing progresses: if CPU use drops, parallelism can be increased, and if it rises, no more processes are spawned until the running count drops.
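The ramp-up steps above can be sketched as a few small helpers. This is a minimal sketch, not a definitive implementation: the function names and the idea of a fixed "baseline busy" fraction are assumptions for illustration, and a real version would sample idle time from the OS (for example /proc/stat on Linux).

```python
import math

def initial_workers(cores, busy_fraction=0.0):
    # Start one worker per core, leaving any pre-existing load intact:
    # 8 cores already 25% busy -> start 6.
    return max(1, round(cores * (1.0 - busy_fraction)))

def additional_workers(current, idle_fraction):
    # With `current` workers the system is (1 - idle) busy; assuming each
    # worker costs roughly the same CPU share, this many more should fill
    # the idle headroom: idle * current / busy.
    busy = 1.0 - idle_fraction
    if busy <= 0.0:
        return current  # nothing measurably running yet; double up
    return int(idle_fraction * current / busy)

def parallelism_cap(baseline, idle_fraction, baseline_busy):
    # If the target is not reached, cap parallelism at what the observed
    # idle time indicates, rounded up: e.g. 8 * 0.80 / 0.70 -> 10.
    return math.ceil(baseline * (1.0 - idle_fraction) / baseline_busy)
```

With the numbers from the text: `initial_workers(8, 0.25)` gives 6, `additional_workers(8, 0.30)` gives 3, and `parallelism_cap(8, 0.20, 0.70)` gives 10.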
Thus 100% CPU, if usable, is utilized on the nose at the middle or tail of the operation, wherever demand is lowest. At times when CPU demand expands past 100%, use of other resources may drop, but that is all the CPU you bought. If the other resources cap the parallelism, there is no point in more: extra processes reduce stability and increase any rerun time after an interruption, because fewer of them reach completion before it.
For instance, on TCP network transfers, a second process often adds 5-10% total throughput and a third adds nothing, yet 3 might still be a nice level of parallelism: when a process terminates or a packet is lost, you drop to 2 and never lose that 10%, unless 2 or 3 end simultaneously. The price of the second process is that the first drops from 90% to 50% of full speed, and a third takes them all to 33%, so choosing between 2 for faster per-unit turnaround and 3 for better total bandwidth during job end/start or packet loss is a matter of taste and situation.
If reliability is never an issue (no interruptions such as network loss), overloading some resource on a host with plenty of RAM does not hurt final run time. There may still be some loss when going past the number of cores, even with idle time available, if cache hit rates fall because each core is forced through more context switches. Added cache latency can turn into critical process latency if progress is somehow tied to event turnaround time, as in a transfer with insufficient buffering.