Thanks for the detailed reply, NotFuzzyWarm. I don't want to dig much deeper; this old dog just flat didn't think about the multiple cpu\core overhead at all from a hardware design perspective. But I can certainly observe it in practice when I run too many threads mining on my x-way machines, and I observed it on the high dollar midranges in my software career too. Never intentionally run a conventional general purpose machine at 100%, even if it's all your resource to use, 'cause you will lose performance overall. Even if you have tons of memory, you're gonna swap. The different algorithms do have some different limiting factors, but you gotta leave some cpu free to account for that. Performance generally decreases if you run more than n-1 threads against n cores. With a bunch of cpus with a buncha cores, that performance penalty would seem to increase the more cpus\cores you jam in.
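For anyone who wants the n-1 rule as something they can paste into a launcher script, here is a rough sketch. The function name pick_thread_count and the reserve parameter are made up for illustration, not from any miner. Note that os.cpu_count() reports logical CPUs (hyperthreads included), so on some boxes you may want to reserve more than one.

```python
import os


def pick_thread_count(logical_cpus=None, reserve=1):
    """Return how many mining threads to start, leaving `reserve` CPUs free.

    Hypothetical helper: always returns at least 1 so a single-core box
    still mines.
    """
    if logical_cpus is None:
        # os.cpu_count() can return None if the count is undetermined
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus - reserve)


if __name__ == "__main__":
    # Print the thread count this machine would get under the n-1 rule
    print(pick_thread_count())
```

You would then hand that number to whatever thread-count option your miner takes. Whether one free cpu is enough depends on the algorithm and the machine, so treat it as a starting point and measure.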
In general software performance, my favorite zen answer for folks asking me how to increase performance was 'access less data'. That would almost never work in this crypticverse if your code is very tight already; it's pretty much foiled by the work the algorithm requires. I studied Cryptonight and Wolf0's Monero code for cpu mining for quite a while before understanding I was never gonna get more than a ~2% performance increase with software changes. The only way I even got ~2% was native compiles with an optimization flag.
I get it good enough now; I don't want or need to understand it top to bottom. Perfect is the enemy of good.

Maybe now that I'm a Jr member or whatever I can embed my joke picture from a couple weeks ago.
Mining on a phone.
http://i.imgur.com/FhRz1pq.jpg