Coffee Lake is not the only one with bottlenecks...
lyra2z330, Core i5-7600 (non-K) locked at 3.9 GHz on all cores, 16 GB DDR4-2400 dual channel
2 threads w/o affinity, AVX2 build => ~830 h/s, which shows up as ~50% load on each of the 4 cores
2 threads with --cpu-affinity 3, AVX2 build => ~865 h/s (this is interesting), which shows up as ~100% load on cores 0 and 1
4 threads w/o affinity, AVX2 build => ~790 h/s (that's crap)
4 threads with --cpu-affinity 15, AVX2 build => ~792 h/s (that's crap, too)
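Four threads being slower than two suggests the limit is memory or cache rather than the cores themselves, so I guess these CPUs need quad-channel RAM to perform at full speed.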
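
For anyone wondering about the affinity values: as the results above show, --cpu-affinity 3 pinned the threads to cores 0 and 1, which fits the value being a bitmask where bit N selects core N (3 = 0b0011, 15 = 0b1111). A minimal Python sketch of that decoding, assuming the bit-per-core reading is right; the helper name cores_from_mask is mine and not part of the miner:

```python
def cores_from_mask(mask: int) -> list[int]:
    """Return the core indices whose bits are set in an affinity bitmask.

    Assumption: bit N of the mask corresponds to logical core N.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    # The two masks used in the runs above.
    for mask in (3, 15):
        print(f"--cpu-affinity {mask} ({mask:#06b}) -> cores {cores_from_mask(mask)}")
```

So to test two threads on cores 0 and 2 instead, the mask under this reading would be 5 (0b0101).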

There is a tool that can show L3 cache usage:
https://www.cpuid.com/softwares/perfmonitor-2.html. It is old as heck, but it still works on some (!) configurations. It could work on the i7-6700 and help with optimizations. I will also try
https://github.com/opcm/pcm, which supports Intel's Cache Monitoring Technology.