I want to use only 1 thread per physical core on both the CPUs, but I'm not able to do it.
Use the tool I've posted a few posts above (msg27696971) to identify physical cores, and then use the appropriate affinity mask.
Do not underestimate the importance of a correct affinity setting. Some algos (like lyra2*) get spread across all cores if thread count < core count, e.g. you have 4 cores but want to use 2 threads, then you may get all 4 cores 50% utilized, and that is not what you usually want :-)
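To illustrate what "appropriate affinity mask" means here, a minimal sketch in Python that builds a mask with exactly one logical CPU per physical core. The topology dictionary and the function name `affinity_mask` are made up for illustration; the real mapping of logical CPUs to physical cores depends on your CPU and OS, so take it from the detection tool rather than from this example.

```python
def affinity_mask(core_to_logical):
    """Return a bitmask with one logical CPU selected per physical core.

    core_to_logical maps a physical core id to the list of logical CPU
    ids (hyperthreads) that belong to it. The lowest logical id of each
    core is the one that gets its bit set.
    """
    mask = 0
    for logical_ids in core_to_logical.values():
        mask |= 1 << min(logical_ids)
    return mask


# Hypothetical topology: 4 physical cores, 2 hyperthreads each,
# with sibling threads numbered next to each other.
adjacent = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(hex(affinity_mask(adjacent)))  # 0x55

# Same hardware, but assuming siblings are numbered N apart instead.
split = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(hex(affinity_mask(split)))  # 0xf
```

The two results show why you need to identify the physical cores first: the same CPU needs a different mask depending on how the logical CPUs are numbered. If you want fewer threads than cores, pass only the cores you want to use, so the threads stay pinned instead of being spread over all of them.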
Interesting and probably faster because it's benchmark tested. He had the benefit of seeing the results
and tweaking. I gave up on super-optimizing memcpy and went with a simpler approach because all I
wanted was to avoid some of the overhead of detecting alignment, odd sizes and vector capabilities.
I've just benched that variant for cryptonight (using it in skein, keccak, jh and blake256) with an sse2 build, manually setting the correct L3 cache size in the header for 4 particular CPUs (Pentium G620, Pentium G4600, i3-7350k and i5-7600).
So, the speed "boost" for the resulting hashrate is ~ +0.006%, yay

And this also costs +5% in resulting binary size.
Will continue soon with other algos. I'm interested in an avx version together with the avx algos.