A few benchmarks on yescryptr16:
3.7.7 "v1" (gcc 4.8.3)
avx - ~930 h/s
sse2 - ~870 h/s
3.7.7 v2 (gcc 5.3.1)
avx - ~950 h/s
sse2 - ~970 h/s
3.7.7 4ward (gcc 6.2.1)
avx - ~970 h/s
sse2 - ~960 h/s
Additional algos that perform better with sse2 than with avx (although the difference is very small):
yescrypt, polytimos and lbry
This suggests gcc 5.3.1 might be the issue, since it is the only build where sse2 beats avx on yescryptr16. Is that on a Coffee Lake CPU? If not, that rules out Coffee Lake
as the cause and it looks purely like a compiler version issue.
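
For reference, here is a quick sketch of the arithmetic behind that reading, using only the approximate yescryptr16 figures quoted above. The dictionary labels and the helper name `avx_gain_percent` are made up for illustration. Since the numbers are rounded ("~"), gaps of a percent or two may well be within run-to-run noise.

```python
# Approximate hash rates (h/s) copied from the benchmark post above.
benchmarks = {
    "3.7.7 v1 (gcc 4.8.3)": {"avx": 930, "sse2": 870},
    "3.7.7 v2 (gcc 5.3.1)": {"avx": 950, "sse2": 970},
    "3.7.7 4ward (gcc 6.2.1)": {"avx": 970, "sse2": 960},
}


def avx_gain_percent(rates):
    """Return how much faster avx is than sse2, in percent (negative if slower)."""
    return (rates["avx"] - rates["sse2"]) / rates["sse2"] * 100.0


# Relative avx-vs-sse2 difference for each build.
gains = {build: avx_gain_percent(rates) for build, rates in benchmarks.items()}

for build, gain in gains.items():
    print(f"{build}: avx vs sse2 = {gain:+.1f}%")

# Builds where sse2 came out ahead of avx.
sse2_faster = [build for build, gain in gains.items() if gain < 0]
print("sse2 faster than avx on:", sse2_faster)
```

With these figures the gcc 5.3.1 build is the only one where the sign flips, which is what points at the compiler version rather than the CPU.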