Even in assembly, we have very little control over the cache. There are just a few instructions: movntdqa and the prefetchnta hint (nt == non-temporal == no cache). JCE mostly lets the CPU handle everything, except in "use_cache":false mode.
I'm making progress on the assembly for multi-hash on AES-64 (the easiest to write).
I reach 79 h/s with hexa-hash on CN-v7, which is very bad, but on IPBC, where the JCE multihash does marvels, I get some interesting results. More tests and optimizations to do.
Surprisingly, I didn't run out of registers in x64 with 6 hashes at the same time. But on 32 bits it will be terrible. I'll provide hexa-hash in 32 bits for code symmetry, but it's probably useless.
edit: I reach 1700 h/s on IPBC on my stock Ryzen 1600 with JCE 0.26 and its multihash, with config multi
3+1+1+1+1+1+3+1+1+1+1+1
so 12 threads, two triple and ten single. Curiously, the doubles perform worse...
That's the fastest combination I found. On CN-v7 I still cannot beat the default 8x single.
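For anyone wanting to sanity-check a layout like the one above, here is a minimal Python sketch of the arithmetic. It is not part of JCE and does not read JCE's config format: `parse_multi` and `summarize` are hypothetical helper names. The scratchpad sizes are assumptions (1 MB per hash for the lite variant IPBC uses, 2 MB per hash for CN-v7), and so is the 16 MB of L3 on a Ryzen 1600.

```python
def parse_multi(layout):
    """Split a layout like '3+1+1' into hashes-per-thread counts."""
    return [int(part) for part in layout.split("+")]


def summarize(layout, scratchpad_mb):
    """Count threads, hashes in flight, and total scratchpad size.

    scratchpad_mb is an assumed per-hash scratchpad size in MB.
    """
    per_thread = parse_multi(layout)
    hashes = sum(per_thread)
    return {
        "threads": len(per_thread),
        "hashes": hashes,
        "scratchpad_mb": hashes * scratchpad_mb,
    }


# Layout from the post, assuming 1 MB per hash on IPBC.
ipbc = summarize("3+1+1+1+1+1+3+1+1+1+1+1", scratchpad_mb=1)

# Default 8x single layout, assuming 2 MB per hash on CN-v7.
cn_v7 = summarize("+".join(["1"] * 8), scratchpad_mb=2)

print(ipbc)
print(cn_v7)
```

Under those assumptions both layouts come out at 16 MB of scratchpad in total, which would match the assumed L3 size. That is only one possible reading of why these combinations work well, not a confirmed explanation.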
Damn! Awesome for a stock hexa-core Ryzen!
How do I set this up? I can run some tests on a Ryzen 7 to see the scaling with 8 cores (but the same cache).