>> So what was the purpose of dumping here this garbage C++ edition then? Why not using some pseudocode or intrinsics which would show the idea about the efficient SIMD implementation of scrypt?
Because I posted code that could be simplified using NAND/NOR optimizations. The assembler version is very hard to read. I wasn't trying to show an efficient SIMD implementation.
>> Better instructions scheduling on superscalar processors for making use of triple issue is not using "less calculations", it is more like just doing "the same calculations faster".
English is not my mother tongue, sorry if I failed to explain my idea. In a few words, it's more like "the CPU spends less time".
>> That's an ordinary performance optimization and has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had a hope that you have something more impressive to show and was somewhat disappointed.
Look at my function named Salsa(). Each cell of the 16-word vector uiA is XORed 4*2 = 8 times. My phrase was
"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."
It's not about the posted code, it's about the NAND/NOR optimization. I didn't say that the posted code eliminates any XORs.
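For reference, this is the plain-C Salsa20/8 core that scrypt uses (the function and variable names here are mine, not from the posted code). Each of the 4 double-rounds XORs every one of the 16 cells exactly twice, which is where the 4*2 = 8 XORs per cell come from:

```c
#include <stdint.h>

#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))

/* Salsa20/8 core: 4 double-rounds (column round + row round).
   Every round XORs each of the 16 cells once, so over 8 rounds
   each cell is the target of 4*2 = 8 XORs. */
static void salsa20_8(uint32_t B[16])
{
    uint32_t x[16];
    for (int i = 0; i < 16; i++) x[i] = B[i];
    for (int i = 0; i < 8; i += 2) {
        /* column round */
        x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
        x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
        x[ 9] ^= R(x[ 5]+x[ 1], 7);  x[13] ^= R(x[ 9]+x[ 5], 9);
        x[ 1] ^= R(x[13]+x[ 9],13);  x[ 5] ^= R(x[ 1]+x[13],18);
        x[14] ^= R(x[10]+x[ 6], 7);  x[ 2] ^= R(x[14]+x[10], 9);
        x[ 6] ^= R(x[ 2]+x[14],13);  x[10] ^= R(x[ 6]+x[ 2],18);
        x[ 3] ^= R(x[15]+x[11], 7);  x[ 7] ^= R(x[ 3]+x[15], 9);
        x[11] ^= R(x[ 7]+x[ 3],13);  x[15] ^= R(x[11]+x[ 7],18);
        /* row round */
        x[ 1] ^= R(x[ 0]+x[ 3], 7);  x[ 2] ^= R(x[ 1]+x[ 0], 9);
        x[ 3] ^= R(x[ 2]+x[ 1],13);  x[ 0] ^= R(x[ 3]+x[ 2],18);
        x[ 6] ^= R(x[ 5]+x[ 4], 7);  x[ 7] ^= R(x[ 6]+x[ 5], 9);
        x[ 4] ^= R(x[ 7]+x[ 6],13);  x[ 5] ^= R(x[ 4]+x[ 7],18);
        x[11] ^= R(x[10]+x[ 9], 7);  x[ 8] ^= R(x[11]+x[10], 9);
        x[ 9] ^= R(x[ 8]+x[11],13);  x[10] ^= R(x[ 9]+x[ 8],18);
        x[12] ^= R(x[15]+x[14], 7);  x[13] ^= R(x[12]+x[15], 9);
        x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
    }
    for (int i = 0; i < 16; i++) B[i] += x[i];
}
```

Every `^=` target above is one of the 16 cells; count them per double-round and you get each cell hit exactly twice.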
>> You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them are walking over the buffer sequentially one after another. And because the buffer is larger than L1 cache ("a CPU with 128K L1 per core is quite... uncommon" as you have noticed), you get L1 cache misses on each of these loops.
I thought so too, before I tried both ways. The CPU detects when code does sequential memory access and prefetches the data into cache before the code needs it.
>> Merging these first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.
The merged loop accesses 3 different chunks of memory. This leads to more MOVDQA instructions and forces the CPU to maintain 3 prefetch streams. My version needs only 2 streams, which improves memory access.
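To make the trade-off concrete, here is a minimal scalar sketch (names and buffer size are illustrative, not from the posted code): the two-pass version streams over at most two regions per loop, while the merged version touches three regions per iteration. Both compute the same result.

```c
#include <stdint.h>
#include <string.h>

#define WORDS 32768u  /* illustrative buffer size in 32-bit words */

/* Two sequential passes: each loop streams over at most two memory
   regions, so the hardware prefetcher tracks only two streams. */
static void two_pass(uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
    for (uint32_t i = 0; i < WORDS; i += 16)      /* pass 1: dst = a */
        memcpy(&dst[i], &a[i], 16 * sizeof(uint32_t));
    for (uint32_t i = 0; i < WORDS; i += 16)      /* pass 2: dst ^= b */
        for (uint32_t k = 0; k < 16; k++)
            dst[i + k] ^= b[i + k];
}

/* One merged pass: same work, but the loop body now reads/writes
   three regions (a, b, dst) -- three prefetch streams. */
static void merged_pass(uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
    for (uint32_t i = 0; i < WORDS; i += 16)
        for (uint32_t k = 0; k < 16; k++)
            dst[i + k] = a[i + k] ^ b[i + k];
}
```

Which one wins depends on the CPU's prefetcher; the point is only that merging changes the number of concurrent access streams, not the amount of work.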
>> The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. So on the first loop you just run the SIMD code in exactly the same way as scalar code without the need for any shuffle instructions. However this data layout becomes problematic on the last loop because you need to access different memory areas for different hashes ('j' indexes are different for each of the simultaneously processed 4 hashes) and gather scattered data into different lanes of SIMD registers. Introducing the intermediate loop for deinterleaving the data in addition to XORing it would fix the problem. But the data needs to be interleaved again on the last loop. The disadvantage is extra interleaving/deinterleaving overhead. The advantage is no need for shuffles and possibly better instructions scheduling. Is this what your SSE2 code is actually trying to do?
I got rid of the PSHUFDs this way. I collect the data from uiStorage that gets XORed with the 4 hashes into a register, using 4 other registers as pointers (EAX, EBX, ECX, EDX). The cost of the PSHUFDs is much bigger than the cost of collecting data into a register. EAX, EBX, ECX and EDX are set in the previous loop iteration, so by the time I really need the data it's already in cache.
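As I understand the gather step, it looks roughly like this scalar sketch (names, chunk size, and layout are my assumptions, not the actual assembler): four per-hash pointers, standing in for EAX/EBX/ECX/EDX, select four different chunks of storage, and word k of each chunk is collected into the four lanes of one interleaved group, i.e. what one SIMD register would hold.

```c
#include <stdint.h>

#define LANES 4         /* 4 hashes processed simultaneously */
#define CHUNK_WORDS 32  /* assumed words per chunk, for illustration */

/* Gather-and-XOR: p[0..3] are the four per-hash chunk pointers
   (the EAX/EBX/ECX/EDX of the assembler version). out is stored
   lane-interleaved, so out[k*LANES + lane] is lane `lane` of the
   SIMD register that would hold word k of all four hashes. */
static void gather_xor(uint32_t *out, const uint32_t *const p[LANES])
{
    for (uint32_t k = 0; k < CHUNK_WORDS; k++)
        for (uint32_t lane = 0; lane < LANES; lane++)
            out[k * LANES + lane] ^= p[lane][k];
}
```

Because the 'j' index differs per hash, the four pointers point at unrelated chunks; collecting lane by lane like this is what replaces the PSHUFD shuffles.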