You are implying that the memory latency is somehow important, but this is also bullshit.
Interesting. So the GPU threads stall until the memory read completes (given that, for the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix). That makes sense given the huge number of threads available on a GPU, but I wonder whether this approach would work on an FPGA too (using external SDRAM): use internal block RAM to hold the per-thread state (B/Bo) and switch threads while waiting for the SDRAM. Not sure that actually works. Food for thought, thanks.
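To put rough numbers on the thread-switching idea, here is a toy model in Python. It is only a sketch under loud assumptions: the cycle counts are made up, the function name `utilization` is my own, and the memory is assumed to accept any number of overlapping reads (real SDRAM bandwidth and bank conflicts would make things worse). Each thread computes for a while, issues its dependent read, and cannot continue until the data arrives; one compute unit serves whichever thread is ready first.

```python
def utilization(num_threads, compute_cycles, latency_cycles, total_cycles=100_000):
    """Fraction of time a single compute unit stays busy.

    Toy model (assumption, not a real GPU/FPGA): every thread repeats
    "compute for compute_cycles, then wait latency_cycles for a read
    whose address depends on the result just computed".
    Memory is assumed fully pipelined, so reads from different threads overlap.
    """
    ready_at = [0] * num_threads  # cycle at which each thread's data has arrived
    now = 0    # current cycle
    busy = 0   # cycles spent computing
    while now < total_cycles:
        # Pick the thread whose data arrives (or arrived) earliest.
        t = min(range(num_threads), key=ready_at.__getitem__)
        if ready_at[t] > now:
            # Nobody is ready: the compute unit stalls until the read returns.
            now = ready_at[t]
            if now >= total_cycles:
                break
        now += compute_cycles
        busy += compute_cycles
        ready_at[t] = now + latency_cycles  # issue the next dependent read
    return busy / now


if __name__ == "__main__":
    # Hypothetical figures: 10 cycles of blockmix work, 90 cycles of read latency.
    for n in (1, 5, 10, 20):
        print(n, "threads ->", round(utilization(n, 10, 90), 2))
```

In this model the unit stays fully busy once the number of threads reaches roughly (compute + latency) / compute, which is the reason a large pool of threads can make the latency of each individual read stop mattering. Whether an FPGA has enough block RAM to hold that many B/Bo states is a separate question the model does not answer.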
PS ssvb, you have some very interesting threads linked in your post history. Thank you for posting here, I'm late to this party and this helps enormously.