Thanks very much for that explanation. I've been working off the original scrypt.c in cgminer (the OpenCL GPU code is rather beyond my ken), but your Cell code does look useful.
I believe that similar pipelining for hiding the latency of external DRAM accesses can also be easily implemented on an FPGA or ASIC. But the FPGA or ASIC must still have a lot of memory bandwidth even after the scratchpad size reduction, otherwise the external memory will become a performance bottleneck. Beating the GPUs equipped with fast GDDR5 is going to be a tough challenge.
I'm just playing with the FPGA implementation as a hobby, though I'm hoping it may be of some use with all those bitcoin FPGA boards that are going to be just junk in a few months as the nethash climbs exponentially. So my code just uses the internal block RAM resource, which is very limited (4.5 Mbit on an LX150, enough for 4 full or 9 half scratchpads). Fitting the cores is no problem, but routing them is a nightmare. I've recently been looking at pipelining, but it seems this just makes it even more unroutable. Still, there may be a way forward and your input was very welcome.
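For anyone checking the "4 full or 9 half" figure, here is a rough sketch of the arithmetic. It assumes the Litecoin-style scrypt parameters (N=1024, r=1, so a full scratchpad is 128 KiB, i.e. 1 Mbit) and takes 4.5 Mbit as 4.5 * 2^20 bits. The names `pads_that_fit` and `lookup_gap` are just made up for illustration, and it ignores how the scratchpads actually map onto individual block RAM primitives.

```python
# Assumed scrypt parameters (Litecoin-style); not stated in the post itself.
N = 1024                  # number of scratchpad entries
BLOCK_BYTES = 128         # 128 * r bytes per entry, with r = 1
FULL_PAD_BITS = N * BLOCK_BYTES * 8   # 1,048,576 bits = 1 Mbit

def pads_that_fit(bram_bits, lookup_gap=1):
    """How many scratchpads fit if only every lookup_gap-th entry is stored."""
    pad_bits = FULL_PAD_BITS // lookup_gap
    return bram_bits // pad_bits

# 4.5 Mbit of block RAM, interpreted as 4.5 * 2^20 bits (an assumption).
BRAM_BITS = int(4.5 * 1024 * 1024)

full = pads_that_fit(BRAM_BITS, lookup_gap=1)
half = pads_that_fit(BRAM_BITS, lookup_gap=2)
print(full, half)
```

A half scratchpad stores every second entry and recomputes the missing ones on demand, which is why it trades memory for extra salsa rounds.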
Jasinlee has his own project looking at using external SDRAM, which I guess will look a lot like a GPU-style solution (exactly the same problems with RAM bandwidth and latency).
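To put a rough number on the bandwidth problem, here is a back-of-envelope sketch. It again assumes N=1024, r=1, and uses a simplified traffic model: the first loop writes each stored entry once, and the second loop does one 128-byte read per iteration. It ignores DRAM burst lengths, row activation and refresh, so real requirements would be higher. The 100 kH/s target is purely a hypothetical figure, and the function names are mine.

```python
# Assumed scrypt parameters (Litecoin-style); not stated in the post itself.
N = 1024
BLOCK_BYTES = 128

def bytes_per_hash(lookup_gap=1):
    """Scratchpad traffic for one hash under the simplified model above."""
    writes = (N // lookup_gap) * BLOCK_BYTES  # entries stored in the first loop
    reads = N * BLOCK_BYTES                   # one lookup per second-loop iteration
    return writes + reads

def bandwidth_mb_per_s(khash_per_s, lookup_gap=1):
    """Sustained scratchpad bandwidth in MB/s (10^6 bytes) for a given hash rate."""
    return khash_per_s * 1000 * bytes_per_hash(lookup_gap) / 1e6

# Hypothetical target of 100 kH/s, full and half scratchpads.
print(bandwidth_mb_per_s(100, lookup_gap=1))
print(bandwidth_mb_per_s(100, lookup_gap=2))
```

Under this model, halving the scratchpad only cuts the write traffic, so the bandwidth requirement drops by a quarter rather than by half.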
Well, I left the party long ago.

I wish you well. I'm currently reading through your (and D&T's and others') old threads for inspiration, so your words do live on.
