Re: Tenebrix scaling questions

Quote from: kano on September 28, 2011, 11:49:23 PM

But my comment about that table is that either the first line is wrong or misread.

A HD6970 will do 100x a typical CPU and certainly at LEAST 10x any CPU the same price as it.

That table says double-sha256 1:120
So unless it is totally crap, that can only mean CPU:GPU

Then the next line says scrypt(1024,1,1) 1:5.2
Which would mean GPU is 5.2 times faster.

Now, I have no idea where he pulled those numbers out of ... but reading that table says GPU will be 5.2 times CPU on N=1024,p=1,r=1
I'm not saying the table is based on real info, just that it doesn't match what's written and just switching the numbers would simply mean they are not reliable at all.

No, that's *one 3GHz K10 CPU core* vs. a HD6970
A 3GHz K10 core is roughly 3Mh/s doing double-sha256 aka bitcoinhash.
A stock HD6970 is about 360-400Mh/s for bitcoinhash.
Hey look, 3:360 == 1:120, a miracle!

So yes, the factor of 1:5.2 is for ONE CPU CORE vs. a 6970.
For the math or comprehension impaired, that's "A hex(=6)core PhenomII at 3GHz is a few % faster than a HD6970"
And this is from actual measurements of real code running on real devices, not some "well, in theory it *should* be possible" numbers that overlook massive problems on real hardware.
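
For anyone who wants to redo the arithmetic, here is a quick sketch in Python. The variable names are made up for illustration, the figures are the ones quoted above, and the 6-core number assumes perfectly linear scaling across cores, so treat it as an upper bound rather than a measurement.

```python
# Figures quoted in this post; names are illustrative only.
cpu_core_sha256_mhs = 3.0    # one 3GHz K10 core, double-sha256
gpu_sha256_mhs = 360.0       # stock HD6970, low end of the 360-400 range

# double-sha256: one CPU core vs. the whole GPU
gpu_vs_core_sha256 = gpu_sha256_mhs / cpu_core_sha256_mhs

# scrypt(1024,1,1): measured factor, one CPU core vs. the whole GPU
gpu_vs_core_scrypt = 5.2

# ASSUMPTION: six cores scale linearly (best case for the CPU)
hexcore_vs_gpu_scrypt = 6 / gpu_vs_core_scrypt

print(gpu_vs_core_sha256)               # 120.0
print(round(hexcore_vs_gpu_scrypt, 2))  # 1.15
```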

So... where is any written performance info for GPUs doing scrypt variants? The original paper only deals with VLSI, any citations?

So I went for the "well, to get some numbers I'll actually have to try to write an optimized implementation" approach.

So... maybe your GPU is 100 times faster than a CPU executing pure ALU code, but to do that you actually need to... like... store the per-thread state somewhere?
So, where to put 128KiB (or 32 KiB using a factor 4 time/space tradeoff) of memory for V *per thread*?

Registers? Way too small.
LDS? Works, but you only get 64KiB/CU, so you end up with a max of 2 threads/CU even with a factor 4 time/space tradeoff.
GDS? Way too small again.
L2 read cache? You get 512KiB of that, so with a /4 reduction you can keep 16 threads' worth of V arrays there, but that's not really helping a whole lot, for reasons I'll explain shortly.
External GDDR5? Completely horrible for randomly reading 128 byte items all over the place.
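
The sizes in that list can be checked with a few lines of Python. This is only a sketch: the helper name v_bytes is made up, and it uses the standard scrypt layout of N blocks of 128*r bytes for V. Only the LDS and L2 sizes stated above are used, since I haven't given figures for the other tiers.

```python
KIB = 1024

def v_bytes(n, r, tradeoff=1):
    # scrypt's V array holds N blocks of 128*r bytes; a time/space
    # tradeoff of `tradeoff` keeps only every tradeoff-th block.
    return 128 * r * n // tradeoff

full = v_bytes(1024, 1)                  # scrypt(1024,1,1), no tradeoff
reduced = v_bytes(1024, 1, tradeoff=4)   # factor 4 time/space tradeoff

lds_per_cu = 64 * KIB    # LDS size per CU, as stated above
l2_total = 512 * KIB     # L2 read cache, as stated above

threads_per_cu_in_lds = lds_per_cu // reduced
threads_in_l2 = l2_total // reduced

print(full // KIB, reduced // KIB)             # 128 32
print(threads_per_cu_in_lds, threads_in_l2)    # 2 16
```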

Now, ATI GPUs are funny beasts: they use something best described as "hyperthreading", similar to a Niagara CPU, to hide their 4-clock instruction latency. Which means you need to be executing 64 threads in parallel to make full use of a CU (16 VLIW4 cores * 4 deep pipeline).

Best option so far is "use LDS, live with only being able to run 2 threads/CU (due to LDS size)"... well... thanks to what I explained above, that means you're effectively getting the ALU throughput of half of one VLIW4 core per CU per clock.

So a whole 6970 ends up effectively performing as if you only had 12 fully utilized VLIW4 cores.
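
Spelled out as a back-of-the-envelope calculation (a sketch, not a profiler result): the CU count of 24 is not stated anywhere above. It is the published figure for the HD6970 and it is what the "12 cores" number implies.

```python
vliw4_cores_per_cu = 16
pipeline_depth = 4
threads_needed_per_cu = vliw4_cores_per_cu * pipeline_depth   # 64

threads_running_per_cu = 2    # limited by LDS size, see above
utilization = threads_running_per_cu / threads_needed_per_cu  # 1/32

effective_vliw4_per_cu = vliw4_cores_per_cu * utilization     # 0.5

cu_count = 24   # ASSUMPTION: HD6970 CU count, not given in this post
effective_vliw4_total = effective_vliw4_per_cu * cu_count

print(effective_vliw4_per_cu, effective_vliw4_total)   # 0.5 12.0
```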

Well, turns out that using this approach in the real world, at 880MHz core clock and with overhead, it actually manages roughly 5.2 times the performance of a single 3GHz K10 core.

So, if you have any constructive suggestions on what other approaches to try to wrangle better performance out of an AMD GPU for this variant of scrypt, I'm listening.

Inb4 obvious: the first thing I tried was the KISS "just allocate a huge buffer for V arrays in global memory, run 64-256 threads and have the mem and L2 controllers figure it out" approach, also with time/space tradeoff factors of 2 to 16. It ends up slower than going with LDS.
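
For anyone unfamiliar with the time/space tradeoff mentioned throughout this post, here is a minimal Python sketch of the idea. It is not my GPU code and not real scrypt: mix() is a SHA-512 stand-in for scrypt's BlockMix, and the function names are invented for illustration. It only shows that keeping every k-th V entry and recomputing the rest on demand gives the same result with 1/k of the memory, at the cost of extra mixing rounds.

```python
import hashlib

def mix(block: bytes) -> bytes:
    # ASSUMPTION: stand-in for scrypt's BlockMix, NOT the real function.
    return hashlib.sha512(block).digest()

def integerify(block: bytes, n: int) -> int:
    # Toy version: pick an index in [0, n) from the first 4 bytes.
    return int.from_bytes(block[:4], "little") % n

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def romix(block: bytes, n: int, tradeoff: int = 1) -> bytes:
    stored = []            # holds V[0], V[tradeoff], V[2*tradeoff], ...
    x = block
    for i in range(n):
        if i % tradeoff == 0:
            stored.append(x)
        x = mix(x)
    for _ in range(n):
        j = integerify(x, n)
        v = stored[j // tradeoff]        # nearest stored entry at or below j
        for _ in range(j % tradeoff):    # recompute the missing steps
            v = mix(v)
        x = mix(xor(x, v))
    return x

seed = hashlib.sha512(b"example input").digest()
full = romix(seed, 1024)                 # all 1024 entries kept
quarter = romix(seed, 1024, tradeoff=4)  # only 256 entries kept
print(full == quarter)                   # True
```

With a factor of 4, each lookup costs up to 3 extra mix() calls, which is the "time" half of the tradeoff.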