Board: Beginners & Help
Topic: Re: further improved phatk OpenCL kernel (> 2% increase) for Phoenix - 2011-07-07
by Diapolo on 09/07/2011, 14:19:50 UTC
I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf
Quote
41. What is the difference between 24-bit and 32-bit integer operations?

    24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only once per cycle.

43. Do 24-bit integers exist in hardware?

    No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

    No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.
http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62
Quote
24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Island family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.
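
For reference, a rough, hypothetical sketch of how these built-ins could be used for the array-indexing case the guide mentions (kernel and argument names are made up, and it assumes the indices stay below 2^24):
Code:
// mul24/mad24 only look at the low 24 bits of their integer arguments.
__kernel void copy_2d(__global const uint *src, __global uint *dst, uint stride)
{
    uint row = get_global_id(1);
    uint col = get_global_id(0);
    uint idx = mad24(row, stride, col);   // row * stride + col on the fast path
    dst[idx] = src[idx];
}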

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722
Quote
On the 5800 series, signed mul24(a,b) is turned into
Code:
(((a<<8)>>8)*((b<<8)>>8))
This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s
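
A kernel along these lines is probably what is meant by looping the same operation over and over; this is a reconstruction under that assumption, not the original benchmark code:
Code:
// Loops one multiply variant many times so the instruction cost dominates.
// Swap the marked line between a * b, mul24 on uints and mul24 on ints
// to compare the three variants timed above.
__kernel void mul_loop(__global uint *out, uint a, uint b)
{
    uint acc = a;
    for (int i = 0; i < (1 << 20); i++)
        acc = mul24(acc, b) ^ (uint)i;   // variant under test
    out[get_global_id(0)] = acc;         // store the result so it is not optimised away
}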

So anyhow, what I was thinking was the following:

Current kernel: 1 * 256-bit hash / 32-bit integers = 8 32-bit operations (speed 100%)
Possible kernel: 3 * 256-bit hashes / 24-bit integers = 32 24-bit operations (speed a maximum of 166% [5 times faster, divided by the 3 SHA-256 operations run in parallel])*

* It may actually end up being slower than the current kernel.cl if 32-bit and 24-bit operations are sent as wavefronts at the same time.

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations at once, faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24-bit operations, only mul24(a,b) and mad24(a,b,c), so the 166% speed-up would only be achieved if every SHA-256 operation was covered by these two functions. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx, 67xx-69xx), which is generally what miners are using.
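
To make that concrete, here is a purely hypothetical sketch of what a single 32-bit SHA-256 addition would look like if each state word were split into a 24-bit low limb and an 8-bit high limb; the masking and carry handling is extra work that mul24/mad24 does not help with:
Code:
// Hypothetical split of one 32-bit word into a 24-bit and an 8-bit limb.
typedef struct { uint lo; uint hi; } word24;   // lo = bits 0..23, hi = bits 24..31

// 32-bit addition (mod 2^32) on the split representation.
word24 add24(word24 x, word24 y)
{
    word24 r;
    uint lo = x.lo + y.lo;                       // may overflow into bit 24
    r.lo = lo & 0x00FFFFFFu;                     // keep the low 24 bits
    r.hi = (x.hi + y.hi + (lo >> 24)) & 0xFFu;   // propagate the carry
    return r;
}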

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b,c) or mul24(a,b). But I like thinking outside the box.

Hi Bert, sorry for not directly answering you. I checked the OpenCL 1.1 specs and yes, faster 24-bit integer operations are mentioned there, too.

mul24    (Fast integer function.) Multiply two 24-bit integer values a and b.
mad24    (Fast integer function.) Multiply two 24-bit integer values a and b, then add the 32-bit result to a 32-bit integer c.

But here lies the problem: AFAIK only additions, bit-shifting and other bit-wise operations are used in the current kernel (no multiplications). So at first sight there seems to be no use for them.
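
For illustration, these are the kinds of expressions a SHA-256 round is built from (standard SHA-256 definitions, not code taken from the current kernel):
Code:
// Standard SHA-256 round building blocks: only rotates, XOR, AND and additions.
#define ROTR(x, n)   rotate((uint)(x), (uint)(32 - (n)))   // right-rotate via OpenCL rotate()
#define CH(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define S0(x)        (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
#define S1(x)        (ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25))

// Per round: t1 = h + S1(e) + CH(e,f,g) + K[i] + W[i];  t2 = S0(a) + MAJ(a,b,c);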

Dia