Post
Topic
Board Beginners & Help
Re: further improved phatk OpenCL kernel (> 2% increase) for Phoenix - 2011-07-07
by
1MLyg5WVFSMifFjkrZiyGW2nw
on 09/07/2011, 22:44:39 UTC
Yea, I was thinking the same, but then I thought that there may be some smart way to rearrange some of the SHA-256 algorithm to change simple bit shifts, exclusive-ors and addition into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of way that Laplace can be used to solve differential equations by transferring everything into the S-domain, solve with addition and subtraction and then transfer back for the answer.

While not a math guru, I am certain this can't be done. The algorithm uses these kinds of operations:

- rotate or shift bits right by three different numbers of places, then XOR together
- select bits from one of two values, depending on bits in a third
- majority of bits set/clear in three values
- addition of the result of these operations and constant values for each round

you can build multiplication out of these operations if combined in a certain way, but the SHA-256 algorithm does not use them like this. If (parts of) SHA were equivalent to something as simple as multiplication, I'd say it could be broken in no time.

Also, SHA256 uses 32 bit values for everything. You could of course implement it on an 8 bit machine, but this would make it much slower. And having 24 bit wide registers does not even mean you could run three 8 bit ops at the same time

Another thought I had:
Is aggressive loop unrolling really helping performance? At least for FPGAs, I guess that lots of very small units that maybe do one hash every 64 clock cycles could be better than a much bigger unrolled design, and the same could be true for GPUs. Was this already tested or did everyone start with the assumption that unrolling is the way to go?