Post
Topic
Board Hardware
Re: Nanominer - Modular FPGA Mining Platform
by
Inspector 2211
on 12/02/2012, 05:50:25 UTC
wondermine, I wish you the best. I really do.

However, please take a look at the SHA-256 algorithm. http://en.wikipedia.org/wiki/Sha-256
The 32 bit values b, c, d, f, g, and h are trivially derived from the previous round, i.e. copied from a, b, c, e, f, and g, respectively.
The 32 bit value e is derived from the previous round's d, h, e, f, and g (i.e. 5/8th of the previous round's 256 bits are used to derive it).
The 32 bit value a is derived from the previous round's h, e, f, g, a, b, and c - i.e. 7/8th of the previous round's 256 bits are used to derive it.

Now think this through over just one more round. Only four 32 bit values are trivially derived from their "grandfather" round.
The other four 32 bit values are derived from brutal mixing of almost all bits of the grandfather round.

And so on.

After just 4 rounds, a single bit change in the great-great-grandfather round influences ALL bits of the current round.

Thus, any notion of shaving more than 4 or 5 rounds off the 128 total rounds is a pipe dream.

In other words, speeding up an implementation of SHA-256 cannot be done by mathematical tricks.

Rather, the operations of each round should be optimized.

There is no real reason why the clock is a measly 200 MHz (and thus the clock cycle 5 ns) in the best currently available implementations,
such as the ZTEX implementation. Think about it: 5 ns, that is a delay straight from the 70s. A TTL technology-like delay. Certainly we can do better than that?!?

Analyzing the operations for their contribution to the delay yields:
rightrotate ... instant, no delay at all
xor ... minor delay, bit by bit, probably a few dozen picoseconds
and ... minor delay, bit by bit, probably a few dozen picoseconds
not ... minor delay, bit by bit, probably a few dozen picoseconds
+ ... this should be scrutinized. A 32 bit add operation can be quite costly and the fastest possible implementation should be pursued.
      Adding insult to injury, SHA-256 features not just binary or ternary adds, but 4-fold adds (in the t1 function) and 5-fold adds
      (e := d+t1) and 6-fold adds (a := t1 + t2).

So, there you go. The biggest detriment to performance is probably the 6-fold 32 bit wide add in a := t1 + t2.
If you can speed this operation up, maybe by pre-computing partial results in the PREVIOUS round, then bringing them to the table when needed, the entire SHA-256 will be sped up (assuming optimal placing and routing).

I'm very familiar with the SHA-256 algorithm and understand its complexity.  And I don't mean from reading Wikipedia.  What you fail to capture here is that I'm not looking at this to know exact values to avoid...  it's a matter of probability.  Why does SHA-256 use 64 opaque rounds? Precisely so something like this does not happen.  It might be a waste of time and resources if checking a value wasn't so resource-friendly, but it's not.

As far as "more standard" optimizations, I already mentioned I would do that.  And I know adding is the biggest resource hog here.  I'm looking into the best ways to do that, precalculation, DSP chips on custom boards (to offload logic that does not require programming and other benefits)...  I may be a student but I'm not exactly lacking in mathematical or engineering understanding.

The 200MHz issue... Actually there is a reason we're down in the "70s" range of timing.  Optimizing for high clockability is a huge challenge. It is a problem, also something I'm already looking into.  I'm looking into quad-pumping and other technologies that major manufacturers use.  I'm going to take it from your knowledge of some of this that you understand how hard it is to come by clean clocking, and then making sure that doesn't become unstable.  Have you looked at commercial IP for SHA or other block ciphers? They don't run much higher than 200MHz stably, and they cost thousands of dollars and have had many engineers optimize the hell out of them.

So all that to say, yes I know what routes of action to take.  I also won't say I'm looking into something unless I believe it's feasible and have done adequate research.

>to offload logic

You can't offload logic.
Not enough IO pins on an FPGA.
There are seven additions per round, 125 rounds - do you want to connect 875 external adders?
These 875 external adders would need 96 pins (64 inputs, 32 outputs) connected to the FPGA each, for a grand total of 84,000 pins...