Re: [PicoStocks] 100TH/s bitcoin mine [100th]
Board: Securities
by kano on 08/02/2013, 01:45:19 UTC
...
Waiting for a dev board so I can write the cgminer driver for you :)

Edit: of course, contact me if you want any suggestions about the MCU design (not doing the design, just the optimal details of its design)

Well. The chips will work in strings over an SPI protocol using a state machine. This was tested and found to work nicely in the second generation of my FPGA boards. I.e. instead of device addresses I just have a prefix code that triggers the state machine of the devices and allows access to the chain. From a software point of view, sending new jobs and getting results is just feeding a big buffer into SPI while simultaneously reading back the values where the answers will be. This can be done very efficiently in a single thread, even on slow ARM CPUs.

My goal is roughly one ARM CPU per 1200-1500 chips, i.e. 3.6 - 4.5 TH/s, so the code has to be quite efficient to handle that. Requests are also double-buffered (while one job is being processed in a chip, the next job is already pipelined). With an ASIC, unlike an FPGA, processing a job takes about 0.3 - 0.4 milliseconds. This means there should be no less than one communication every 0.15 ms.

Last time, when I tried to adapt cgminer for a much smaller task - 24 Spartans - I had to create 48 threads for double-buffering, which seems like complete nonsense to me. For 1200 chips it simply won't work (2400 threads).

I plan for the code to be structured as an asynchronous state machine for I/O with bitcoind/the pool, using a protocol like Stratum or Luke's getblocktemplate. Second, job generation can be done quickly and synchronously from the template while making up the request buffer for the chips. Then a separate thread for SPI I/O - i.e. prepare the request buffer, spit it out to SPI while simultaneously reading back data, parse the answer buffer, then both send updates to the chips and send results to the network. I think the cgminer codebase is not well-suited for that - a lot of redesign work would be required. However, cgminer's monitoring is nice compared to what I typically write :-)

Yes, the performance and original design of the 'old' FPGA code in cgminer is directly tied to FPGAs (and had no foresight)
To be blunt, serial-USB sucks.
That is why, over the last 2 months, I've been rewriting all of that, getting it ready for ASICs - direct USB - which also gives the option of using any USB I/O the device makes available, not just the simple serial-USB back-and-forth that hides everything else.

The GPU code, on the other hand, is very well designed and handles I/O to a device with MUCH tighter requirements.

Two different people did the original design of those two pieces of code ... ckolivas GPU, Luke-Jr FPGA ... yes I'll stop there and let the code speak for itself.

The current work handling code is based on the idea that a device can only handle one item of work at a time.
ckolivas and I will be rewriting that shortly, since the BFL MCU device has a 20-item work input queue (and the Avalon requires ~24 work items at a time). Dealing with a device that handles more than one item of work at a time will also make it simple to resolve issues such as thread counts.

Now ... 0.3ms is too small IMO - and doubling it to have one job pending - 0.6ms - is also too small IMO (the BFL queue design says 20 work items)
So if your queue only allows one work item waiting in it, then the code still has to hit a target that it is going to be late for (sometimes? often?) due to USB and OS constraints
However, if you are only designing it for in-house use, not a general board to be sold to users, then you can of course optimise the choice of hardware talking to the USB device and thus minimise the problem there.

Anyway, making the MCU queue (or whatever you call it in your device) larger gives the code a much wider target to hit and a much lower chance of letting the queue run empty.
When the queue is empty there is idle time, so making the queue a bit bigger helps ensure maximum performance by reducing the possibility of that idle time.
BFL have specified a queue on both work and replies.

It would also be good to have a secondary I/O endpoint to wait on replies (with a queue there as well)
So: two separate threads, one for sending work (and performing device status requests and handling their replies), and a second one collecting work answers

I am going to BFL in a bit over a week to see what hardware they really have, and hopefully point a new cgminer at it and get results :)
Though I'm a software guy, not a hardware guy ... as should be obvious from the above.