Post
Topic
Board Hardware
Re: Official Open Source FPGA Bitcoin Miner (Smaller Devices Now Supported!)
by
magik
on 26/07/2011, 19:38:56 UTC
hrm.... yeah, been doing more testing... and it seems like I have high LUT usage because some of the "RAM" is being inferred as LUTs?

do you get any of these messages when you compile?
Quote
INFO:Xst:3218 - HDL ADVISOR - The RAM will be implemented on LUTs either because you have described an asynchronous read or because of currently unsupported block RAM features. If you have described an asynchronous read, making it synchronous would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
    -----------------------------------------------------------------------
    | ram_type           | Distributed                         |          |
    -----------------------------------------------------------------------
    | Port A                                                              |
    |     aspect ratio   | 64-word x 32-bit                    |          |
    |     weA            | connected to signal                 | high     |
    |     addrA          | connected to signal                 |          |
    |     diA            | connected to signal                 |          |
    |     doA            | connected to signal                 |          |
    -----------------------------------------------------------------------

really odd.... it's not happening to all of the sha_transform modules though... it only seems to be one - the 2nd one, with NUM_ROUNDS set to 61, it appears
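For reference, the difference XST is complaining about in that message looks roughly like this (a minimal sketch with made-up module/signal names, not the actual miner code):

```verilog
// Asynchronous read: address-to-data with no clock in between, so XST
// can only build this out of distributed (LUT) RAM:
module rom_async (input [5:0] addr, output [31:0] dout);
    reg [31:0] mem [0:63];
    assign dout = mem[addr];      // combinational read -> LUTs
endmodule

// Synchronous read: registering the output is what lets XST consider
// block RAM for the same storage:
module rom_sync (input clk, input [5:0] addr, output reg [31:0] dout);
    reg [31:0] mem [0:63];
    always @(posedge clk)
        dout <= mem[addr];        // clocked read -> block-RAM eligible
endmodule
```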


also, I see things like this when it's synthesizing:
Quote
   Found 6x6-bit multiplier for signal created at line 120.
    Found 6x32-bit multiplier for signal created at line 127.
line 120 is:
Quote
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];
line 127 is:
Quote
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
hrm... there has to be a better way to use those generate blocks so these values get evaluated at synthesis time as constant integers (or as lookup tables/muxes), instead of as signals/wires computed at runtime...

edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
Quote
`ifdef USE_RAM_FOR_KS
         if ( LOOP == 1) begin
            assign K = Ks_mem[ i ];
            assign K_next = Ks_mem[ i + 1 ];
         end else begin
...
I think the problem is that K and K_next are not assigned in a clocked block, so they become asynchronous combinatorial logic - and XST can't map that to a ROM?  Or maybe it's the use of a multiplier output as an address selector?  Something in there XST wasn't liking for me.
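Something that might work for the LOOP > 1 case too (untested - the register names and the 64-entry ROM depth are my assumptions, not the actual source) is to register both the address computation and the read, so XST sees a synchronous ROM instead of an asynchronous read through a multiplier:

```verilog
// Hypothetical sketch: pipeline the K lookup so the ROM read is
// synchronous. This adds a cycle of latency, so the surrounding
// pipeline would need re-timing to match.
reg [5:0]  k_addr;
reg [31:0] K_reg;
always @(posedge clk) begin
    k_addr <= (NUM_ROUNDS/LOOP)*cnt + i;  // registered address computation
    K_reg  <= Ks_mem[k_addr];             // synchronous read -> ROM-mappable
end
```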

also, it seems the 1st round synthesizes much differently?
for the first sha block I get this:
Quote
   Summary:
   inferred  10 Adder/Subtractor(s).
   inferred 551 D-type flip-flop(s).
   inferred  17 Multiplexer(s).
Unit synthesized.

for the 2nd block I get this:
Quote
   Summary:
   inferred  62 RAM(s).
   inferred   2 Multiplier(s).
   inferred  63 Adder/Subtractor(s).
   inferred 295 D-type flip-flop(s).
   inferred  17 Multiplexer(s).
Unit synthesized.

why are these so different!?

first off, are they sharing the RAM for the K's?  It seems only the K's for the 2nd block are generated, but Xilinx might be optimizing across the hierarchy here.  But what about the # of adders/subtractors!? only 10 in the first block? how can that be?  or is it shifting the adders out of the digester up into the higher-level module?


I also see this:
Quote
Synthesizing Unit .
    Related source file is "e:/bitcoin/lx150_makomk_test/hdl/sha256_transform.v".
        LENGTH = 8
WARNING:Xst:3035 - Index value(s) does not match array range for signal , simulation mismatch.
which relates to this shift register code:
Quote
      reg [31:0] m[0:(LENGTH-2)];
      always @ (posedge clk)
      begin
         addr <= (addr + 1) % (LENGTH - 1);

now when I look at that, I'm not sure if it's correct, so let's say LENGTH = 8.  The first line says create a 32-bit register array with (8-2)+1 = 7 elements, but the addr modulus wraps around at 7 - e.g. once ( addr + 1 ) == 7, addr becomes 0, not 7.  So we are missing the last element of the shift register.

I think this is just an indexing problem - LENGTH = 8 means 8 elements in the shift register, so you want reg [31:0] m[0:7], or reg [31:0] m[0:(LENGTH-1)].  Then, in the addr assignment below, you would want addr <= ( addr + 1 ) % LENGTH, because with LENGTH = 8, x % 8 always returns a value between 0 and 7 inclusive.

Not sure how this is even working with one of the shift registers effectively 1 element short....
edit: seems if I "fix" this, it breaks it heh..... I need to look into this
ok, another edit/update: it seems this code is actually correct, because there's also a 32-bit register r in there that's separate from the m storage array - so the total delay still comes out to LENGTH stages.  That also explains the different synthesis for this module: it's using a RAM, the 32-bit register r, a 3-bit addr register, and a 9-bit adder for the next address, as opposed to just LENGTH*32 registers/FFs for the other style of shift register... not sure which one is better here
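For anyone following along, here's roughly how I now read that RAM-based shift register (a sketch - the module wrapper and port names are mine, not the actual source):

```verilog
// Circular buffer of LENGTH-1 words plus a separate output register r.
// The addr counter deliberately wraps modulo LENGTH-1; r adds the
// final stage, for a total delay of LENGTH clocks.
module shifter_sketch #(parameter LENGTH = 8) (
    input clk,
    input [31:0] din,
    output reg [31:0] r          // output register, separate from m
);
    reg [31:0] m [0:LENGTH-2];   // LENGTH-1 words of storage
    reg [2:0] addr = 0;          // 3 bits suffices for LENGTH = 8
    always @(posedge clk) begin
        addr    <= (addr + 1) % (LENGTH - 1);
        r       <= m[addr];      // read the oldest word...
        m[addr] <= din;          // ...then overwrite it with the new one
    end
    // din written at cycle t is read back at t + (LENGTH-1) and lands
    // in r one edge later, so r lags din by exactly LENGTH stages
endmodule
```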


on another note, I placed 2 cores ( 4 sha256 transforms ) into the design; it said I was using 140% of the LUTs, but it's still trying to route it right now?  It's been running for over 12 hours though....