Re: Bitcoin puzzle transaction ~32 BTC prize to who solves it

Quote from: Etar on Today at 07:21:01 PM

Quote from: kTimesG on Today at 03:39:49 PM

On-chip registers are vastly faster than reading and writing GPU global memory, which is what the JLP kernel is built around. The loops are also nested in the wrong order if what we care about is the jumps, not the kangaroos.

You may not agree with this, but it is a really bad idea to dump millions of kangaroos into memory and read/jump/store them just to reach the same efficiency as fewer kangaroos with a correct average jump size. Why? Because fewer, faster kangaroos give the same results, only much faster.

Let's say instead of 128 kangaroos, we launch 4 kangaroos per thread. In this case, each kangaroo makes, say, 32 jumps.
To get the real coordinates after the batch inversion, you need to store the x,y,z coordinates of 32 points for each kangaroo somewhere.
I don't understand how you store at least 128 sets of x,y,z coordinates (4 kangaroos × 32 jumps) in registers. You'll have to use global memory. Shared memory won't fit all this.
Can you explain a little?

Why would you ever need 128 coordinates to be stored, if you only have 4 kangaroos jumping at the same time?

My kernel does a huge number of jumps per kernel call without reading or writing anything to global memory, apart from the initial and final landing spots and the metadata of any DPs found. This needs only a very small amount of actual GPU memory.
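
To illustrate the storage question, here is a minimal Python sketch. It is not the actual kernel, only a toy model, and it assumes the batch inversion runs across the kangaroos at each step rather than across a buffer of stored jumps. Under that assumption only as many points are live as there are kangaroos (4 here), no matter how many jumps each one makes. The names batch_inverse, step, walk and JUMPS, the 8-entry power-of-two jump table, the "x mod 8" jump selection and the starting scalars are all made up for the example.

```python
# Toy model: 4 kangaroos on secp256k1 jumping in lockstep, sharing ONE modular
# inversion per step. All names and parameters here are illustrative assumptions.
P = 2**256 - 2**32 - 977  # secp256k1 field prime
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def inv(a):
    """Modular inverse via Fermat's little theorem (the expensive operation)."""
    return pow(a % P, P - 2, P)

def add(p1, p2):
    """Reference affine add/double, one inversion per call (setup and checks only).
    Does not handle p1 == -p2 (point at infinity)."""
    (x1, y1), (x2, y2) = p1, p2
    if p1 == p2:
        lam = 3 * x1 * x1 * inv(2 * y1) % P
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def mul(k, pt):
    """Double-and-add scalar multiplication for k >= 1."""
    acc = None
    while k:
        if k & 1:
            acc = pt if acc is None else add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def batch_inverse(vals):
    """Montgomery's trick: invert every value using a single call to inv()."""
    prefix = [1]
    for v in vals:
        prefix.append(prefix[-1] * v % P)
    running = inv(prefix[-1])            # the only inversion in the batch
    out = [0] * len(vals)
    for i in reversed(range(len(vals))):
        out[i] = prefix[i] * running % P
        running = running * vals[i] % P
    return out

# Hypothetical jump table: (distance, point) pairs for 1G, 2G, 4G, ..., 128G.
JUMPS = [(2**j, mul(2**j, G)) for j in range(8)]

def step(kangaroos):
    """Advance each (distance, point) kangaroo by one jump.
    Assumes no kangaroo shares an x coordinate with its chosen jump point."""
    picks = [JUMPS[pt[0] % len(JUMPS)] for _dist, pt in kangaroos]
    dxs = [(jump[1][0] - k[1][0]) % P for k, jump in zip(kangaroos, picks)]
    out = []
    for (d, (x, y)), (jd, (jx, jy)), dinv in zip(kangaroos, picks, batch_inverse(dxs)):
        lam = (jy - y) * dinv % P
        x3 = (lam * lam - x - jx) % P
        out.append((d + jd, (x3, (lam * (x - x3) - y) % P)))
    return out

def walk(starts, n_steps):
    """Run len(starts) kangaroos for n_steps jumps each."""
    kangaroos = [(k, mul(k, G)) for k in starts]
    for _ in range(n_steps):
        kangaroos = step(kangaroos)      # old positions are simply overwritten
    return kangaroos

if __name__ == "__main__":
    for dist, point in walk([1000003, 2000003, 3000017, 4000037], 32):
        print(dist, hex(point[0]))
```

Each call to step() overwrites the 4 current positions in place, so 32 jumps per kangaroo cost 32 shared inversions and storage for 4 points, not 128. On a GPU those 4 points and the handful of temporaries are the kind of state that can plausibly stay in registers, which is the argument being made above; the sketch says nothing about real register pressure.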