Board: Development & Technical Discussion
Topic: Re: Pollard's kangaroo ECDLP solver
Posted by kTimesG on 01/07/2024, 12:48:28 UTC
If no carry after 2nd reduction = all good;
else add "r".

But to check the carry we must do an addition with 0, so we have 5 operations if carry is set, or 1 if unset (plus the condition check).

I wonder what the computational cost of executing a conditional branch is on CUDA, though.

I know that on Intel platforms it would take longer to do a CMP/JNZ than a simple bit shift. So, in case jumps are expensive in CUDA too, maybe something like this would work:

- create a new variable to store the carry, and run UADDC(0,0) on it
- the carry bit will be zero at this point
- do a UMULLO(the r value you quoted, carry), i.e. multiply and store the lower 64 bits. This potentially avoids an if statement, if it is correct
- add the result to the value in the code that we are reducing

In this case, if there's no carry, the above is a no-op and can be optimized out by the kernel, otherwise it reduces to the previous code.
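
In plain CUDA C (leaving the PTX macros aside for a second), I imagine the whole fix would look roughly like this - just a sketch, with fold_carry / FOLD_R as made-up names, assuming 4x64-bit limbs and that the "r" value is 2^256 mod p (0x1000003D1 if this is the secp256k1 field):

Code:
    // Sketch only: fold the final 1-bit carry back into a 4x64-bit result
    // without branching, by adding carry * R, where R = 2^256 mod p.
    __device__ __forceinline__ void fold_carry(unsigned long long r[4],
                                               unsigned long long carry)   // carry is 0 or 1
    {
        const unsigned long long FOLD_R = 0x1000003D1ULL;   // assumed secp256k1 fold constant

        unsigned long long t = carry * FOLD_R;               // 0 or R, no branch
        r[0] += t;
        unsigned long long c = (r[0] < t) ? 1ULL : 0ULL;     // carry out of limb 0
        r[1] += c; c = (r[1] < c) ? 1ULL : 0ULL;             // propagate (at most 1)
        r[2] += c; c = (r[2] < c) ? 1ULL : 0ULL;
        r[3] += c;                                           // no further carry out can occur here
    }

If there is no carry, the multiply produces 0 and nothing changes; if there is one, R gets added exactly once, same as the conditional version.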
 
So what do you think about this?

Looks like you are right. We can also do the fused mul-add directly into r[0]. I tested 3 different versions, running a kernel that only does 256x256-bit multiplications in a very long loop.

I think the IF causes some small overhead. However, it seems to be the fastest (though only marginally faster than v3 below) when the fix is part of the batch addition, but the slowest when benchmarking the multiplication itself. This may be because there are a lot of branches in other parts (like the inverse) and threads diverge a lot anyway during that phase.

Renamed some PTX macros to be more readable.

I think we need a parallel constant-time version for GCD... hard to find one.

Code:
    //  ADD_CARRY_U64(r[3], r512[3], 0ULL);     // wrong - carry ignored
    // set carry flag after last add
    ADD_CARRY_U64_CO(r[3], r512[3], 0ULL);      // r3 = result[3] + carry

    // get carry
    ADD_CARRY_U64(al, 0ULL, 0ULL);              // al = carry

    // v1 - r = r0 + (carry * r); slowest
//    al = PRIME_R * al;                        // al = r * carry
//    ACCUM_ADD_U64_CO(r[0], al);               // r0 += al
//    ACCUM_ADD_CARRY_U64_CO(r[1], 0ULL);       // r1 += carry
//    ACCUM_ADD_CARRY_U64_CO(r[2], 0ULL);       // r2 += carry
//    ACCUM_ADD_CARRY_U64(r[3], 0ULL);          // r3 += carry

    // v2 - if (carry) r0 = r0 + r
//    if (al) {
//        ACCUM_ADD_U64_CO(r[0], PRIME_R);
//        ACCUM_ADD_CARRY_U64_CO(r[1], 0ULL);
//        ACCUM_ADD_CARRY_U64_CO(r[2], 0ULL);
//        ACCUM_ADD_CARRY_U64(r[3], 0ULL);
//    }

    // v3 - multiply-and-add; 46 less SASS ops; fastest
    ACCUM_MUL_ADD_LO_U64(r[0], al, PRIME_R);    // r0 = carry*r + r0
    ACCUM_ADD_CARRY_U64_CO(r[1], 0ULL);
    ACCUM_ADD_CARRY_U64_CO(r[2], 0ULL);
    ACCUM_ADD_CARRY_U64(r[3], 0ULL);
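
For anyone reading along without the macro header: since these are PTX macros, the v3 sequence is essentially the extended-precision mad/addc pattern. A simplified sketch of what such wrappers can look like (illustrative names and definitions, not the exact macros used above):

Code:
    // Rough sketch of inline-PTX wrappers for the v3 pattern (not the actual macros).
    // "l" is the 64-bit register constraint; the CC.CF flag carries between the .cc ops,
    // so these statements are meant to be emitted back-to-back.
    #define MUL_ADD_LO_CO(dst, a, b) \
        asm volatile ("mad.lo.cc.u64 %0, %1, %2, %0;" : "+l"(dst) : "l"(a), "l"(b))

    #define ADD_CARRY_CO(dst, x) \
        asm volatile ("addc.cc.u64 %0, %0, %1;" : "+l"(dst) : "l"(x))

    #define ADD_CARRY(dst, x) \
        asm volatile ("addc.u64 %0, %0, %1;" : "+l"(dst) : "l"(x))

The carry flag is the only state shared between the statements, which is why only the carry-out (_CO) steps get the .cc modifier and the last add drops it.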