Showing 6 of 6 results by Zoning5264
Post
Board: Development & Technical Discussion
Topic: Re: Pollard's kangaroo ECDLP solver
by Zoning5264 on 20/02/2025, 11:02:12 UTC
Using JeanLucPons/Kangaroo's program:

On my 4060 Ti machine, the graphics card does not run at full capacity: it shows only 650 MKey/s, sits at around 50 °C, and its fan spins at only 1200 rpm.

Does the program need optimization? What is the cause?


Guys, I have a question. Is a performance of over 500 billion steps per minute with Pollard's Kangaroo good or not? What is the best result reported by other forum users?

It's a very good speed if you plan on breaking 135 somewhere between 500 and 900 years from now.

Thanks, I've improved a bit today:

candidate=00000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16 (AVX2=6600000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16) elapsed=2 d:08 h:18 m:19 s total_all=332963840000000

Me too!

candidate=0000000000000000000000719D05AAF5E5E7329CEF66C917D6B51980FA8B2CCE (AVX2=6600000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16) elapsed=0 d:03 h:43 m:07 s total_all=2901745600000000000000

I think mine has improved a bit more, but no doubt you will get there!!

I thought you were a serious guy, but you just forgot to change the AVX2 key when you copied my post.
Post
Board: Development & Technical Discussion
Topic: Re: Pollard's kangaroo ECDLP solver
by Zoning5264 on 19/02/2025, 20:42:41 UTC
Using JeanLucPons/Kangaroo's program:

On my 4060 Ti machine, the graphics card does not run at full capacity: it shows only 650 MKey/s, sits at around 50 °C, and its fan spins at only 1200 rpm.

Does the program need optimization? What is the cause?


Guys, I have a question. Is a performance of over 500 billion steps per minute with Pollard's Kangaroo good or not? What is the best result reported by other forum users?

It's a very good speed if you plan on breaking 135 somewhere between 500 and 900 years from now.

Thanks, I've improved a bit today:

candidate=00000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16 (AVX2=6600000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16) elapsed=2 d:08 h:18 m:19 s total_all=332963840000000

Me too!

candidate=0000000000000000000000719D05AAF5E5E7329CEF66C917D6B51980FA8B2CCE (AVX2=6600000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16) elapsed=0 d:03 h:43 m:07 s total_all=2901745600000000000000

I think mine has improved a bit more, but no doubt you will get there!!

Wow, 210271420289855072 keys/s, great result, congratulations. I can't get more on my 4060.
Post
Board: Development & Technical Discussion
Topic: Re: Pollard's kangaroo ECDLP solver
by Zoning5264 on 19/02/2025, 19:57:52 UTC
Using JeanLucPons/Kangaroo's program:

On my 4060 Ti machine, the graphics card does not run at full capacity: it shows only 650 MKey/s, sits at around 50 °C, and its fan spins at only 1200 rpm.

Does the program need optimization? What is the cause?


Guys, I have a question. Is a performance of over 500 billion steps per minute with Pollard's Kangaroo good or not? What is the best result reported by other forum users?

It's a very good speed if you plan on breaking 135 somewhere between 500 and 900 years from now.

Thanks, I've improved a bit today:

candidate=00000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16 (AVX2=6600000000000000000000000000000076e6eda5e63ddb1b05451e1f6ba3118d16) elapsed=2 d:08 h:18 m:19 s total_all=332963840000000
Post
Board: Development & Technical Discussion
Topic: Re: Pollard's kangaroo ECDLP solver
by Zoning5264 on 18/02/2025, 19:20:29 UTC
Using JeanLucPons/Kangaroo's program:

On my 4060 Ti machine, the graphics card does not run at full capacity: it shows only 650 MKey/s, sits at around 50 °C, and its fan spins at only 1200 rpm.

Does the program need optimization? What is the cause?


Guys, I have a question. Is a performance of over 500 billion steps per minute with Pollard's Kangaroo good or not? What is the best result reported by other forum users?
Post
Board: Tablica ogłoszeń (Announcements)
Topic: Re: Poszukuję dev do stworzenia kryptowaluty (Looking for a dev to create a cryptocurrency)
by Zoning5264 on 10/02/2025, 22:31:54 UTC
Post
Board: Bitcoin Discussion
Topic: Re: Bitcoin puzzle transaction ~32 BTC prize to who solves it
by Zoning5264 on 21/01/2025, 11:46:31 UTC
While we're waiting for the RTX 5090, here's a really fast jumper for 64-bit CPUs.

This is 100% working code: I'm using it to test that my CUDA kernel jumps correctly. I needed it to be as fast as possible so I don't grow old waiting for results to validate.

This uses libsecp256k1's internal headers with inlined basic field and group operations, and does batched addition with non-dependent tree-inversion loops (translation: a good compiler will use unrolling, SIMD, and other CPU instructions to speed things up).

Group operations / second / thread is around 15 - 20 Mo/s (million operations per second) on a high-end Intel CPU.

Compile with "-march=native" for best results.

No, this is not a fully working puzzle breaker. You need to use your brain to add DP logic, saving, and collision checks. This is just the lowest-level building block: a very fast CPU kangaroo jumper for secp256k1.

This function also assumes that a jumped kangaroo can never land on a point in the set of jump points, nor on its negation. My Kangaroo algorithm guarantees this by design, so no point-doubling or point-at-infinity logic is needed at all.

Code:
#include "field_impl.h"                 // field operations
#include "group_impl.h"                 // group operations

#define FE_INV(r, x)        secp256k1_fe_impl_inv_var(&(r), &(x))
#define FE_MUL(r, a, b)     secp256k1_fe_mul_inner((r).n, (a).n, (b).n)
#define FE_SQR(r, x)        secp256k1_fe_sqr_inner((r).n, (x).n)
#define FE_ADD(r, d)        secp256k1_fe_impl_add(&(r), &(d))
#define FE_NEG(r, a, m)     secp256k1_fe_impl_negate_unchecked(&(r), &(a), (m))

static
void jump_batch(
    secp256k1_ge * ge,
    const secp256k1_ge * jp,
    secp256k1_fe * xz,                  // product tree leafs + parent nodes
    secp256k1_fe * xzOut,
    U32 batch_size
) {
    secp256k1_fe t1, t2, t3;

    int64_t i;

    for (i = 0; i < batch_size; i++) {
        uint8_t jIdx;

#if JUMP_FUNC == JUMP_FUNC_LOW_52
        jIdx = ge[i].x.n[0] % NUM_JUMP_POINTS;
#elif JUMP_FUNC == JUMP_FUNC_LOW_64
        jIdx = (ge[i].x.n[0] | (ge[i].x.n[1] << 52)) % NUM_JUMP_POINTS;
#endif

        xz[i] = ge[i].x;
        FE_NEG(t1, jp[jIdx].x, 1);
        FE_ADD(xz[i], t1);                          // XZ[i] = x1 - x2
    }

    for (i = 0; i < batch_size - 1; i++) {
        FE_MUL(xz[batch_size + i], xz[i * 2], xz[i * 2 + 1]);
    }

    FE_INV(xzOut[batch_size * 2 - 2], xz[2 * batch_size - 2]);

    for (i = batch_size - 2; i >= 0; i--) {
        FE_MUL(xzOut[i * 2], xz[i * 2 + 1], xzOut[batch_size + i]);
        FE_MUL(xzOut[i * 2 + 1], xz[i * 2], xzOut[batch_size + i]);
    }

    secp256k1_ge * _a = ge;
    const secp256k1_fe * _inv = xzOut;

    for (i = 0; i < batch_size; i++) {
        uint8_t jIdx;

#if JUMP_FUNC == JUMP_FUNC_LOW_52
        jIdx = ge[i].x.n[0] % NUM_JUMP_POINTS;
#elif JUMP_FUNC == JUMP_FUNC_LOW_64
        jIdx = (ge[i].x.n[0] | (ge[i].x.n[1] << 52)) % NUM_JUMP_POINTS;
#endif

        const secp256k1_ge * _b = &jp[jIdx];

        FE_NEG(t1, _b->y, 1);                       // T1 = -y2
        FE_ADD(_a->y, t1);                          // Y1 = y1 - y2                     m = max_y + 2(1)
        FE_MUL(_a->y, _a->y, *_inv);                // Y1 = m = (y1 - y2) / (x1 - x2)   m = 1
        FE_SQR(t2, _a->y);                          // T2 = m**2                        m = 1
        FE_NEG(t3, _b->x, 1);                       // T3 = -x2
        FE_ADD(t2, t3);                             // T2 = m**2 - x2                   m = 1 + 2(1) = 3(2)
        FE_NEG(_a->x, _a->x, 1);                    // X1 = -x1                         m = max_x + 1
        FE_ADD(_a->x, t2);                          // X1 = x3 = m**2 - x1 - x2         max_x = 3 + max_x + 1
        secp256k1_fe_normalize_weak(&_a->x);

        FE_NEG(t2, _a->x, 1);                       // T2 = -x3                         m = 1 + 1 = 2
        FE_ADD(t2, _b->x);                          // T1 = x2 - x3                     m = 2 + 1 = 3
        FE_MUL(_a->y, _a->y, t2);                   // Y1 = m * (x2 - x3)               m = 1
        FE_ADD(_a->y, t1);                          // Y1 = y3 = m * (x2 - x3) - y2     m = 1 + 2 = 3
        secp256k1_fe_normalize_weak(&_a->y);

        ++_a;
        ++_inv;
    }
}

This is easy to parallelize. Let's add a wrapper that jumps a specific buffer of kangaroos:

Code:
static
void computeBatchJump(
    secp256k1_ge * ge,
    const secp256k1_ge * jp,
    U32 batch_size,
    U32 num_jumps
) {
    size_t tree_sz = (batch_size * 2 - 1) * sizeof(secp256k1_fe);

//    printf("Allocating %zu bytes for tree\n", tree_sz);

    secp256k1_fe * xz_1 = malloc(tree_sz);
    if (NULL == xz_1) return;

    secp256k1_fe * xz_2 = malloc(tree_sz);
    if (NULL == xz_2) {
        free(xz_1);                 // don't leak xz_1 if the second alloc fails
        return;
    }

    for (uint32_t loop = 0; loop < num_jumps; loop++) {
        jump_batch(ge, jp, xz_1, xz_2, batch_size);
    }

    free(xz_1);
    free(xz_2);
}

And now, once you have a really big buffer of kangaroos, you can run the jumps on all of your physical cores:

Code:
#define JUMPS_PER_STAGE   32768

    secp256k1_ge * secp_ge = malloc(numElements * sizeof(secp256k1_ge));
    secp256k1_ge * secp_jp = malloc(NUM_JUMP_POINTS * sizeof(secp256k1_ge));

    // init the jump points, init the kangaroos to your needs
    // ...

    int numLaunches = 1;    // extra multiplier for the total number of jumps
    int numThr = omp_get_max_threads();

    // use the max amount of threads that exactly divides the number of items
    while (numThr > 0 && numElements % numThr) numThr--;

    U64 gePerPart = numElements / numThr;
    printf("\tThreads: %d; elements/thread: %llu\n", numThr, (unsigned long long)gePerPart);

    double ompStartTime = omp_get_wtime();

    for (U32 launchIdx = 0; launchIdx < numLaunches; launchIdx++) {
#pragma omp parallel for
        for (U32 tIdx = 0; tIdx < numThr; tIdx++) {
            U64 offset = tIdx * gePerPart;
            secp256k1_ge * localGE = secp_ge + offset;

            computeBatchJump(localGE, secp_jp, gePerPart, JUMPS_PER_STAGE);
        }
    }

    double ompEndTime = omp_get_wtime();
    elapsedTime = ompEndTime - ompStartTime;
    speed = (double) totalCount / elapsedTime;

Good luck.
Hello kTimesG,

I'm working on a Pollard's Kangaroo implementation for secp256k1 and I'd love to achieve high performance for point arithmetic on the CPU (in particular, large-scale multiplications of G and other points). Could you please share or publish your HPC-optimized code and techniques for kTimesG? I'm especially interested in any optimized field/group operations, batched inversions, or other CPU-level optimizations you've used to speed up these computations.

Thank you,
Zoning5264