How can I use it with an RTX 2070 Super, please? I only have a 2070 and I am very interested in testing your work. I tried modifying the parameter settings, but it failed.
As far as I remember, these cards have only 64 KB of shared memory.
Set JMP_CNT to 512 and change 17 to 16 in this line in KernelB:
u64* table = LDS + 8 * JMP_CNT + 17 * THREAD_X;
and recalculate the LDS_SIZE_ constants.
I think that's enough, though maybe I forgot something...
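A rough sanity check of why those two values fit a 64 KB card. This assumes 256 threads per block and that, as the indexing line above suggests, the per-block LDS holds an 8 * JMP_CNT array of u64 jump entries followed by a per-thread table of u64 values (both the thread count and the layout are assumptions, not taken from the sources):

```python
JMP_CNT = 512            # reduced value suggested above
THREADS_PER_BLOCK = 256  # assumption; check the actual launch config
TABLE_U64 = 16           # per-thread table entries, 17 changed to 16

# total shared memory per block, in bytes (u64 = 8 bytes)
lds_bytes = (8 * JMP_CNT + TABLE_U64 * THREADS_PER_BLOCK) * 8
print(lds_bytes)  # 65536, i.e. exactly 64 KB
```

Under these assumptions the modified layout lands exactly on the 64 KB shared-memory limit, which is why both changes are needed together.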
The main issue is not compiling but optimization: my code is not tuned for 20xx and 30xx cards, so it won't reach good speed there.
That's why I don't want to support old cards officially: if I support them but don't optimize for them, you will blame me for their bad speed.
But feel free to modify/optimize the sources for your hardware.

May I ask why my 4060 Ti only reaches a speed of just over 2000 MKeys/s?
CUDA devices: 1, CUDA driver/runtime: 12.6/12.5
GPU 0: NVIDIA GeForce RTX 4060 Ti, 16.00 GB, 34 CUs, cap 8.9, PCI 1, L2 size: 32768 KB
Total GPUs for work: 1
Solving point: Range 76 bits, DP 16, start...
SOTA method, estimated ops: 2^38.202, RAM for DPs: 0.367 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 23.090.
GPU 0: allocated 1187 MB, 208896 kangaroos.
GPUs started...
MAIN: Speed: 2332 MKeys/s, Err: 0, DPs: 345K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m
MAIN: Speed: 2320 MKeys/s, Err: 0, DPs: 704K/4823K, Time: 0d:00h:00m, Est: 0d:00h:02m