Updated Part #1, v1.4:
- added option to make K better at the range edges (for SOTA and SOTA+) - define BETTER_EDGE_K.
- added option to see interval stats - define INTERVAL_STATS.
- fixed some bugs.
First, Happy New Year to everybody!
RetiredCoder, thank you for sharing your work!
I commented on your repo with a PDF that may be useful for speeding up the code. I also kindly asked you to contact me by email, but you didn't, which is a bit sad; it's up to you, though.
Regarding the RCKangaroo version, I have three questions:
1) Does it use any CPU for DPs?
I ask because I use a rented GPU, and I am not sure I can get a powerful CPU from this provider.
2) Can you explain a bit about the ideal -dp value for small and large puzzles, and its impact in each case?
3) I rent an RTX 4090, but no matter which options I use, it doesn't utilize the full RAM of the GPU... why is that? Is it something on my side, or a bug in your updated version?
Thanks
Hi tmar777,
You should learn a bit about how CUDA (Compute Unified Device Architecture) programming works and how RetiredCoder has implemented it in his code.
In CUDA, you work with the following:
* Kernel: a function that CUDA runs on the GPU.
* Thread: CUDA runs many threads in parallel on the GPU; each thread executes the kernel.
* Block: threads are grouped into blocks, a programming abstraction. Currently a thread block can contain up to 1024 threads.
* Grid: contains the thread blocks.
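To make these terms concrete, here is a minimal CUDA sketch (my own illustration, not code from RCKangaroo) that launches a grid of blocks, where every thread runs the same kernel on a different piece of data:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: a function executed on the GPU by every launched thread.
__global__ void AddOne(int* data, int n)
{
    // Each thread computes its own global index from its block and thread IDs.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1;
}

int main()
{
    const int n = 4096;
    int* d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    // Grid of 16 blocks, each block containing 256 threads (the limit is 1024).
    AddOne<<<16, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```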
If we focus on the software that RetiredCoder has shared on GitHub (https://github.com/RetiredC/RCKangaroo):
There are currently 4 Kernels:
* KernelGen: runs once at the beginning; this kernel calculates the start points of the kangs.
* KernelA: this kernel performs the main jumps.
* KernelB: this kernel counts distances and detects loops of size > 2.
* KernelC: this kernel performs a single jump3 for looped kangs.
KernelA, KernelB and KernelC are then run in a loop. If we ignore KernelGen (since it only runs once at the start), the share of time spent in each is roughly as follows (a simplified sketch of the loop follows the list):
* KernelA: 90%
* KernelB: 4%
* KernelC: 1%
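As a simplified sketch of how that loop is organized (the kernel names follow RCKangaroo, but the arguments and launch configuration here are placeholders, not the real ones):

```cuda
#include <cuda_runtime.h>

// Stub kernels that only mirror the structure; the real kernels in RCKangaroo
// take kangaroo states, jump tables and DP buffers as arguments.
__global__ void KernelGen(unsigned* kangs) { /* calculate start points */ }
__global__ void KernelA(unsigned* kangs)   { /* main jumps (~90% of the time) */ }
__global__ void KernelB(unsigned* kangs)   { /* count distances, detect loops */ }
__global__ void KernelC(unsigned* kangs)   { /* single jump3 for looped kangs */ }

int main()
{
    unsigned* d_kangs;
    cudaMalloc(&d_kangs, 1 << 20);          // placeholder buffer

    dim3 grid(64), block(256);              // placeholder launch configuration
    KernelGen<<<grid, block>>>(d_kangs);    // runs once at the start
    for (int iter = 0; iter < 1000; iter++) // in reality: until the key is found
    {
        KernelA<<<grid, block>>>(d_kangs);
        KernelB<<<grid, block>>>(d_kangs);
        KernelC<<<grid, block>>>(d_kangs);
        // host side: copy out found DPs and check for a collision
    }
    cudaDeviceSynchronize();
    cudaFree(d_kangs);
    return 0;
}
```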
So, you have to understand that the executable itself runs on the CPU, while the CUDA kernels run on the GPU.
Answering your questions:
1) This problem does NOT require much CPU; it is computationally intensive on the GPU. Consider that it would take around 400x RTX 4090 to be able to complete the 129-bit puzzle.
2) To understand DPs and how they affect the search, first you need to know what a DP is. Distinguished point: a point is a distinguished point if its representation exhibits a certain bit pattern, e.g., has the top 20 bits equal to zero.
You have to know that you have X kangaroos that each make Y jumps every second.
Now, if we use a DP of Z bits, then on average only 1 out of every 2^Z points a kangaroo visits will match that bit pattern and be stored as a DP.
The higher Z is, the harder it is to hit a DP.
To learn this in a practical way, I recommend you experiment: if I use a very low DP, what happens to the memory? How many points do I store? Same if I use a very high DP.
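As a concrete illustration (my own sketch, not RCKangaroo's code; X, Y and Z are hypothetical numbers), the DP test and the rough memory math look like this:

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>

// A point is "distinguished" if the top dp_bits of its x-coordinate are zero.
// For illustration, only the top 64 bits of the x-coordinate are passed in.
bool IsDistinguished(uint64_t x_high64, int dp_bits)
{
    return (x_high64 >> (64 - dp_bits)) == 0;
}

int main()
{
    // Hypothetical numbers: X kangaroos, Y jumps per second each, DP of Z bits.
    double X = 1e6, Y = 1e3, Z = 20;

    double points_per_sec = X * Y;                     // total points visited per second
    double dps_per_sec = points_per_sec / pow(2.0, Z); // points stored as DPs per second

    printf("1 in 2^%.0f points is a DP -> ~%.0f DPs stored per second\n", Z, dps_per_sec);

    // Example check: the top 20 bits of this value are zero, so it is a DP.
    printf("IsDistinguished(0xFFF, 20) = %d\n", IsDistinguished(0xFFFull, 20));
    return 0;
}
```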
3) When running any software, these are the main metrics one should look at (there are more, but these are usually the most important):
* CPU usage
* Disk usage (disk I/O: input/output operations)
* RAM usage
With this you get a view of what is happening with the application you are running. In this case, as I already told you, it is a CUDA application, i.e., it makes use of the GPU, so you should also analyze:
* GPU index (starts from 0)
* GPU name
* GPU temperature
* GPU memory usage (used / total); here you can see how the different memories are used
* CUDA core and SM (streaming multiprocessor) utilization
You can use `nvidia-smi` or other tools like:
* nvitop (https://github.com/XuehaiPan/nvitop)
* nvtop (https://github.com/Syllo/nvtop)
* NVIDIA Nsight Compute (https://developer.nvidia.com/nsight-compute)
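Besides those tools, you can also query the GPU programmatically; here is a minimal sketch using the CUDA runtime API (my own example, not part of RCKangaroo) that prints some of the metrics mentioned above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;                        // GPU index (starts from 0)
    cudaSetDevice(dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);  // current free / total device memory

    printf("GPU %d: %s, %d SMs\n", dev, prop.name, prop.multiProcessorCount);
    printf("Memory used: %.2f / %.2f GiB\n",
           (total_bytes - free_bytes) / 1073741824.0,
           total_bytes / 1073741824.0);
    return 0;
}
```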
To sum up: what matters is not how much of the GPU's RAM the program uses, but the amount of work it is processing in the kernels.
I hope this has helped you, and if you have any questions, I am here to help.