Thank you for taking the time to respond!
While I do admire that you had the skills and resources to break three ECDLP instances in a row, with your expertise you know very well that everything in programming is a tradeoff. I still stand by all my previous comments on this: cycle handling slows down the jumps. Put another way: even a very fast, optimized cycle-handling kernel such as yours (much faster than some random JLP reference fork) can be made to run faster if the resources spent on cycle handling are traded for more jumps. The question at the end of the day is: from what point on is it worth having a low "K" with slow jumps rather than a higher "K" with faster jumps? And yes, I did manage to reach 13.4 G/s on an RTX 5090 without even compiling natively for compute capability 12.0, so the question is even more interesting now.
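To make the tradeoff concrete, here is a minimal Python sketch, assuming the usual cost model in which the expected number of jumps is K times the square root of the range width. The function names are hypothetical and the numbers in the usage lines are made-up placeholders, not measurements from either kernel.

```python
import math


def expected_seconds(k, range_bits, jumps_per_second):
    """Expected run time under the assumed model: K * sqrt(2^range_bits) jumps."""
    return k * math.sqrt(2.0 ** range_bits) / jumps_per_second


def break_even_speed(k_low, speed_low, k_high):
    """Speed the higher-K variant needs to match the lower-K variant's run time."""
    return speed_low * k_high / k_low


def faster_variant_wins(k_low, speed_low, k_high, speed_high):
    """True if the higher-K, faster-jump variant has the shorter expected run time."""
    return k_high / speed_high < k_low / speed_low


# Placeholder inputs only: a low-K kernel at 8 G/s against a kernel with twice the K.
print(break_even_speed(1.0, 8.0e9, 2.0))             # speed needed to break even
print(faster_variant_wins(1.0, 8.0e9, 2.0, 13.4e9))  # does 13.4 G/s make up for it?
```

Under this model the range width cancels out, so only the ratio K / speed matters: dropping cycle handling pays off exactly when the speed gain is larger than the increase in K.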
I was more interested in one of your older replies, about an optimized version not even being twice as fast as RCKang, which hinted that you had perhaps managed to reach 14 G/s on an RTX 4090. That would have been fascinating, considering that the public version can't reach 8 G/s.
Currently I get about 12.8 GKeys/s on a 4090. The 5090 is a disappointment, so I am skipping it and waiting for the next generation.
Perhaps I will make all my sources public when #135 is solved, though I'm not sure. People are not interested in what I do, and I see zero good discussions about EC on this forum, so I would rather spend my time on more interesting things.
