As promised, here is the third and final part: RCKangaroo, Windows/Linux, open source:
https://github.com/RetiredC/RCKangaroo
This software demonstrates a fast implementation of the SOTA method and advanced loop handling on RTX40xx cards.
Note that I have not included all possible optimizations because it's public code and I want to keep it as simple and readable as possible. Still, it's fast enough to demonstrate the advantage, and you can improve it further if you have the skills.