Hello, guys!
The ultra-lightweight CUDACyclone is ready. Speed is about 1.3 Gkeys/s on an RTX 4060 and 4.3-4.4 Gkeys/s on an RTX 4090.
Key feature: extremely low VRAM usage, less than 500 MB, which makes it a good fit for rented GPUs.
It is also a good sample to study if you want to learn (why not?). Just 7 small files in total.
Link:
https://github.com/Dookoo2/CUDACyclone
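
For a rough feel of what these speeds mean in practice, here is a small back-of-the-envelope sketch. It is not part of CUDACyclone; the function name `seconds_to_scan` and the 2^40-key range are just assumptions for illustration, and the rates are the ones quoted above.

```python
def seconds_to_scan(num_keys: int, keys_per_second: float) -> float:
    """Time to sweep num_keys candidates at a constant rate (hypothetical helper)."""
    if keys_per_second <= 0:
        raise ValueError("keys_per_second must be positive")
    return num_keys / keys_per_second


# Example range size, chosen only for illustration.
RANGE_SIZE = 2 ** 40

# Rates quoted in the post, in keys per second.
RATES = {
    "RTX 4060": 1.3e9,
    "RTX 4090": 4.3e9,
}

for gpu, rate in RATES.items():
    minutes = seconds_to_scan(RANGE_SIZE, rate) / 60
    print(f"{gpu}: about {minutes:.1f} minutes for 2^40 keys")
```

Real runs will differ a bit, since the rate is not perfectly constant, so treat this only as an estimate.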