4090 cards have a lot more L2 cache than previous generations.
I'm looking for changes that fit as much data as possible into cache, to speed up Kangaroo and relieve the memory bottleneck it has.
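To get a rough feel for what "fits in cache" means, here is a back-of-the-envelope sketch. It assumes each kangaroo state is 10 64-bit words (x, y and the travelled distance); that layout and the helper name `kangaroosThatFit` are my assumptions for illustration, not taken from the actual code, so plug in the real numbers for your build.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout of one kangaroo state: x (4 words), y (4 words),
// distance (2 words) = 10 x 64-bit words. Adjust to match the real code.
constexpr std::size_t kWordsPerKangaroo = 10;
constexpr std::size_t kBytesPerKangaroo = kWordsPerKangaroo * sizeof(std::uint64_t);

// Number of whole kangaroo states that fit in a cache budget of cacheBytes.
constexpr std::size_t kangaroosThatFit(std::size_t cacheBytes) {
    return cacheBytes / kBytesPerKangaroo;
}
```

For example, `kangaroosThatFit(72ull * 1024 * 1024)` gives the count for a 72 MiB budget (the L2 size usually quoted for the 4090; check it against your own card). In practice the usable share is smaller, since other data competes for the same cache.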
The code definitely needs tweaking to see if there is any pickup in performance.
The way it finds DPs (distinguished points) also needs tweaking.
The best speed I got out of Kangaroo was 7,750 MKey/s on a 4090 and 7,350 MKey/s on an A100, but that was after I had tweaked the way DPs were found.
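For anyone unfamiliar with the DP check, this is a minimal sketch of the usual idea: a point counts as distinguished when the top `dpBits` bits of its x coordinate are zero. This is not my tweak and not the actual Kangaroo source; the function name `isDistinguished` and the limb order (x[3] as the most significant 64-bit word) are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: x holds a 256-bit x coordinate as four 64-bit limbs,
// with x[3] assumed to be the most significant limb.
// Returns true when the top dpBits bits of x are all zero.
// This sketch only looks at the top limb, so dpBits is capped at 64.
bool isDistinguished(const std::uint64_t x[4], unsigned dpBits) {
    if (dpBits == 0) return true;   // every point is distinguished
    if (dpBits > 64) dpBits = 64;   // cap: only the top limb is inspected
    const std::uint64_t mask =
        (dpBits == 64) ? ~0ULL : (~0ULL << (64 - dpBits));
    return (x[3] & mask) == 0;
}
```

The check itself is a single mask and compare, so any gain from tweaking it most likely comes from how and when the result is written out to memory rather than from the test.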
I haven't messed with the code since #125 was found, though.
Hope you have success!