That's why I don't want to support old cards: if I support them officially but don't optimize for them, you will blame me for their poor speed.
But feel free to modify/optimize the sources for your hardware.

I'll be honest, your kangaroo finds the key faster than mine or JLP's. Yes, the reported speed is lower, but in the end it finds the key much faster.
It works even on a 1660 Super (~600 Mkeys/s).
Thanks for sharing.
You can improve it in many ways.
For example, since the L2 cache is useless on old cards, try setting
#define PNT_GROUP_CNT 48
and change these lines in KernelB:
//calc original kang_ind
u32 tind = (THREAD_X + gr_ind2 * BLOCK_SIZE); //0..3071
u32 warp_ind = tind / (32 * PNT_GROUP_CNT / 2); // 0..7
u32 thr_ind = (tind / 4) % 32; //index in warp 0..31
u32 g8_ind = (tind % (32 * PNT_GROUP_CNT / 2)) / 128; // 0..2
u32 gr_ind = 2 * (tind % 4); // 0, 2, 4, 6
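
If you want to see how the index ranges in those comments move when you change PNT_GROUP_CNT, here is a small host-side sketch that replays the same arithmetic on the CPU. It is not part of the kernel. The function name index_ranges and the struct IndexMax are made up for this example, and it assumes THREAD_X runs over 0..BLOCK_SIZE-1 and gr_ind2 runs over 0..PNT_GROUP_CNT/2-1, which is only inferred from the "0..3071" comment and may not match your copy of the sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

typedef uint32_t u32;

// Largest value each index reaches over the whole tind range (hypothetical helper type).
struct IndexMax {
    u32 tind, warp_ind, thr_ind, g8_ind, gr_ind;
};

// Host-side replay of the index arithmetic from the KernelB lines above.
// ASSUMPTION: thread_x covers 0..block_size-1 and gr_ind2 covers
// 0..pnt_group_cnt/2-1; this loop shape is inferred, not taken from the kernel.
IndexMax index_ranges(u32 block_size, u32 pnt_group_cnt) {
    assert(pnt_group_cnt % 2 == 0);
    IndexMax m = {0, 0, 0, 0, 0};
    const u32 per_warp = 32 * pnt_group_cnt / 2; // same divisor as in the kernel lines
    for (u32 gr_ind2 = 0; gr_ind2 < pnt_group_cnt / 2; gr_ind2++) {
        for (u32 thread_x = 0; thread_x < block_size; thread_x++) {
            u32 tind = thread_x + gr_ind2 * block_size;
            u32 warp_ind = tind / per_warp;
            u32 thr_ind = (tind / 4) % 32;
            u32 g8_ind = (tind % per_warp) / 128;
            u32 gr_ind = 2 * (tind % 4);
            // Track the maximum of every index; the minimum is always 0.
            m.tind = std::max(m.tind, tind);
            m.warp_ind = std::max(m.warp_ind, warp_ind);
            m.thr_ind = std::max(m.thr_ind, thr_ind);
            m.g8_ind = std::max(m.g8_ind, g8_ind);
            m.gr_ind = std::max(m.gr_ind, gr_ind);
        }
    }
    return m;
}
```

Under those assumptions, index_ranges(256, 24) gives the same upper bounds as the comments (tind 3071, warp_ind 7, g8_ind 2), while index_ranges(256, 48) gives tind up to 6143 and g8_ind up to 5 with warp_ind still 0..7. So if your build matches the assumed loop, the range comments above describe the 24-group case and anything sized from g8_ind should be checked after the change.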