Marc, have you considered fixing NR_ROWS_LOG at 20 and clearing cnt for each round? Then, in every round, each row would hold only entries that collide with one another, so there would be no need to search within a row for collisions. NR_SLOTS/OVERHEAD could then be reduced significantly.
To avoid the time penalty of clearing the cnt values, instead of calling kernel_init_ht at each round, you could zero each count right after checking its row. Once the DRAM page is already open to read the cnt value, the cost of writing back to the same page (in fact, to the same 64-byte cache line) is minimal.
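Roughly what I have in mind, as a plain C sketch rather than the actual OpenCL kernel. The names `row_cnt` and `consume_row` are made up for illustration, the table is shrunk to 16 rows, and I am assuming that every pair of entries in a row counts as a collision:

```c
#include <assert.h>
#include <stdint.h>

/* Tiny table for illustration only; the thread discusses NR_ROWS_LOG = 20. */
#define NR_ROWS_LOG 4
#define NR_ROWS (1u << NR_ROWS_LOG)

/* Hypothetical per-row slot counters (one cnt per hash table row). */
static uint32_t row_cnt[NR_ROWS];

/*
 * Read a row's counter, then zero it in the same access pattern, so no
 * separate init pass is needed before the table is reused.
 * Returns the number of colliding pairs in the row, assuming every pair
 * of entries sharing a row is a collision: cnt * (cnt - 1) / 2.
 */
static uint32_t consume_row(uint32_t row)
{
    assert(row < NR_ROWS);
    uint32_t cnt = row_cnt[row];   /* read: the line is now open/cached */
    row_cnt[row] = 0;              /* write back to the same location */
    return cnt * (cnt - 1) / 2;    /* yields 0 for cnt == 0 or cnt == 1 */
}
```

With three entries in a row, `consume_row` reports three pairs and leaves the counter at zero, so a second call on the same row reports none.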
Yeah, NR_ROWS_LOG is pretty much always compiled at 20. But I have to offer the other options because people want to mine with GPUs that have very little memory. I don't know if you checked the latest commits, but OVERHEAD has been lowered to 9, so one Equihash instance needs only 1.2 GB, and I recently made the exact change you suggested (clearing the counter once we are done reading it).