Marc, have you considered fixing NR_ROWS_LOG at 20 and clearing cnt every round? Each row would then contain only entries that collide in that round, so there would be no need to search within a row for collisions. NR_SLOTS/OVERHEAD could then be reduced significantly.
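To illustrate what I mean by not searching, here is a rough sketch in plain C (not the actual kernel code). It assumes the row index already covers all the bits that must collide in a round, so every pair of slots in a row is a collision. The names `pair_t` and `emit_row_pairs` and the value of NR_SLOTS here are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 8 /* hypothetical per-row capacity, for illustration only */

typedef struct {
    uint32_t a; /* index of the first slot in the pair */
    uint32_t b; /* index of the second slot in the pair */
} pair_t;

/*
 * Assumption: all slots in a row already share the round's collision bits,
 * because those bits are exactly the row index. Every pair (i, j) with
 * i < j is therefore a collision and slot contents are never compared.
 * 'out' must have room for NR_SLOTS * (NR_SLOTS - 1) / 2 pairs.
 */
static size_t emit_row_pairs(uint32_t cnt, pair_t *out)
{
    size_t n = 0;

    if (cnt > NR_SLOTS)
        cnt = NR_SLOTS; /* the counter may have overshot the row capacity */

    for (uint32_t i = 0; i < cnt; i++) {
        for (uint32_t j = i + 1; j < cnt; j++) {
            out[n].a = i;
            out[n].b = j;
            n++;
        }
    }
    return n;
}
```

A row with cnt slots yields cnt * (cnt - 1) / 2 pairs, with no comparison of the stored bits.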
To avoid the time penalty of clearing the cnt values, instead of calling kernel_init_ht every round you could zero the count right after checking the row. Since the DRAM page is already open for reading the cnt value, the cost of a write back to the same page (in fact, the same 64-byte cache line) is minimal.
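Roughly, the read-then-zero idea could look like the following sketch in plain C. The `row_t` layout (counter stored at the start of the row), the helper name `take_row_count`, and the NR_SLOTS value are assumptions for illustration, not the real data structures. It also assumes a single thread owns the row while it is being checked and that nothing inserts into this table during the same round.

```c
#include <stdint.h>

#define NR_SLOTS 8 /* hypothetical per-row capacity, for illustration only */

/* Assumed layout: the counter lives at the start of the row it describes. */
typedef struct {
    uint32_t cnt;
    uint32_t slots[NR_SLOTS];
} row_t;

/*
 * Read the row's slot count and reset it in the same access sequence,
 * so the row is ready for reuse without a separate clearing pass.
 * Returns the count clamped to the row capacity.
 */
static uint32_t take_row_count(row_t *row)
{
    uint32_t cnt = row->cnt; /* this read brings the cache line in */
    row->cnt = 0;            /* write back to the same line */
    return cnt < NR_SLOTS ? cnt : NR_SLOTS;
}
```

The caller would use the returned count to process the row, and the table would be clean by the time it is written to again.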