We could also consider enumerating multiple nonces and discarding those that don't match a subset of page rows, i.e. processing the cuckoo table in chunks over multiple passes.
As mentioned before, this is exactly what higher values of PART_BITS achieve,
as you can see in cuckoo_miner.h and cuda_miner.cu.
Each trimming round uses 2^PART_BITS passes with the number of counters reduced by the same factor.
With sufficient reduction, you end up using only one row in each memory bank.
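
To make the pass/counter trade-off concrete, here is a minimal Python sketch of one partitioned trimming round. It is not the code from cuckoo_miner.h or cuda_miner.cu: the names `endpoint` and `trim_round` are made up for illustration, blake2b stands in for the real siphash, only one side of the graph is handled, and I'm assuming nodes are partitioned on their low PART_BITS bits.

```python
import hashlib


def endpoint(nonce: int, n_nodes: int) -> int:
    """Hypothetical stand-in for the siphash-based edge endpoint function."""
    digest = hashlib.blake2b(nonce.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_nodes


def trim_round(alive, n_nodes: int, part_bits: int):
    """One trimming round on one side of the graph, done in 2^part_bits passes.

    Assumes n_nodes is a power of two and at least 2^part_bits.
    Each pass only looks at nodes whose low part_bits bits equal the pass
    number, so it needs n_nodes >> part_bits counters instead of n_nodes.
    """
    n_parts = 1 << part_bits
    part_mask = n_parts - 1
    survivors = []
    for part in range(n_parts):
        # Counter array shrunk by a factor of 2^part_bits.
        counters = [0] * (n_nodes >> part_bits)
        # Counting pass: re-enumerate all alive nonces, skip other partitions.
        for nonce in alive:
            node = endpoint(nonce, n_nodes)
            if node & part_mask == part:
                idx = node >> part_bits
                counters[idx] = min(counters[idx] + 1, 2)  # saturate at 2
        # Trimming pass: keep edges whose endpoint has degree >= 2.
        for nonce in alive:
            node = endpoint(nonce, n_nodes)
            if node & part_mask == part and counters[node >> part_bits] >= 2:
                survivors.append(nonce)
    return sorted(survivors)
```

The surviving edge set does not depend on `part_bits`; only the counter memory and the number of times the nonces get re-enumerated change. Calling `trim_round(list(range(64)), 64, 0)` and `trim_round(list(range(64)), 64, 3)` should return the same list, the second using 8 passes over 8 counters each.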