we could also consider enumerate multiple nonces and discarding those that don't match a subset of page rows, i.e. process the cuckoo table in chunks with multiple passes.
As mentioned before, this is exactly what higher values of PART_BITS achieve, as you can see in cuckoo_miner.h and cuda_miner.cu
Each trimming round uses 2^PART_BITS passes with the number of counters reduced by the same factor. With sufficient reduction, you end up using only one row in each memory bank.
What I meant before is that it isn't the same tradeoff as the h/w threads because as you said it increases the number hashes computed:
Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor 2^21, over 2 million.
No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.
The point is the ASIC can balance some of the two strategies to find the optimum. The CPU can't effectively leverage either strategy.