p.s. I also have another idea that should work on 4GB cards. The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion. This would use 384MB * 9 = 3456MB, or roughly 3.5GB, but then your first write to any row could write 32 bytes of dummy data along with the 32-byte collision record, which would avoid the read-before-write. You could do the same for the 2nd through 6th writes by filling the even slots before the odd ones. This would reduce the average IO per round to 2^20 * 3 * 64 bytes, or 192MB per round and 1.728GB per iteration (9 rounds). That would be a theoretical max of 130 iterations per second on an RX 470 with a 7Gbps memory clock (224GB/s), which would be around 240 solutions per second. Using 93% of the theoretical limit, a figure taken from eth mining, that would give real-world performance of about 225 sols/s.
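For anyone who wants to check the arithmetic above, here is a rough Python sketch of the estimate. The row count, the factor of 3, the 64-byte transfer size, the 9 rounds and the 93% figure come from the post. The 256-bit bus width and the 1.88 solutions per iteration are my assumptions (the post only implies something near 1.85), so treat the last two numbers as ballpark.

```python
# Back-of-the-envelope check of the IO and throughput figures in the post.
MIB = 2**20

ROWS = 2**20            # rows per table (from the post)
ROUNDS = 9              # rounds per iteration (from the post)
IO_UNITS_PER_ROW = 3    # the factor of 3 in the post's "2^20 * 3 * 64 bytes"
LINE_BYTES = 64         # 32-byte record plus the 32-byte slot next to it

io_per_round_mib = ROWS * IO_UNITS_PER_ROW * LINE_BYTES // MIB   # 192
io_per_iter_mib = io_per_round_mib * ROUNDS                      # 1728

# ASSUMPTION: 256-bit memory bus, so 7 Gbps per pin gives 224 GB/s.
MEM_GBPS = 7
BUS_BITS = 256
bandwidth_gb_s = MEM_GBPS * BUS_BITS / 8                         # 224.0

# The post reads 1728MB as 1.728GB; this line follows that rounding.
iters_per_s = bandwidth_gb_s / (io_per_iter_mib / 1000)

# Same estimate in strict bytes (1728 MiB against 224e9 bytes/s).
iters_per_s_strict = bandwidth_gb_s * 1e9 / (io_per_iter_mib * MIB)

SOLS_PER_ITER = 1.88    # ASSUMPTION: commonly quoted Equihash 200,9 average
EFFICIENCY = 0.93       # fraction of theoretical bandwidth (from the post)

max_sols = iters_per_s * SOLS_PER_ITER
real_sols = max_sols * EFFICIENCY

print(f"IO per round:        {io_per_round_mib} MB")
print(f"IO per iteration:    {io_per_iter_mib} MB")
print(f"Bandwidth:           {bandwidth_gb_s:.0f} GB/s")
print(f"Iterations/s:        {iters_per_s:.1f} (strict bytes: {iters_per_s_strict:.1f})")
print(f"Sols/s, theoretical: {max_sols:.0f}")
print(f"Sols/s, at 93%:      {real_sols:.0f}")
```

With the post's rounding this lands at roughly 130 iterations per second. Counting strictly in bytes it comes out a few percent lower, so the 225 sols/s figure is best read as an upper-end estimate.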
So the 225 would be the max for both the 4GB and the 8GB cards?
Yes. I'm pretty sure that with 3.5GB for the table data, the remaining 0.5GB on a 4GB card would be enough for the row counters and any other small data structures required.
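A quick sketch of the memory budget, in case it helps. The table layout (2^20 rows, 12 slots of 32 bytes, 9 tables) is from the earlier post. The 4-byte row counter is purely my assumption for illustration, since a real miner could pack counters much tighter, and I am taking a 4GB card to mean 4096MB.

```python
# Rough memory budget for one table per round on a 4GB card.
MIB = 2**20

ROWS = 2**20          # rows per table (from the post)
SLOTS_PER_ROW = 12    # 12-slot bins (from the post)
SLOT_BYTES = 32       # 32-byte slots (from the post)
ROUNDS = 9            # one table per round (from the post)
CARD_MIB = 4096       # ASSUMPTION: "4GB" taken as 4096 MiB

table_mib = ROWS * SLOTS_PER_ROW * SLOT_BYTES // MIB   # 384
tables_mib = table_mib * ROUNDS                        # 3456
left_mib = CARD_MIB - tables_mib                       # 640

# ASSUMPTION: one 4-byte counter per row per table (hypothetical layout).
COUNTER_BYTES = 4
counters_mib = ROWS * ROUNDS * COUNTER_BYTES // MIB    # 36

print(f"One table:     {table_mib} MB")
print(f"All tables:    {tables_mib} MB")
print(f"Left on card:  {left_mib} MB")
print(f"Row counters:  {counters_mib} MB (with 4-byte counters)")
```

Even with generous 4-byte counters, the counters take only a small slice of what is left after the tables.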