Consider: today miners could use centralized devices to generate midstates, but instead the S9/R3 miners have fairly expensive FPGAs with 16MB of attached dual channel DDR2133 to generate the ~2000 midstates per second they need.
Did you mean 16 GB? 16 MB sounds a bit low and would make 4-way collision finding quite expensive.
If it is 16 GB, does this indicate that they planned collision finding in their design? Or is there another reason you need that much memory for mining?
Using the same assumption of collecting collisions for 5 seconds, you need about 45 MHash/s without a commitment header to compute 500 4-way collisions, and 550 MHash/s when the witness is required.
But you can save more by using a longer collection time, e.g. 14 MHash/s without and 180 MHash/s with the witness for 30 s. What is the estimated hash rate of the FPGA?
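
For anyone who wants to check the order of magnitude, here is a rough birthday-style sketch. It is not the calculation behind the figures above. It assumes (my assumptions, not stated in this thread) that a collision must match on 32 bits, that 500 4-way collisions are needed per second of mining, and that every candidate costs the same; it counts candidates rather than individual SHA256 calls, so it does not model the extra cost of the witness commitment. The function names are made up for illustration.

```python
import math

def candidates_needed(collisions, bits=32, k=4):
    """Candidates n such that the expected number of k-way collisions is `collisions`.

    Uses the approximation E ~= n**k / (k! * 2**(bits * (k - 1))) for n candidates
    thrown into 2**bits buckets (assumption: uniformly distributed values).
    """
    return (collisions * math.factorial(k) * 2.0 ** (bits * (k - 1))) ** (1.0 / k)

def rate_mcandidates(collisions_per_second, window_seconds, bits=32, k=4):
    """Candidate rate (millions per second) when collisions are collected over a window."""
    total_collisions = collisions_per_second * window_seconds
    return candidates_needed(total_collisions, bits, k) / window_seconds / 1e6

if __name__ == "__main__":
    for window in (5, 30):
        rate = rate_mcandidates(500, window)
        print(f"{window:>2} s window: ~{rate:.0f} M candidates/s")
```

Under these assumptions the result lands in the same range as the no-commitment figures above, and the required rate falls roughly as the collection time to the power -3/4, which is why a longer window is cheaper.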