P.S. This design does not "take up too many IOs", taking up too many IOs assumes you don't use some sort of serialization, which is rather absurd.
Well, maybe you could tell me how you would call it to produce an ip core with 288 IOs for a device that has only 167 (or so) soldered on a board with some SDRAM etc ending with 80 Usable user IOs? Btw, you could enligten me wich microcontroller you plan to use that can write 256 bits at once. There is no bottleneck in using some sort of serialization at all. And even if there were, you could always reduce the bandwith requirement by implementing roll-n-times in hardware.