Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.
I don't think this is correct. See the diagram below from the JEDEC GDDR5 specification showing gapless reads from a single row; half the command slots are NOPs (particularly at time=T1). These slots can be used to send ACTIVATE commands to other banks. So as long as you distribute your workload across the chip's banks (and observe that annoying tFAW, or find a chip that doesn't really need it), you can indeed do totally random reads at the full pin bitrate. Using AUTO PRECHARGE eliminates the explicit PRECHARGE command from the command bus, so the only requirement is that each read is at least 32 bits * 8-word burst = 256 bits = 32 bytes.
In particular, the command:burst ratio is 2:1 for 8-word bursts, not 1:1. Maybe you missed the fact that the address pins are DDR in GDDR5?
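To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one command slot per CK cycle and four data words per pin per CK cycle; those two constants are my assumptions for illustration and should be checked against the timing diagrams in the spec.

```python
# Back-of-the-envelope check of the command:burst ratio.
# Assumptions (not taken from the text above): one command slot per
# CK cycle, and four data words per pin per CK cycle.
BUS_WIDTH_BITS = 32      # data pins on one GDDR5 chip
BURST_LENGTH = 8         # words per burst
WORDS_PER_CK = 4         # assumed data words per pin per CK cycle
COMMANDS_PER_CK = 1      # assumed command slots per CK cycle

# Smallest read that still keeps the data pins fully busy.
min_access_bits = BUS_WIDTH_BITS * BURST_LENGTH
min_access_bytes = min_access_bits // 8

# Command slots that elapse while one burst is on the data pins.
ck_per_burst = BURST_LENGTH // WORDS_PER_CK
command_slots_per_burst = ck_per_burst * COMMANDS_PER_CK

# A random read needs ACTIVATE plus READ with AUTO PRECHARGE.
commands_per_random_read = 2

print(min_access_bits, min_access_bytes)
print(command_slots_per_burst, commands_per_random_read)
```

Under those assumptions the two numbers on the last line match: a random read costs two commands, and one burst leaves room for exactly two commands.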
I have implemented a system that does exactly this (stealing that NOP slot to ACTIVATE a different bank) on DDR2, and it works. Granted, DDR2 is not GDDR5, but the idea that an ACTIVATE-READ command pair ought to use the same amount of command bus time as the data it procures is something JEDEC has worked to preserve across many generations of memory with a wide variety of burst lengths and timings. I don't think GDDR5 would depart from this lightly.
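The slot-stealing idea can be sketched as a toy schedule. This is not the DDR2 implementation mentioned above, just an illustration: the function `interleave`, the parameter `trcd_slots`, and the `ACT`/`RDA` labels are hypothetical names, and the model deliberately ignores tFAW, tRRD, tRC, and same-bank conflicts.

```python
def interleave(banks, trcd_slots=3):
    """Toy command-bus schedule for a stream of random reads.

    Even slots carry ACTIVATE, odd slots carry READ with auto
    precharge (RDA). Each RDA is placed at least `trcd_slots`
    command slots after the ACTIVATE for the same access.
    Ignores tFAW, tRRD, tRC and bank conflicts (toy model only).
    """
    # Smallest k such that the ACT-to-RDA gap (1 + 2*k) >= trcd_slots.
    k = trcd_slots // 2
    schedule = {}
    for i, bank in enumerate(banks):
        schedule[2 * i] = f"ACT b{bank}"            # open the row
        schedule[2 * (i + k) + 1] = f"RDA b{bank}"  # read, then auto precharge
    last = max(schedule)
    # Slots nobody claimed are NOPs.
    return [schedule.get(s, "NOP") for s in range(last + 1)]


for slot, cmd in enumerate(interleave([0, 1, 2, 3])):
    print(slot, cmd)
```

In steady state every slot is occupied: each RDA sits in an odd slot and the ACTIVATE for a later access fills the even slot beside it. NOPs appear only while the pipeline fills and drains.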
Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together. GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs. I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.
