The point is that with enough threads running, it is likely that 2^12 of them want to read from the same row. Your algorithm is then bounded not by memory latency but by memory bandwidth, and moreover there is not 1 hash per memory-row buffer load but 2^12 hashes. So your algorithm is actually computation bound and will be thousands of times more power efficient on the ASIC.
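To make the arithmetic concrete, here is a hypothetical back-of-the-envelope sketch. Only the 2^12 thread count comes from the argument above; the row-load latency and per-hash ASIC cost are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope: latency-bound vs compute-bound regimes.
# Only THREADS_PER_ROW (2^12) is from the argument; the timing
# figures below are illustrative assumptions.

THREADS_PER_ROW = 2 ** 12   # threads coalescing on one DRAM row (from the text)
ROW_LOAD_NS = 50.0          # one row-buffer activation + read (assumption)
HASH_NS_ASIC = 1.0          # per-hash time on a dedicated ASIC core (assumption)

# Serial reader: one hash per row load, so row latency dominates.
serial_ns_per_hash = ROW_LOAD_NS + HASH_NS_ASIC

# Massively threaded reader: 2^12 hashes amortize one row load,
# so the hash computation itself dominates.
amortized_ns_per_hash = ROW_LOAD_NS / THREADS_PER_ROW + HASH_NS_ASIC

print(f"serial:   {serial_ns_per_hash:.3f} ns/hash (latency bound)")
print(f"threaded: {amortized_ns_per_hash:.3f} ns/hash (compute bound)")
```

Under these assumed numbers the amortized row-load cost (50/4096 ≈ 0.012 ns) is negligible next to the hash cost, which is the sense in which the workload flips from latency bound to computation bound.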
I stated earlier:
To avoid latency, you could increase the number of memory banks by a factor of 2^7,
but this similarly increases die area, hardware costs, and energy usage.
Your proposal above, of simultaneously running millions of hardware threads and synchronizing them,
incurs much worse increases in die area (to the point of not fitting on any single die),
hardware costs, and energy usage.
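A rough sketch of the die-area comparison, under loudly labeled assumptions: only the 2^7 bank factor and the "millions of threads" figure come from the text; the baseline bank area, per-thread-context area, and the reticle limit are hypothetical or approximate:

```python
# Rough die-area comparison of the two mitigations discussed above.
# Only BANK_FACTOR (2^7) and N_THREADS ("millions") come from the text;
# the per-unit areas are illustrative assumptions.

BASE_BANK_AREA_MM2 = 1.0    # area of one memory bank (assumption)
BANK_FACTOR = 2 ** 7        # bank increase proposed to hide latency

THREAD_AREA_MM2 = 0.01      # area of one hardware thread context (assumption)
N_THREADS = 2_000_000       # "millions" of simultaneous hardware threads

banked_area = BASE_BANK_AREA_MM2 * BANK_FACTOR
threaded_area = THREAD_AREA_MM2 * N_THREADS

# A single lithography reticle is limited to roughly ~800 mm^2, so under
# these assumptions the threaded design cannot fit on any single die.
RETICLE_LIMIT_MM2 = 800.0

print(f"more banks:       {banked_area:,.0f} mm^2")
print(f"million threads:  {threaded_area:,.0f} mm^2")
print(f"fits on one die:  {threaded_area <= RETICLE_LIMIT_MM2}")
```

The exact figures are not the point; the point is the two orders of magnitude separating the two mitigations, which is why the threaded design blows past any single-die budget while extra banks merely inflate it.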