The ASIC can choose to use DRAM instead and amortize the power cost of each row activation over many accesses to the row buffer.
Here's a possible concrete design of what you are hinting at:
The ASIC will have as many (pipelined) siphash circuits as are needed to max out the memory
bandwidth of all the DRAM you're gonna hook up to it.
For each row on each memory bank, the ASIC has a queue of in-row counter-indices.
Each entry in the queue thus represents a "stalled thread".
The siphash circuits fill up these queues at a mostly uniform rate.
Once a queue fills up, it is flushed to one of the memory controllers on the ASIC, which
performs the corresponding atomic updates on the memory row.
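To make the queueing scheme concrete, here is a rough software model of it. It is only a sketch under made-up assumptions: ROW_SIZE, NUM_ROWS and QUEUE_DEPTH are hypothetical numbers, stand_in_hash uses blake2b purely as a placeholder for the siphash circuits, and a Python list plays the role of the DRAM. It counts how many times a row has to be opened when updates are batched per row.

```python
import hashlib
from collections import defaultdict

# All three sizes are hypothetical, chosen only to keep the model small.
ROW_SIZE = 1024      # counters per DRAM row
NUM_ROWS = 64        # rows across all banks
QUEUE_DEPTH = 8      # entries a row queue holds before it is flushed


def stand_in_hash(nonce: int) -> int:
    """Placeholder for a siphash circuit: maps a nonce to a counter index."""
    digest = hashlib.blake2b(nonce.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % (ROW_SIZE * NUM_ROWS)


def simulate(num_nonces: int):
    counters = [0] * (ROW_SIZE * NUM_ROWS)  # models the DRAM contents
    queues = defaultdict(list)              # row -> queued in-row counter indices
    row_activations = 0

    def flush(row):
        # One row activation serves every update queued for that row.
        nonlocal row_activations
        row_activations += 1
        for offset in queues[row]:
            counters[row * ROW_SIZE + offset] += 1
        queues[row].clear()

    for nonce in range(num_nonces):
        row, offset = divmod(stand_in_hash(nonce), ROW_SIZE)
        queues[row].append(offset)          # one more "stalled thread"
        if len(queues[row]) == QUEUE_DEPTH:
            flush(row)

    # Drain whatever is left in partially filled queues.
    for row in list(queues):
        if queues[row]:
            flush(row)

    return counters, row_activations
```

Without the queues every update would open a row by itself, so num_nonces activations. With them, the count drops to roughly num_nonces / QUEUE_DEPTH, plus at most one partial flush per row at the end.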
This looks pretty efficient, but all those queues together actually take up a lot of memory,
which the ASIC needs to access randomly.
So it seems we've just recreated the problem we were trying to solve.
But we can aim for the total queue memory to be maybe 1% of the total DRAM.
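As a back-of-the-envelope check on that 1% figure: since there is one queue per row, the ratio of queue memory to DRAM is just (queue depth x bits per entry) / (bits per row). The helper below works out how deep each queue can be for a given budget. The row size and counter width in the example are assumptions picked for illustration, not numbers from any particular DRAM part.

```python
def max_queue_depth(row_bits: int, counter_bits: int, budget_percent: int):
    """Return (bits per queue entry, max entries per row queue).

    All arguments are hypothetical design parameters:
      row_bits       - size of one DRAM row in bits
      counter_bits   - width of one counter stored in the row
      budget_percent - queue memory allowed, as a percentage of DRAM size
    """
    counters_per_row = row_bits // counter_bits
    # An in-row counter index needs enough bits to address every counter.
    entry_bits = (counters_per_row - 1).bit_length()
    # Queue memory available per row under the budget.
    budget_bits = row_bits * budget_percent // 100
    return entry_bits, budget_bits // entry_bits


# Example: assumed 8 KiB rows (65536 bits), 2-bit counters, 1% budget.
entry_bits, depth = max_queue_depth(65536, 2, 1)
print(entry_bits, depth)  # prints "15 43"
```

So under those assumed numbers each queue could hold about 43 entries, meaning each row activation would be shared by up to 43 counter updates.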
There are a lot of parameters here and many different ways to trade off
speed, energy cost, and hardware cost...