Now, imagine adapting this to the GPU and using AVX2—combining both CPU and GPU power.
An RTX 3060 achieves around 2300 Mkeys/s, and so on.
You are making a wrong assumption here: that you can simply combine the GPU output with CPU calculations. The calculations performed by a GPU stay internal to the device; those speeds are reached with no memory transfers between host and device.
If we remove the hashing step from the CUDA code, the speed is a few times higher: for example, around 25 GK/s just to produce the public keys of a sequential range. Even more if we add the sym/endo points (but these are derived from a single public key anyway, so there is no point doing that on a GPU).
But there is no way you could ever transfer that huge amount of results back to the CPU in order to hash them. You are limited by the memory clock of the GPU, the total memory of the GPU, and the bandwidth of the PCIe lanes. So the end result is a very lousy speed, and even at that speed the volume of data is totally unmanageable for a CPU, no matter how many cores you have.
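To put rough numbers on it, here is a back-of-the-envelope sketch in Python. The 25 GK/s figure is the one from above; everything else is my assumption: 64 bytes per raw public key (x and y, 32 bytes each), 33 bytes if compressed, and roughly 32 GB/s for the bus, which is about the theoretical one-way peak of a PCIe 4.0 x16 link (real transfers come in lower).

```python
# Back-of-the-envelope: can the PCIe bus carry the public keys a GPU produces?
# All constants below are illustrative assumptions, not measurements.

KEYS_PER_SEC = 25_000_000_000        # ~25 GK/s, public keys only (no hashing)
BYTES_UNCOMPRESSED = 64              # raw x and y coordinates, 32 bytes each
BYTES_COMPRESSED = 33                # 1 prefix byte + 32-byte x coordinate
PCIE_BYTES_PER_SEC = 32_000_000_000  # assumed ~PCIe 4.0 x16 theoretical peak


def required_bandwidth(keys_per_sec, bytes_per_key):
    """Bytes per second needed to move every generated key off the GPU."""
    return keys_per_sec * bytes_per_key


def max_keys_over_bus(bus_bytes_per_sec, bytes_per_key):
    """Upper bound on keys per second that the bus can deliver to the host."""
    return bus_bytes_per_sec / bytes_per_key


if __name__ == "__main__":
    for label, size in (("uncompressed", BYTES_UNCOMPRESSED),
                        ("compressed", BYTES_COMPRESSED)):
        need = required_bandwidth(KEYS_PER_SEC, size)
        cap = max_keys_over_bus(PCIE_BYTES_PER_SEC, size)
        print(f"{label}: need {need / 1e9:.0f} GB/s, "
              f"bus allows at most {cap / 1e6:.0f} Mkeys/s")
```

Under these assumptions, the uncompressed keys would need about 1.6 TB/s, while the bus tops out at about 500 Mkeys/s (under 1 Gkeys/s even with compressed keys). That is already below the ~2300 Mkeys/s quoted for doing everything, hashing included, on the GPU, and it is before the CPU has hashed a single key.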
Or did I understand something wrong?