You should be able to complete the whole process on GPU so quickly that the communication overhead in any intermediary steps will impact the address generation throughput a lot.
Do you have a source of entropy that you'd like to use? If not, you may even be able to sample the private key on the GPU.
Simply do the EC math to receive a public key, as well as the SHA256 and RIPEMD160 all on GPU.
This is a Python (CPU) version of what I believe you're trying to make; maybe it can help as orientation:
https://github.com/Isaacdelly/PlutusAnd some GPU bruteforcing tools:
https://github.com/bkerler/opencl_brute