You should be able to run the whole process on the GPU so quickly that the communication overhead of any intermediate CPU-GPU steps will noticeably reduce the address generation throughput.
Do you have a source of entropy that you'd like to use? If not, you may even be able to sample the private key on the GPU.
Why not use a built-in RNG? Or you could generate the keys on the CPU (using your favorite method), copy them to the GPU, and then generate the addresses there.
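For the CPU route, here is a minimal sketch in Python of what "generate on CPU, then copy" could look like. It assumes the addresses are derived from secp256k1 private keys (the thread does not say which curve is used), and `generate_key_batch` is a hypothetical helper name, not part of any library. It uses the standard-library `secrets` module and packs the keys into one contiguous buffer so they can be uploaded in a single transfer.

```python
import secrets

# Order of the secp256k1 group.
# Assumption: the addresses are derived from secp256k1 keys; swap in your curve's order otherwise.
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def generate_key_batch(count: int) -> bytes:
    """Return `count` private keys as one contiguous buffer of 32-byte big-endian values."""
    out = bytearray()
    for _ in range(count):
        # randbelow(N - 1) is uniform over [0, N - 2]; adding 1 shifts it to [1, N - 1],
        # which is the valid private key range.
        key = secrets.randbelow(SECP256K1_N - 1) + 1
        out += key.to_bytes(32, "big")
    return bytes(out)


# One buffer of 1024 keys, ready to be copied to the device in a single transfer.
batch = generate_key_batch(1024)
```

Batching matters here: one large host-to-device copy per kernel launch keeps the transfer overhead small compared to copying keys one at a time. How you upload `batch` depends on the GPU API you use.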
Maybe take a look at this project:
https://github.com/bstatcomp/RandomCL
(https://link.springer.com/article/10.1007/s11227-019-02756-2)