This is truly unfortunate… Each conversion of a number into a Bitcoin address requires around 1,700 simple operations. Even if this process could be optimized down to a single basic operation, brute-forcing the 68th range would still take at least six months, even on the most powerful GPUs such as the RTX 4090 or 5090. As far as I know, all existing brute-force programs, such as KeyHunt and BitCrack, use only the CUDA cores of the GPU. However, there is an untapped source of power: the tensor cores, which remain unused. The theoretical performance of the CUDA cores is ~80 TFLOPS for the RTX 4090 and ~100 TFLOPS for the RTX 5090, while the tensor cores offer significantly more: 285 TFLOPS for the RTX 4090 and 400 TFLOPS for the RTX 5090. If tensor cores could be utilized, brute-force calculations could be sped up several times over. In theory, tensor cores can also handle matrix multiplications and similar computations. Currently, the maximum speed achievable on an RTX 5090 with publicly available CUDA-based programs from GitHub is around 9 GKeys/s.
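To put the quoted key rate in perspective, here is a rough back-of-the-envelope sketch. It assumes that "the 68th range" means the 2^67 keys in [2^67, 2^68), and that the key rate stays constant for the whole run. The function names are mine, and only the 9 GKeys/s figure comes from the post above; other rates can be plugged in the same way.

```python
SECONDS_PER_YEAR = 365.25 * 86_400


def range_size(puzzle_bits: int) -> int:
    """Number of keys in range #puzzle_bits, ASSUMED to be [2^(bits-1), 2^bits)."""
    return 2 ** (puzzle_bits - 1)


def full_scan_seconds(puzzle_bits: int, keys_per_second: float) -> float:
    """Time to enumerate every key in the range at a constant rate."""
    return range_size(puzzle_bits) / keys_per_second


def expected_scan_seconds(puzzle_bits: int, keys_per_second: float) -> float:
    """On average the key is found after scanning half of the range."""
    return full_scan_seconds(puzzle_bits, keys_per_second) / 2


if __name__ == "__main__":
    rate = 9e9  # ~9 GKeys/s, the RTX 5090 figure quoted in the post
    full = full_scan_seconds(68, rate) / SECONDS_PER_YEAR
    avg = expected_scan_seconds(68, rate) / SECONDS_PER_YEAR
    print(f"full scan: {full:.1f} years, expected: {avg:.1f} years")
```

The estimate scales linearly with the rate, so any speedup from new hardware paths translates directly into the same factor on the time.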
Tensor cores only do low-precision ops such as 8-bit and 16-bit floats (i.e. extremely low precision for floating-point numbers). So the TFLOPS figures you see apply to those kinds of numbers, not to 32-bit integers/floats. We'd need some Bernstein-level genius to help us make use of them for ECC. They could potentially be put to use to accelerate the inversion, but that requires coming up with a new inversion algorithm: something that works via approximations instead of exact values, to find the inverse faster.
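For context, the inversion mentioned here is presumably the modular inverse in the secp256k1 field, which is needed when adding points in affine coordinates. The sketch below shows exact batch inversion (Montgomery's trick), a well-known way to spread the cost of a single inversion over many values using only extra multiplications. It is not the approximate algorithm speculated about above, and I'm not claiming any particular tool implements it this way; the function names are illustrative.

```python
# secp256k1 field prime: p = 2^256 - 2^32 - 977
P = 2**256 - 2**32 - 977


def mod_inv(a: int, p: int = P) -> int:
    """Exact modular inverse via Fermat's little theorem (valid because p is prime)."""
    return pow(a, p - 2, p)


def batch_inv(values: list[int], p: int = P) -> list[int]:
    """Invert every element using ONE modular inversion (Montgomery's trick).

    All values must be non-zero mod p.
    """
    n = len(values)
    if n == 0:
        return []

    # prefix[i] = values[0] * ... * values[i] mod p
    prefix = [0] * n
    acc = 1
    for i, v in enumerate(values):
        acc = acc * v % p
        prefix[i] = acc

    # Single expensive step: invert the product of all values.
    inv_acc = mod_inv(acc, p)

    # Walk backwards, peeling one factor off the running inverse each step.
    result = [0] * n
    for i in range(n - 1, 0, -1):
        result[i] = inv_acc * prefix[i - 1] % p
        inv_acc = inv_acc * values[i] % p
    result[0] = inv_acc
    return result
```

Every step here needs exact 256-bit modular arithmetic, which is why low-precision hardware cannot simply be dropped in.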
I wasn't aware of these limitations of tensor cores. Now it makes sense why they haven't been used for key enumeration yet. I'm curious: if it were possible to leverage them at least to assist with the inversion, what theoretical speedup could be expected?
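One rough way to frame this question is Amdahl's law: if only the inversion gets faster, the overall gain is capped by the share of runtime the inversion takes. The sketch below is a minimal illustration. The thread gives no figure for that share, so `inversion_fraction` and `inversion_speedup` are hypothetical inputs, not measurements.

```python
def overall_speedup(inversion_fraction: float, inversion_speedup: float) -> float:
    """Amdahl's law: total speedup when only one part of the work is accelerated.

    inversion_fraction: share of total runtime spent on inversion (0..1), HYPOTHETICAL.
    inversion_speedup:  factor by which the inversion alone gets faster (> 0).
    """
    if not 0.0 <= inversion_fraction <= 1.0:
        raise ValueError("inversion_fraction must be between 0 and 1")
    if inversion_speedup <= 0:
        raise ValueError("inversion_speedup must be positive")
    remaining = 1.0 - inversion_fraction
    return 1.0 / (remaining + inversion_fraction / inversion_speedup)


if __name__ == "__main__":
    # Placeholder values only, chosen to show the shape of the curve.
    for fraction in (0.1, 0.3, 0.5):
        for factor in (2.0, 10.0, float("inf")):
            gain = overall_speedup(fraction, factor)
            print(f"fraction={fraction:.1f} speedup={factor} -> overall x{gain:.2f}")
```

Even an infinitely fast inversion cannot give more than 1 / (1 - fraction) overall, so the answer depends mostly on how much of the runtime the inversion really takes.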