I don't know of any ready-to-use 256-bit number library for numpy, but it is possible to build one yourself, using 64-bit or 32-bit integers as limbs for the arithmetic operations.
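A minimal sketch of the limb idea, assuming a 256-bit unsigned integer stored as four 64-bit limbs (least-significant first) and showing only addition with carry propagation; the `U256` struct and `add` function are hypothetical names, not from any particular library:

```cpp
#include <array>
#include <cstdint>

// Hypothetical 256-bit unsigned integer: four 64-bit limbs, least-significant first.
struct U256 {
    std::array<std::uint64_t, 4> limb{};
};

// Addition with manual carry propagation; wraps modulo 2^256.
U256 add(const U256& a, const U256& b) {
    U256 r;
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t sum = a.limb[i] + b.limb[i];
        std::uint64_t c1 = (sum < a.limb[i]) ? 1 : 0;  // overflow from a + b
        r.limb[i] = sum + carry;
        std::uint64_t c2 = (r.limb[i] < sum) ? 1 : 0;  // overflow from adding the carry
        carry = c1 | c2;  // at most one of the two can be set
    }
    return r;
}
```

Multiplication, modular reduction, etc. follow the same limb-by-limb pattern, just with more bookkeeping.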
You cannot speed up an individual operation such as a point multiplication just by moving it to a GPU, because a single CUDA core is much slower than a CPU core. To get a performance gain, you need to split the full workload into many independent tasks that can run in parallel.
So what you are suggesting is that it would be easier to use C++ with Boost, implement a multithreading approach, and run everything on the CPU?
That might be the better way to think about it, since I also want to port this to a laptop or other devices that don't have a GPU or the NVIDIA toolkit installed.
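For illustration, a minimal sketch of that CPU-only approach, assuming Boost.Multiprecision's `uint256_t` and plain `std::thread`, with the workload split into independent chunks, one per hardware thread; the `process` function and the input values are placeholders for whatever per-element work (e.g. a point multiplication) you actually need:

```cpp
#include <boost/multiprecision/cpp_int.hpp>
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

using boost::multiprecision::uint256_t;

// Placeholder for the real per-element work (e.g. a point multiplication).
uint256_t process(const uint256_t& x) {
    return x * x;  // dummy workload
}

int main() {
    // Some independent 256-bit inputs; in practice these would be your keys/scalars.
    std::vector<uint256_t> inputs(1000, uint256_t(1) << 200);
    std::vector<uint256_t> outputs(inputs.size());

    // Split the workload into independent chunks, one per hardware thread.
    std::size_t nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (inputs.size() + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(inputs.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                outputs[i] = process(inputs[i]);  // each index is written by exactly one thread
        });
    }
    for (auto& w : workers) w.join();

    std::cout << outputs.front() << "\n";
}
```

The same "independent chunks" structure is what you would also need for a GPU version; here the chunks just map to CPU threads instead of CUDA blocks, so the code runs anywhere a C++ compiler and Boost are available.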