I don't know of any ready-to-use NumPy library for 256-bit integers, but it is possible to build one by representing each number as several 64-bit or 32-bit words and doing the math on those.
You cannot speed up individual operations just by moving them to the GPU, because a single CUDA core is much slower than a CPU core. To get a performance gain, you need to divide the whole computation into many independent tasks that can run in parallel.
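As a rough illustration of the idea, here is a minimal sketch of 256-bit addition built from 64-bit words in NumPy. It assumes each number is stored as four `uint64` limbs, least significant first, and that a whole batch of numbers is processed at once. The names `to_limbs`, `from_limbs` and `add256` are made up for this example and do not come from any existing library.

```python
import numpy as np

LIMBS = 4                 # 4 x 64-bit limbs = 256 bits, least significant first
MASK64 = (1 << 64) - 1


def to_limbs(values):
    """Pack Python ints (0 <= v < 2**256) into an (N, 4) uint64 array."""
    rows = [[(v >> (64 * j)) & MASK64 for j in range(LIMBS)] for v in values]
    return np.array(rows, dtype=np.uint64)


def from_limbs(arr):
    """Unpack an (N, 4) uint64 array back into a list of Python ints."""
    return [sum(int(limb) << (64 * j) for j, limb in enumerate(row))
            for row in arr]


def add256(a, b):
    """Element-wise (a + b) mod 2**256 for two (N, 4) uint64 arrays."""
    result = np.empty_like(a)
    carry = np.zeros(a.shape[0], dtype=np.uint64)
    for j in range(LIMBS):
        partial = a[:, j] + b[:, j]        # uint64 arrays wrap modulo 2**64
        overflow1 = partial < a[:, j]      # wrapped while adding the limbs
        total = partial + carry
        overflow2 = total < partial        # wrapped while adding the carry
        result[:, j] = total
        carry = (overflow1 | overflow2).astype(np.uint64)
    return result                          # carry out of the top limb is dropped


if __name__ == "__main__":
    x = to_limbs([2**64 - 1, 2**200 + 5])
    y = to_limbs([1, 2**200 + 7])
    print(from_limbs(add256(x, y)))
```

Every row of the batch is handled independently of the others, which is the shape of work a GPU needs: one number (or one pair of operands) per thread. Multiplication can be built the same way, but it is usually easier with 32-bit limbs so that each partial product fits in a 64-bit word.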