CUMP only implements addition, subtraction and multiplication; it doesn't implement mpz_powm, which is needed for the primality tests. There are plenty of efficiency gains to be had just from optimizing the existing code, without even trying to re-implement modular exponentiation in CUDA.
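For anyone wondering why the missing mpz_powm matters: it computes base^exp mod m, and that is the core operation of a Fermat-style primality check. Here is a rough sketch of the idea in plain Python, using right-to-left square-and-multiply. The names `powm` and `fermat_probable_prime` are made up for illustration; this is not GMP or CUMP code, and the actual tests in the existing code may well be something different.

```python
def powm(base, exp, mod):
    """Return (base ** exp) % mod using right-to-left square-and-multiply.

    Illustrative stand-in for what mpz_powm computes; not GMP's algorithm.
    Assumes exp >= 0 and mod >= 1.
    """
    if mod == 1:
        return 0
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:
            # Current low bit of the exponent is set: fold this power in.
            result = (result * base) % mod
        # Square for the next bit; reducing each step keeps operands small.
        base = (base * base) % mod
        exp >>= 1
    return result


def fermat_probable_prime(n, a=2):
    """Fermat check to base a: True if n passes, False if n is composite.

    A pass only means "probably prime"; some composites pass too.
    """
    if n < 4:
        return n in (2, 3)
    return powm(a, n - 1, n) == 1


if __name__ == "__main__":
    print(powm(4, 13, 497))
    print(fermat_probable_prime(97), fermat_probable_prime(91))
```

Note that the loop is nothing but multiplications and reductions mod m, so the multiply CUMP already has is only half of what is needed; the big-number reduction is the missing piece.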
Will