Here is an example from a few years ago in which a GPU implementation of ECM was developed and compared against a standard CPU; the result was roughly 2x faster. I know we're not talking about ECM here, but it's suggestive of what one might expect.
http://eecm.cr.yp.to/gpuecm-20090127.pdf

Modular arithmetic is very easy to implement with the four basic arithmetic functions, so I'm not sure what the holdup is around that.
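To make the point about the four basic operations concrete, here is a minimal sketch in Python. The function names (`mod_reduce`, `mod_add`, `mod_sub`, `mod_mul`, `mod_pow`) are made up for illustration and are not taken from the linked paper or from any library. It assumes a positive modulus `n` and uses only add, subtract, multiply, and integer divide.

```python
def mod_reduce(a, n):
    # a mod n written as a - n * floor(a / n); assumes n > 0.
    # Python's // floors, so the result lands in [0, n) even for negative a.
    return a - n * (a // n)

def mod_add(a, b, n):
    return mod_reduce(a + b, n)

def mod_sub(a, b, n):
    return mod_reduce(a - b, n)

def mod_mul(a, b, n):
    return mod_reduce(a * b, n)

def mod_pow(base, exp, n):
    # Square-and-multiply built on mod_mul; assumes exp >= 0.
    result = mod_reduce(1, n)
    base = mod_reduce(base, n)
    while exp > 0:
        if exp - 2 * (exp // 2) == 1:  # low bit of exp is set
            result = mod_mul(result, base, n)
        base = mod_mul(base, base, n)
        exp = exp // 2
    return result
```

One caveat on the sketch: Python integers are arbitrary precision, so the intermediate product in `mod_mul` never overflows. With fixed-width machine words the product `a * b` has to fit in a wider type or be split across several words, which this version does not show.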