That part's easy - there's a CUDA implementation of SHA512 in the John the Ripper source with a permissive open source / public domain license. Already have that running.
What does uselessly slow mean? Based upon preliminary poking, my guess is that I can write something that will get 500 c/s on a decent card. It won't be the most amazing thing since sliced bread, but it will be substantially better than a CPU in terms of c/s/w and c/s/$.
(Yes, yes, I know I said I didn't want to put the time in, but I couldn't resist writing a few lines of code as a test before making the offer to write and open source it. :-)
-Dave
Something like 300000 shares/day; that's 300000/2/24/60/60 ≈ 1.7 collisions per second.

My 6770 can only allocate 520MB of memory and has no 64-bit atomics, so I implemented a bucket sort (using the bucket number as storage for the high bits of the hash), then copied the entire memory back to the CPU and used a Bloom filter to find duplicates. Well, some day I'll get my hands on a better GPU, but right now they are sold out.
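
For anyone following along, here is a rough CPU-only Python sketch of the idea described above; it is not the actual GPU code. The bucket number carries the high bits of each hash, so only the low bits need to be stored, and a Bloom filter then flags likely duplicates. The function names, the bit widths, the BLAKE2b-based double hashing inside the filter, and the exact second pass that weeds out false positives are all assumptions made for illustration.

```python
import hashlib


def bucket_hashes(hashes, bucket_bits, low_bits):
    """Scatter hashes into buckets; the bucket index keeps the high bits,
    so each entry only stores the low `low_bits` bits."""
    buckets = [[] for _ in range(1 << bucket_bits)]
    low_mask = (1 << low_bits) - 1
    for h in hashes:
        buckets[h >> low_bits].append(h & low_mask)
    return buckets


class BloomFilter:
    """Minimal Bloom filter (hypothetical helper, not from the post)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions by double hashing one BLAKE2b digest.
        digest = hashlib.blake2b(value.to_bytes(8, "little"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_and_check(self, value):
        """Insert value; return True if it may have been inserted before."""
        seen = True
        for pos in self._positions(value):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def iter_full_hashes(buckets, low_bits):
    """Rebuild each full hash from (bucket number, stored low bits)."""
    for bucket_no, lows in enumerate(buckets):
        for low in lows:
            yield (bucket_no << low_bits) | low


def find_duplicates(buckets, low_bits, bloom):
    # Pass 1: the Bloom filter flags candidates (may include false positives).
    candidates = {h for h in iter_full_hashes(buckets, low_bits)
                  if bloom.add_and_check(h)}
    # Pass 2: count only the candidates exactly to confirm real duplicates.
    counts = {}
    for h in iter_full_hashes(buckets, low_bits):
        if h in candidates:
            counts[h] = counts.get(h, 0) + 1
    return sorted(h for h, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Toy sizes: 6 bucket bits + 4 stored bits = 10-bit hashes.
    buckets = bucket_hashes([5, 57, 5, 1000, 57, 77], bucket_bits=6, low_bits=4)
    print(find_duplicates(buckets, 4, BloomFilter(1 << 12, 3)))  # [5, 57]
```

The point of the bucket trick is that a hash wider than 32 bits can be kept in 32-bit slots, since the bucket index already encodes the remaining high bits; that sidesteps the need for 64-bit atomics when scattering on the GPU.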