It does seem like there's at least one private GPU implementation for PTS, which is great for the person who has it, but it also contributes to their ability to mount a 51% attack on the network.
I have had a working GPU miner for Protoshares almost from the beginning, but it is uselessly slow. I can open source my OpenCL implementation of SHA512, if that will help you.
That part's easy - there's a CUDA implementation of SHA512 in the John the Ripper source with a permissive open source / public domain license. Already have that running.
What does uselessly slow mean? Based upon preliminary poking, my guess is that I can write something that will get 500 c/s on a decent card. It won't be the most amazing thing since sliced bread, but it will be substantially better than a CPU in terms of c/s/w and c/s/$.
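To make the c/s/w and c/s/$ comparison concrete, here is a rough Python sketch of the arithmetic. The `Miner` class and `advantage` helper are hypothetical names made up for illustration, and apart from the 500 c/s guess above, every number in the usage example is a placeholder, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class Miner:
    """One mining device. Hypothetical helper type for illustration only."""
    name: str
    rate: float   # throughput in c/s
    watts: float  # power draw under load, in watts
    price: float  # purchase price, in dollars

    def rate_per_watt(self) -> float:
        # c/s/w: throughput per watt of power draw
        return self.rate / self.watts

    def rate_per_dollar(self) -> float:
        # c/s/$: throughput per dollar of hardware cost
        return self.rate / self.price


def advantage(a: Miner, b: Miner) -> dict:
    """How many times better `a` is than `b` on each efficiency metric."""
    return {
        "c/s/w": a.rate_per_watt() / b.rate_per_watt(),
        "c/s/$": a.rate_per_dollar() / b.rate_per_dollar(),
    }


if __name__ == "__main__":
    # 500 c/s is the guess from this post; the wattage, prices, and the
    # CPU rate are made-up placeholders. Substitute real measurements.
    gpu = Miner("gpu", rate=500.0, watts=250.0, price=500.0)
    cpu = Miner("cpu", rate=50.0, watts=100.0, price=250.0)
    for metric, ratio in advantage(gpu, cpu).items():
        print(f"{metric}: GPU is {ratio:.1f}x the CPU")
```

A ratio above 1 on both metrics is what "substantially better than a CPU" would mean here. Whether a real card clears that bar depends entirely on the measured numbers you plug in.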
(Yes, yes, I know I said I didn't want to put the time in, but I couldn't resist writing a few lines of code as a test before making the offer to write and open source it. :-)
-Dave