Using shared memory is undoubtedly a substantial advantage, but you can't use it that way when it requires a complete refactoring of all the GPU libraries. (In fact, your code doesn't even compile.) Anyway, you understood from the outset that shared memory is the final goal.

I thought you knew how to code. I gave you the basics; if you can't code it to make it work, you are a bad coder.