Alexis' code is very efficient, like all the code I've seen from him.
His Skein code already gives pretty high INT32 pipe utilization (~95% on a 1080 Ti).
I don't think we can expect much (if any) improvement unless you can reduce the total number of INT32 operations needed to calculate the hashes.
Not sure whether Skein has that kind of room for improvement.
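
To give a rough idea of where those INT32 operations come from, here is a minimal Python sketch of the Threefish MIX step (the core of Skein) done once on 64-bit words and once split into 32-bit halves, the way a GPU without native 64-bit integer units has to do it. The function names, the funnel-shift helper and the "6 operations per MIX" count are my own assumptions for illustration. They are not taken from Alexis' kernel, and the instructions the compiler actually emits may differ.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mix64(x0, x1, r):
    """Reference Threefish MIX on 64-bit words: add, rotate left, xor."""
    y0 = (x0 + x1) & MASK64
    y1 = (((x1 << r) | (x1 >> (64 - r))) & MASK64) ^ y0
    return y0, y1


def funnel_left(hi, lo, n):
    """Upper 32 bits of the 64-bit value (hi:lo) shifted left by n, 0 <= n < 32."""
    return ((((hi << 32) | lo) << n) >> 32) & MASK32


def mix32(x0, x1, r):
    """Same MIX using only 32-bit halves; also returns an assumed op count."""
    ops = 0
    x0_lo, x0_hi = x0 & MASK32, x0 >> 32
    x1_lo, x1_hi = x1 & MASK32, x1 >> 32

    # 64-bit add = low add, then high add with carry (assumed 2 INT32 ops).
    s = x0_lo + x1_lo
    y0_lo, carry = s & MASK32, s >> 32
    y0_hi = (x0_hi + x1_hi + carry) & MASK32
    ops += 2

    # 64-bit rotate = two funnel shifts (assumed 2 INT32 ops).
    # A rotation of 32 or more first swaps the halves, which costs nothing
    # because it is only a register rename.
    if r >= 32:
        x1_lo, x1_hi = x1_hi, x1_lo
        r -= 32
    rot_hi = funnel_left(x1_hi, x1_lo, r)
    rot_lo = funnel_left(x1_lo, x1_hi, r)
    ops += 2

    # 64-bit xor = two 32-bit xors (assumed 2 INT32 ops).
    y1_lo = rot_lo ^ y0_lo
    y1_hi = rot_hi ^ y0_hi
    ops += 2

    return (y0_hi << 32) | y0_lo, (y1_hi << 32) | y1_lo, ops


# Threefish-512 has 72 rounds with 4 MIX steps each, so under the
# assumptions above the MIX steps alone cost about this many INT32 ops
# per block (subkey additions not included).
MIX_OPS_PER_BLOCK = 6 * 4 * 72
```

The point is that add, rotate and xor are all integer work and there is nothing else in the round function to overlap them with, so once the INT32 pipe is close to full, the only way to go faster is to issue fewer of these operations.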