Yes, the W[] array is moved to registers by the compiler on GCN, but apparently on VLIW it is not, and it ends up in global memory, which is slow. This can be improved, of course (first of all, it does not have to be 62 elements long; 16 elements are enough if you reuse them). I just wonder how you managed to compile sha256_res(sha256_res()): it takes a uint16 vector as its parameter but returns only one uint.
I've tried both
(sha256_res((uint16)sha256_res(as_uint16(skein512_mid_impl(state, msg)))) & 0xf0ffffff)
and
(sha256_res(sha256_res(as_uint16(skein512_mid_impl(state, msg)))) & 0xf0ffffff)
Both compile, though they probably give wrong results. That is still enough for a test, since sha256_res runs twice, maybe just with the wrong input on the second run.

Besides, double Skein runs at 780MH/s on a 5870, so SHA256 is the current bottleneck for sure. With a good SHA implementation we will be able to reach even better performance than SHA256D.

Casting uint to uint16 compiles on your system? I guess something is awfully wrong with it, then; it is against the OpenCL spec (and common sense). Anyway, it would be great if you managed to optimize sha256. I only quickly threw together something that worked for me, and I feel somewhat embarrassed now that it is public.
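
On the point above that W[] only needs 16 elements if you reuse them: here is a minimal sketch in plain C of what that could look like. It is not the kernel's actual code. The function names (sha256_schedule_full, sha256_schedule_word) are made up for illustration, and it only covers the message schedule, not the compression rounds. The rolling version overwrites slot t & 15 in place, relying on the fact that W[t] depends only on the previous 16 words.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 small sigma functions used by the message schedule. */
static uint32_t sig0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t sig1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Reference version: expand 16 input words into a full 64-word array. */
static void sha256_schedule_full(const uint32_t in[16], uint32_t W[64])
{
    for (int t = 0; t < 16; t++)
        W[t] = in[t];
    for (int t = 16; t < 64; t++)
        W[t] = sig1(W[t - 2]) + W[t - 7] + sig0(W[t - 15]) + W[t - 16];
}

/* Rolling version (hypothetical helper): w[] holds only the last 16 words.
 * Must be called with t = 0, 1, ..., 63 in order; returns the word for round t.
 * Before the update, w[t & 15] still holds W[t - 16], so adding the other
 * three terms to it yields W[t] in the same slot. */
static uint32_t sha256_schedule_word(uint32_t w[16], int t)
{
    if (t >= 16)
        w[t & 15] += sig1(w[(t - 2) & 15]) + w[(t - 7) & 15] + sig0(w[(t - 15) & 15]);
    return w[t & 15];
}
```

In a kernel you would probably want the round loop fully unrolled so that every index into w[] is a compile-time constant. That is my assumption about what gives the compiler the best chance of keeping the array in registers; I have not measured it on VLIW hardware.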