Thanks, any chance of an AVX version (if there is any speed boost)?
AVX? Likely not any time soon. The code for both SSE4 and AVX requires an expensive startup and finish, just for one round, so it's currently slower than the default. The default is an assembly routine written by the original creators of the groestl hash.
The largest problem now is actually size. We're swapping between six different hashes all at once, all the time. Small changes in loop sizes can make a 30% difference in speed.
Out of the six, skein, bmw and jh are already so much faster that it's a waste of time to speed them up further. Making them smaller, however, is important, because it's slow to keep track of more stuff. Those three combined only use around 15% of the CPU.
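As a rough illustration of why speeding up those three is not worth it, here is a back-of-the-envelope Amdahl's law estimate. It is only a sketch: the function name is hypothetical, the 15% share is taken at face value from the post above, and nothing here is measured from the miner itself.

```python
def overall_speedup(fraction: float, stage_speedup: float = float("inf")) -> float:
    """Amdahl's law: overall speedup when `fraction` of the total runtime
    is made `stage_speedup` times faster and the rest is left unchanged."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


# Assumption: skein, bmw and jh together account for about 15% of CPU time.
share = 0.15

# Upper bound: the three hashes become infinitely fast.
print(f"{overall_speedup(share):.3f}")       # 1.176

# More realistic case: the three hashes become twice as fast.
print(f"{overall_speedup(share, 2.0):.3f}")  # 1.081
```

Under that assumption, even making the three hashes free would give at most about an 18% overall gain, and doubling their speed only about 8%. That is small next to the 30% swings from loop-size changes mentioned above.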
My next step is to shrink everything some more. New features should get worked in after that.