I'll add official 8- and 16-wide support in a bit. It should be useful on, say, AVX, if you manually enable CPU mining in the code. From what I've heard, the CPU compiler in SDK 2.6 has gotten a lot better.
I'll be watching the repository, then.

It should almost certainly help on more modern CPUs and on Larrabee/Intel MIC.
I just pushed a new commit. You can now give vector layouts instead of the old bitmasks: instead of -v 18, use -v 1,1; instead of -v 36, use -v 4; instead of -v 40, use -v 4,4; instead of -v 4, use -v 2,2; and so on.
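For anyone scripting around the new option, here is a rough Python sketch of how a layout string like "4,4" could be split up. It is not the code from the commit. The names parse_layout and total_width are made up for illustration, and it assumes each comma-separated entry is the width of one vector and that the entries add up to the number of values handled per work item, which is only my reading of the examples above.

```python
def parse_layout(spec):
    """Split a layout string such as "4,4" into a list of integer widths.

    Hypothetical helper: assumes every comma-separated entry is the width
    of one vector and must be a positive integer.
    """
    widths = []
    for part in spec.split(","):
        part = part.strip()
        # Reject empty entries, signs, and anything that is not a plain number.
        if not part.isdigit() or int(part) == 0:
            raise ValueError("bad vector width: %r" % part)
        widths.append(int(part))
    return widths


def total_width(widths):
    """Sum the widths (assumed here to be the values handled per work item)."""
    return sum(widths)


if __name__ == "__main__":
    # Layout strings taken from the examples in the post.
    for spec in ("1,1", "4", "4,4", "2,2"):
        layout = parse_layout(spec)
        print(spec, layout, total_width(layout))
```

Under that assumption, -v 4,4 would be two 4-wide vectors, so 8 values in total. The sketch deliberately does not try to convert the old bitmask values, since the post only gives a few example pairs.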