Guys, to optimize CPU miners, use a midstate approach.
You can feed the first 80 bytes of the block header (including the nonce) into the sph_fugue256() function,
and all that changes between hashes in the sph_fugue256_context is the ctx_fugue.partial field,
which holds the nonce in big-endian notation.
So precompute the state once per scanhash call. Inside the scan loop, take a copy of that
saved context, set partial to the current nonce, and call sph_fugue256_close(); that is all the
remaining work. Instant 5x speed gain.
My Phenom II X6 got 2.5 MHash/s out of this on 6 cores.
It's low-hanging fruit, really.
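To make the idea concrete, here is a minimal sketch of the midstate pattern. It does not use sphlib or Fugue: toy_ctx, toy_init(), toy_update() and toy_close() are made-up stand-ins with the same shape as the sph_fugue256 context described above (the last 4-byte word waits in a partial field until close), and the mixing function is arbitrary. It also assumes the nonce sits in header bytes 76..79 in big-endian order. The only point is that patching partial on a copy of the saved context gives the same result as rehashing all 80 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for sph_fugue256_context. This is NOT Fugue; it only has the
 * same shape: the last 4-byte word fed in waits in `partial` until close. */
typedef struct {
    uint32_t state[4];
    uint32_t partial; /* last input word, big endian */
} toy_ctx;

static uint32_t rotl32(uint32_t x, unsigned r) { return (x << r) | (x >> (32 - r)); }

static uint32_t load_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Absorb one 32-bit word (arbitrary toy mixing, no security claims). */
static void toy_mix(toy_ctx *c, uint32_t w) {
    c->state[0] = rotl32(c->state[0] ^ w, 5) * 0x9E3779B1u;
    c->state[1] = rotl32(c->state[1] + c->state[0], 11);
    c->state[2] ^= c->state[1];
    c->state[3] = rotl32(c->state[3] + c->state[2], 17);
}

static void toy_init(toy_ctx *c) {
    c->state[0] = 0x01234567u;
    c->state[1] = 0x89ABCDEFu;
    c->state[2] = 0xFEDCBA98u;
    c->state[3] = 0x76543210u;
    c->partial = 0;
}

/* One-shot update: mixes every word except the last, which stays in partial. */
static void toy_update(toy_ctx *c, const unsigned char *data, size_t len) {
    assert(len >= 4 && len % 4 == 0);
    for (size_t i = 0; i + 4 < len; i += 4)
        toy_mix(c, load_be32(data + i));
    c->partial = load_be32(data + len - 4);
}

/* Close: absorb the pending word, run a few final rounds, fold the state. */
static uint32_t toy_close(toy_ctx *c) {
    toy_mix(c, c->partial);
    for (uint32_t i = 0; i < 4; i++)
        toy_mix(c, 0x80000000u + i);
    return c->state[0] ^ c->state[1] ^ c->state[2] ^ c->state[3];
}

/* Slow path: rebuild the header with the nonce and hash all 80 bytes. */
static uint32_t hash_full(const unsigned char header[80], uint32_t nonce) {
    unsigned char buf[80];
    toy_ctx c;
    memcpy(buf, header, 80);
    buf[76] = (unsigned char)(nonce >> 24);
    buf[77] = (unsigned char)(nonce >> 16);
    buf[78] = (unsigned char)(nonce >> 8);
    buf[79] = (unsigned char)nonce;
    toy_init(&c);
    toy_update(&c, buf, 80);
    return toy_close(&c);
}

/* Fast path: copy the saved context, patch the nonce, and only run close. */
static uint32_t hash_midstate(const toy_ctx *mid, uint32_t nonce) {
    toy_ctx c = *mid;  /* cheap struct copy */
    c.partial = nonce; /* the only nonce-dependent field */
    return toy_close(&c);
}

/* Scan `count` nonces both ways; returns 1 if every result agrees. */
static int midstate_matches_full(const unsigned char header[80],
                                 uint32_t first_nonce, uint32_t count) {
    toy_ctx mid;
    toy_init(&mid);
    toy_update(&mid, header, 80); /* done once per scan, not once per nonce */
    for (uint32_t i = 0; i < count; i++) {
        uint32_t n = first_nonce + i;
        if (hash_midstate(&mid, n) != hash_full(header, n))
            return 0;
    }
    return 1;
}
```

In a real miner the same shape would presumably map onto sph_fugue256_init() and sph_fugue256() once per scanhash call, with a context copy plus sph_fugue256_close() per nonce, but check the partial handling against the sphlib version you actually build with.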
Next step: tackle SSE2 or AVX vectorization, which should get close to 5-10 MHash/s.
Christian