Interesting and probably faster because it's benchmark tested. He had the benefit of seeing the results
and tweaking. I gave up on super-optimizing memcpy and went with a simpler approach because all I
wanted was to avoid some of the overhead of detecting alignment, odd sizes and vector capabilities.
Most of the memcpy calls in cpuminer operate on aligned data with integral sizes.
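
To illustrate the idea, here is a minimal sketch of a copy routine that skips the runtime checks. It is not the actual cpuminer code: the name copy_words64 and the choice of 64-bit words as the copy unit are assumptions made for the example. The caller is trusted to pass aligned, non-overlapping buffers whose size is a whole number of words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not cpuminer's actual code: copies n_words 64-bit
 * words. The caller guarantees 8-byte alignment, a size that is a whole
 * number of words, and non-overlapping buffers, so release builds do no
 * runtime detection at all. */
static inline void copy_words64(uint64_t *restrict dst,
                                const uint64_t *restrict src,
                                size_t n_words)
{
    /* Debug-only sanity check; compiled out when NDEBUG is defined. */
    assert(((uintptr_t)dst % 8 == 0) && ((uintptr_t)src % 8 == 0));

    /* Plain word loop; the compiler is free to vectorize it. */
    for (size_t i = 0; i < n_words; i++)
        dst[i] = src[i];
}
```

Because the size is given in words rather than bytes, there is no tail handling for odd sizes, and the restrict qualifiers tell the compiler the buffers do not overlap. Whether this beats the library memcpy in practice would need to be benchmarked.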