There's no reason to force a choice between CPU and GPU, and OpenCL is likely the best choice for scheduling and partitioning the work. A couple of CPU cores could be used to prepare the initial work state, which is then queued for hashing in a GPU command queue. Even if the kernel runs only on the CPU, you can still use inline assembly in the host program alongside the OpenCL kernel(s) if you want to (and if you can do better than SSEx optimization). Have fun with it.
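
For what it's worth, here is a rough sketch of the "CPU cores prepare, something else hashes" split described above. It is not real OpenCL: the consumer thread just calls `hashlib.sha256` on the CPU as a stand-in for enqueueing work on a GPU command queue. The work-item layout, `prepare_work`, and the constants are all made up for illustration.

```python
import hashlib
import queue
import struct
import threading

NUM_PRODUCERS = 2        # "a couple of CPU cores" preparing work (assumed count)
ITEMS_PER_PRODUCER = 4   # hypothetical number of work items per producer
SENTINEL = None          # marks "this producer is done"


def prepare_work(producer_id, index):
    """Build one work item: a fixed header plus two integers (hypothetical layout)."""
    header = b"example-header"
    return header + struct.pack("<II", producer_id, index)


def producer(work_queue, producer_id):
    """CPU-side preparation: build work items and queue them for hashing."""
    for index in range(ITEMS_PER_PRODUCER):
        work_queue.put(prepare_work(producer_id, index))
    work_queue.put(SENTINEL)


def consumer(work_queue, results):
    """Stand-in for the GPU command queue: drain the queue and hash each item."""
    finished = 0
    while finished < NUM_PRODUCERS:
        item = work_queue.get()
        if item is SENTINEL:
            finished += 1
            continue
        results.append(hashlib.sha256(item).hexdigest())


def run():
    """Start the producers and the hashing consumer, then return all digests."""
    work_queue = queue.Queue(maxsize=8)  # bounded, so producers cannot run far ahead
    results = []
    hasher = threading.Thread(target=consumer, args=(work_queue, results))
    producers = [
        threading.Thread(target=producer, args=(work_queue, p))
        for p in range(NUM_PRODUCERS)
    ]
    hasher.start()
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()
    hasher.join()
    return results


if __name__ == "__main__":
    digests = run()
    print(len(digests), "work items hashed")
```

In a real setup the consumer would copy each prepared item into a device buffer and enqueue the kernel instead of hashing inline, but the queueing structure on the host side would look much the same.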