Sorry I've missed most of the recent discussion...busy with a new addition...
The r10 "trick" of letting it get clobbered requires that you know something about the calling convention of the OS in use. The Linux SSE2 code for x86_64 will completely break on Windows, since Windows preserves a different set of registers across function calls. Depending on what OSX does with r8, you may be able to let it get clobbered, saving you a cycle or two.
I never really found any docs on how the calling convention works under Darwin/Mac OSX. I'm sure reading the GCC source code would enlighten someone. But I also like my sanity.
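To make the Linux vs. Windows difference concrete, here is a rough sketch in C of a lookup that says whether a register has to be restored before a function returns. The register lists are taken from the System V AMD64 ABI (what Linux uses) and the Microsoft x64 calling convention. The names `enum abi`, `sysv_saved`, `win64_saved` and `must_preserve` are made up for illustration and are not from any real library. Darwin is left out on purpose, since I have no docs to base a list on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tag for the calling convention being targeted. */
enum abi { ABI_SYSV_AMD64, ABI_WIN64 };

/* Registers the callee must preserve under the System V AMD64 ABI (Linux). */
static const char *const sysv_saved[] = {
    "rbx", "rbp", "rsp", "r12", "r13", "r14", "r15", NULL
};

/* Registers the callee must preserve under the Microsoft x64 convention.
 * Note rdi, rsi and xmm6-xmm15, which System V treats as scratch. */
static const char *const win64_saved[] = {
    "rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15",
    "xmm6", "xmm7", "xmm8", "xmm9", "xmm10",
    "xmm11", "xmm12", "xmm13", "xmm14", "xmm15", NULL
};

/* Returns true if a function must restore `reg` before returning,
 * false if the register may be left clobbered. */
static bool must_preserve(enum abi abi, const char *reg)
{
    assert(reg != NULL);
    const char *const *list = (abi == ABI_WIN64) ? win64_saved : sysv_saved;
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(list[i], reg) == 0)
            return true;
    }
    return false;
}
```

With these lists, `must_preserve(ABI_SYSV_AMD64, "xmm6")` is false while `must_preserve(ABI_WIN64, "xmm6")` is true, which is the kind of mismatch that bites SSE2 code written for Linux. r8 and r10 are scratch in both lists.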