I see a Ma3() function in your kernel (I don't have it) that seems to be almost the same as the original Ma(), so my optimization could be applied to it as well. Why didn't you change Ma3()? Any particular reason?
Good catch.

I have been playing around a bit with the kernel. I added this macro myself; it is not part of the original kernel.