Yes. Maybe I sound like a broken record, but carry-free 5x52 arithmetic is much faster than 4x64 arithmetic on CPUs, and the compiler parallelizes it automatically. At least clang does.
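To make the difference concrete, here is a minimal sketch in C of plain 256-bit addition in both layouts. It is not this tool's code: all type and function names (`u256_4x64`, `u256_5x52`, `add_4x64`, `add_5x52`, and so on) are made up for illustration, and there is no modular reduction, only the limb handling. The point is that the 4x64 add has a carry chain linking every limb to the previous one, while the 5x52 add is five independent additions, because each limb has 12 spare bits to absorb overflow until a later normalize step.

```c
#include <assert.h>
#include <stdint.h>

#define M52 0xFFFFFFFFFFFFFULL /* mask for the low 52 bits */

/* Hypothetical types: the same 256-bit value in two limb layouts (little-endian). */
typedef struct { uint64_t d[4]; } u256_4x64; /* four full 64-bit limbs */
typedef struct { uint64_t n[5]; } u256_5x52; /* five limbs, 52 payload bits each */

/* 4x64: every limb has to wait for the carry out of the previous limb. */
static void add_4x64(u256_4x64 *r, const u256_4x64 *a, const u256_4x64 *b) {
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a->d[i] + b->d[i];
        uint64_t c1 = s < a->d[i];       /* overflow from a + b */
        r->d[i] = s + carry;
        carry = c1 | (r->d[i] < s);      /* overflow from adding the carry */
    }
}

/* 5x52: five independent adds, no carries; spare bits absorb the overflow. */
static void add_5x52(u256_5x52 *r, const u256_5x52 *a, const u256_5x52 *b) {
    for (int i = 0; i < 5; i++) r->n[i] = a->n[i] + b->n[i];
}

/* Push the accumulated overflow upward once, after several lazy adds. */
static void normalize_5x52(u256_5x52 *r) {
    uint64_t c = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = r->n[i] + c;
        r->n[i] = t & M52;
        c = t >> 52;
    }
    r->n[4] += c; /* top limb keeps whatever is left */
}

static u256_5x52 to_5x52(const u256_4x64 *a) {
    u256_5x52 r;
    r.n[0] = a->d[0] & M52;
    r.n[1] = ((a->d[0] >> 52) | (a->d[1] << 12)) & M52;
    r.n[2] = ((a->d[1] >> 40) | (a->d[2] << 24)) & M52;
    r.n[3] = ((a->d[2] >> 28) | (a->d[3] << 36)) & M52;
    r.n[4] = a->d[3] >> 16;
    return r;
}

/* Expects normalized limbs; anything above bit 255 is dropped. */
static u256_4x64 from_5x52(const u256_5x52 *a) {
    u256_4x64 r;
    r.d[0] = a->n[0] | (a->n[1] << 52);
    r.d[1] = (a->n[1] >> 12) | (a->n[2] << 40);
    r.d[2] = (a->n[2] >> 24) | (a->n[3] << 28);
    r.d[3] = (a->n[3] >> 36) | (a->n[4] << 16);
    return r;
}

/* Computes a + b + b both ways and checks that the results match. */
static void demo_agree(void) {
    u256_4x64 a = {{0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL,
                    0x0123456789ABCDEFULL, 0x0FEDCBA987654321ULL}};
    u256_4x64 b = {{1, 0, 0xFEDCBA9876543210ULL, 0x00000000FFFFFFFFULL}};
    u256_4x64 s4;
    add_4x64(&s4, &a, &b);
    add_4x64(&s4, &s4, &b);

    u256_5x52 fa = to_5x52(&a), fb = to_5x52(&b), s5;
    add_5x52(&s5, &fa, &fb);
    add_5x52(&s5, &s5, &fb);  /* second add, still no normalization */
    normalize_5x52(&s5);      /* carries are resolved once, here */
    u256_4x64 back = from_5x52(&s5);
    for (int i = 0; i < 4; i++) assert(back.d[i] == s4.d[i]);
}
```

The loop in `add_5x52` has no dependency between iterations, which is the kind of code a compiler can vectorize or schedule in parallel on its own. The cost is the extra limb and the occasional `normalize_5x52` call. A real field implementation would also need multiplication and reduction modulo the prime written for the 5x52 layout, which this sketch does not cover.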
How hard is it to change from one to the other? I mean, so that I can update this tool myself.