The first benchmarks on ARM (Raspberry Pi 400, no overclocking) are better than I expected: around 30 MKeys/s with 4 cores.
That's an impressive speed, considering that a Xeon E3-1230 @ 3.2GHz with 4c/8t only does 10 MKeys/s and that the CPU inside this is:
Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz
So clock speed and thread count are both halved on this silicon, yet it's 3x faster. And this makes sense because x86_64 has 16 64-bit general-purpose registers and 16 128-bit SSE registers (Kangaroo doesn't use AVX yet), while armv8-a (the arch used in Cortex-A72) has 31 64-bit general-purpose registers and 32 128-bit NEON registers.
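To put the comparison above on a common footing, here is a rough sketch that normalizes throughput by thread count and clock speed. The figures are the ones quoted in this post; the helper name mkeys_per_thread_ghz is made up for illustration, and dividing by threads x GHz is only a crude yardstick (it ignores IPC, memory, and hyperthreading effects).

```python
# Rough normalization of the benchmark numbers quoted in the post.
# mkeys_per_thread_ghz is a hypothetical helper, not part of Kangaroo.

def mkeys_per_thread_ghz(mkeys: float, threads: int, ghz: float) -> float:
    """Throughput in MKeys/s per hardware thread per GHz (a crude yardstick)."""
    return mkeys / (threads * ghz)

# Raspberry Pi 400: ~30 MKeys/s, 4 cores / 4 threads, 1.5 GHz (as quoted above)
pi_score = mkeys_per_thread_ghz(30, 4, 1.5)

# Xeon E3-1230: ~10 MKeys/s, 4 cores / 8 threads, 3.2 GHz (as quoted above)
xeon_score = mkeys_per_thread_ghz(10, 8, 3.2)

print(f"Pi 400: {pi_score:.2f} MKeys/s per thread-GHz")
print(f"Xeon:   {xeon_score:.2f} MKeys/s per thread-GHz")
print(f"Ratio:  {pi_score / xeon_score:.1f}x")
```

By this measure the gap per thread and per GHz is much larger than the raw 3x, which is consistent with the register-count argument but doesn't prove it.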
x86 was always an inefficient arch anyway because of all the backward compatibility it had to preserve. With ARM it's like "recompile for our new generation or else", and stuff written for armv8 won't work on armv7 AFAIK.
Do you know if there is a very low-cost board (mini PC) with an ARM CPU (similar to the Raspberry Pi Compute Module 4, see link below) that could do the work in a parallelized way? It would be interesting to see whether the ratio (computing power)/price is better than for the latest CUDA-compatible graphics cards.
A 2080Ti costs $1000 and can do roughly 1100 MKeys/s by my guess. A Raspberry Pi 4 costs $35, so you can buy up to 28 of those for the price of one 2080Ti, and all those Pi 4s combined can do 30x28=840 MKeys/s, which is only a little slower than the GPU. If you manage to push each Pi from 30 to about 39 MKeys/s (1100/28), e.g. by using the NEON instruction set, then you can match GPU performance.
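The price/performance estimate above can be redone as a quick script. This is only a back-of-the-envelope sketch using the numbers from this post (the GPU speed is a guess, as noted); it assumes the board price is the whole cost and ignores power supplies, SD cards, networking, and electricity. The function name pi_cluster_estimate is made up for illustration.

```python
# Back-of-the-envelope comparison: one GPU vs. as many Pis as the same money buys.
# All figures come from the post; extra costs (PSU, SD card, power) are ignored.

def pi_cluster_estimate(gpu_price, gpu_mkeys, pi_price, pi_mkeys):
    boards = gpu_price // pi_price      # whole boards affordable for one GPU's price
    cluster_mkeys = boards * pi_mkeys   # combined throughput, assuming perfect scaling
    break_even = gpu_mkeys / boards     # per-board speed needed to match the GPU
    return boards, cluster_mkeys, break_even

boards, cluster_mkeys, break_even = pi_cluster_estimate(
    gpu_price=1000,  # 2080Ti, USD
    gpu_mkeys=1100,  # guessed GPU throughput, MKeys/s
    pi_price=35,     # Raspberry Pi 4, USD
    pi_mkeys=30,     # measured on the Pi 400, MKeys/s
)

print(f"Boards per GPU price: {boards}")
print(f"Cluster throughput:   {cluster_mkeys} MKeys/s")
print(f"Break-even per board: {break_even:.1f} MKeys/s")
```

Changing the four inputs makes it easy to plug in another board or GPU once real benchmark numbers are available.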