It seems to me that ModInv should be much lower than 50% of ModMulK1. Are you sure you wouldn't gain a significant advantage from using more than 256 elements per batch? Why not try 1024 or 4096?
Maybe there is a confusion between IntGroup::ModInv (the grouped inversion, which includes 256*3 ModMulK1 calls) and Int::ModInv (the true ModInv).
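To make the distinction concrete, here is a minimal sketch of a grouped inversion of the kind the "256*3 ModMulK1" figure suggests (commonly known as Montgomery's trick): one true inversion plus roughly three multiplications per element. This is not VanitySearch's code. The toy modulus `P`, the helpers `mulmod` and `invmod`, and the function `batch_inverse` are hypothetical names chosen for illustration, and the single inversion uses Fermat exponentiation as a stand-in for the real ModInv.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (assumption for illustration; the real code works on
// 256-bit integers modulo the secp256k1 field prime).
static const uint64_t P = 1000003ULL;

// Modular multiplication; inputs are assumed to be already reduced below P,
// so the 64-bit product cannot overflow.
static uint64_t mulmod(uint64_t a, uint64_t b) { return (a * b) % P; }

// The "true" inversion: a^(P-2) mod P by square-and-multiply.
// Stand-in only; it is not the inversion algorithm discussed in the thread.
static uint64_t invmod(uint64_t a) {
    assert(a % P != 0);  // zero has no inverse
    uint64_t result = 1, base = a % P, e = P - 2;
    while (e > 0) {
        if (e & 1) result = mulmod(result, base);
        base = mulmod(base, base);
        e >>= 1;
    }
    return result;
}

// Inverts every element of v in place using a single invmod call and
// 3*(n-1) multiplications. All elements must be nonzero and below P.
static void batch_inverse(std::vector<uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;

    // prefix[i] = v[0] * v[1] * ... * v[i]
    std::vector<uint64_t> prefix(n);
    prefix[0] = v[0];
    for (std::size_t i = 1; i < n; ++i) prefix[i] = mulmod(prefix[i - 1], v[i]);

    // One true inversion for the whole group.
    uint64_t inv = invmod(prefix[n - 1]);

    // Walk backwards, peeling one element off the running inverse each step.
    for (std::size_t i = n - 1; i > 0; --i) {
        const uint64_t vi = v[i];
        v[i] = mulmod(inv, prefix[i - 1]);  // inverse of v[i]
        inv = mulmod(inv, vi);              // inverse of prefix[i-1]
    }
    v[0] = inv;
}
```

Calling `batch_inverse` on a vector of 256 elements performs exactly one `invmod` call. That is why, in a profile, the grouped routine's time is dominated by multiplications rather than by the inversion itself.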
Look only at the column on the right (Self CPU).
ModInv takes ~2% (using compressed addresses), so if I double the group size I can expect a ~1% speed increase for the CPU release. I ran the test on 1 core and, as expected, the key rate goes from 3.4 MKey/s to 3.44 MKey/s. Of course, for other applications where you do not need to hash, you can expect a more significant speed increase.
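The estimate above can be checked with a small sketch. It assumes a simplified model that is mine, not a measurement: the true inversion costs a fixed fraction `f` of the run time per group, and everything else scales with the number of keys, so a group `k` times larger divides the inversion share by `k`. The function name `predicted_speedup` is hypothetical.

```cpp
#include <cassert>

// Predicted speedup when the one inversion per group, which takes a
// fraction f of the run time, is amortized over a group k times larger.
// Simplified model (assumption): all other costs scale with the key count.
static double predicted_speedup(double f, double k) {
    assert(f >= 0.0 && f < 1.0 && k > 0.0);
    return 1.0 / (1.0 - f + f / k);
}
```

With `f = 0.02` and `k = 2` the model gives a speedup of about 1.01, which takes 3.4 MKey/s to roughly 3.43 MKey/s, in line with the measured 3.44 MKey/s. It also shows why going to 1024 or 4096 elements brings little more: even with the inversion share driven to zero, the gain is capped at about 2%.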
I attach a new CPU profile with SSE disabled (-nosse option) and using compressed addresses. This profile should be close enough to the GPU profile, since there are no SIMD instructions on the GPU to speed up the hash functions.

Here ModInv falls to 1%.
For VanitySearch, having a smaller group size is better (this is one reason why I worked a lot on this DRS62 ModInv implementation). I can double the size of the group (I will definitely do it) but not more. The GPU kernel performs one group per thread and sends the hash160 values back to the CPU. If the group size is too large, memory transfer and allocation become a problem. Divide and rule.

It's amazing how much progress is being made on this software so quickly. Great work!
Thanks
