So it’s futile for me to pray to God to guess the puzzle. Without an accelerated ModMulK, is there really nothing that can be done?

To speed up ModMulK, you generally have a few options. There is no new math available that could solve this.
If the algorithm is already near-optimal (e.g., using Montgomery reduction), then hardware is the only way to go faster.
Rewriting ModMulK in assembly can improve performance, but only if the compiler’s output is suboptimal (e.g., missed register allocations, unnecessary spills, bugs).
Example:
modMulK:
mov rax, rdi ; x
mul rsi ; x * K (rdx:rax = full product)
mov rcx, rdx ; high bits
shld rdx, rax, 64 ; prepare for reduction
; ... Montgomery steps ...
ret
Replacing div with Montgomery/Barrett Can Give a 5–10x Speedup. These methods replace division with multiplications, shifts, and additions, which are much faster (often 1-5 cycles per operation).
However, if the compiler already used Montgomery or Barrett reduction, the improvement might be smaller—perhaps just 10–20% from fine-tuning.
You can also exploit CPU-specific features like AVX-256/AVX-512, carry-less multiplication, or fused operations—similar to techniques used in Cyclone.
Alternatively, 3,000 GPUs would indeed speed things up—assuming you have enough money to throw it out the window with a shovel. Therefore, you should already be rich enough to participate in this puzzle.
This is the "brute-force" approach—scaling horizontally with hardware and a lot of cash.