Amazing job, sir. A hero we all needed within this chaos.
Can't the optimization level be set on a per-file basis? Or could the code that can't be optimized be split out?
Can’t wait to see this.

The .cpp files are already compiled by GCC with -O2, and here are the NVCC flags I put in the Makefile:
NVCCFLAGS=-std=c++11 --ptxas-options="-v --opt-level 0" -Xcicc -O0 --compile --compiler-options -O2 -gencode=arch=compute_${COMPUTE_CAP},code=sm_${COMPUTE_CAP}
So the C++ files are compiled with -O2 but I completely disabled cicc and ptxas optimization. This resulted in several more CUDA functions being included in the Cubin (versus about 6 when the default optimization level 3 is used).
I want to see if it behaves properly under optimization levels 1 and 2 first. At 20 MKey/s it looks like there's still more work to do, though I'm aware that I can't push the optimization all the way up, so the speed is never going to be as fast as it should be.
(Default block, thread, and points settings this time.)
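
For stepping through levels 1 and 2 without editing the Makefile each time, the optimization level could be pulled out into a variable. This is only a rough sketch: OPT is a hypothetical variable name that is not in the actual Makefile, and the flags are the same ones quoted above with the level substituted in. I am assuming cicc and ptxas should always be moved to the same level together.

```make
# OPT is a hypothetical variable: the cicc/ptxas optimization level.
# It defaults to 0, which reproduces the flags quoted above.
OPT ?= 0

# Same flags as before, with the hard-coded 0 replaced by $(OPT).
# The host compiler still gets -O2 via --compiler-options.
NVCCFLAGS=-std=c++11 --ptxas-options="-v --opt-level $(OPT)" -Xcicc -O$(OPT) --compile --compiler-options -O2 -gencode=arch=compute_${COMPUTE_CAP},code=sm_${COMPUTE_CAP}
```

With that in place, something like "make clean" followed by "make OPT=1" (and then OPT=2) should rebuild at each level, since a variable given on the make command line overrides the default set in the file.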
