It turned out to be impossible to tune memory timings by hand when they are modified on the fly, so I ended up implementing a fully automated optimizer for memory timings, overclocking, and algorithm parameters such as intensity and global work size. I was pulling my hair out over this stuff, but it should be almost over...
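
To give a rough idea of what an automated search over these settings can look like, here is a minimal sketch of a coordinate-wise hill climb. It is not the actual optimizer: the parameter names, the value grids, and `fake_benchmark` are all hypothetical stand-ins, and a real benchmark would have to apply the settings to the GPU and measure the hashrate (returning a very low score if the card crashes or produces errors).

```python
import math
from typing import Callable, Dict, List, Tuple

Params = Dict[str, int]

# Hypothetical search space: each parameter may only take the listed values.
GRID: Dict[str, List[int]] = {
    "mem_timing_level": [0, 1, 2, 3, 4, 5],
    "core_clock_mhz": [1000, 1050, 1100, 1150, 1200, 1250, 1300],
    "intensity": [16, 17, 18, 19, 20, 21, 22, 23, 24],
    "global_work_size": [4096, 8192, 16384, 32768],
}


def fake_benchmark(p: Params) -> float:
    """Stand-in for a real hashrate measurement (assumption, not real data)."""
    return (
        30.0
        - (p["mem_timing_level"] - 3) ** 2
        - ((p["core_clock_mhz"] - 1150) / 50) ** 2
        - (p["intensity"] - 20) ** 2
        - (math.log2(p["global_work_size"]) - 14) ** 2
    )


def hill_climb(
    benchmark: Callable[[Params], float],
    grid: Dict[str, List[int]],
    start: Params,
    max_rounds: int = 50,
) -> Tuple[Params, float]:
    """Move one parameter at a time to a neighbouring value if it scores higher."""
    best = dict(start)
    best_score = benchmark(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in grid.items():
            idx = values.index(best[name])
            for j in (idx - 1, idx + 1):
                if not 0 <= j < len(values):
                    continue  # neighbour would fall outside the grid
                candidate = {**best, name: values[j]}
                score = benchmark(candidate)
                if score > best_score:
                    best, best_score = candidate, score
                    improved = True
                    break  # accept the step, move on to the next parameter
        if not improved:
            break  # no single-step change helps any more
    return best, best_score


if __name__ == "__main__":
    start = {name: values[0] for name, values in GRID.items()}
    print(hill_climb(fake_benchmark, GRID, start))
```

A hill climb like this only finds a local optimum and assumes the benchmark is reasonably repeatable; with noisy measurements each candidate would need to be measured several times before a step is accepted.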