I'm using OpenCL with different implementations for some functions, and I don't see much of a speed difference between CUDA and OpenCL on the same code.
NVIDIA Maxwell GPUs require a different sieve implementation. If I write one, the GTX 970 should do 9-9.5 CPD and the GTX 980 up to 11 CPD, but it's not easy.

You're probably right, since you are the expert. My thinking is simply based on the fact that CUDA is more "native" on the NVIDIA platform, and miners using CUDA (e.g. ccminer) seem to be much faster than OpenCL miners on the X11/13/15 algorithms.