Should we roll the Lookup-Gap into kernel launch configurations?
How does T12x32/6 look to you? ;-)
No issues with the YAC wallet on Windows here, but mine does start horribly slowly on Linux (takes up to an hour). I pulled it from the official PPA repository for stable builds.
The autotune crashes on Windows with lookup gap seem to be caused by rising memory usage during the autotune process. For example, on my 780 Ti the driver crashes as soon as the "Memory Used" value shown in GPU-Z hits 3072 MB. I could fix this by adding a configurable "backoff" parameter given in percent. The default should be higher on Windows than on Linux, probably around 10% on Windows and 2% on Linux. Alternatively, I could also allow the backoff to be given in MB.
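To make the backoff idea concrete, here is a rough sketch of how such a margin could be applied to the free memory before autotune starts allocating. The names Backoff and usable_bytes are made up for illustration and do not exist in cudaminer. In real code the free amount would come from something like cudaMemGetInfo; here it is just passed in as a number.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Hypothetical settings, not an existing cudaminer option.
struct Backoff {
    unsigned percent;         // safety margin in percent of free memory
    std::uint64_t megabytes;  // absolute margin in MB; 0 means "use percent"
};

// Return how many bytes autotune may use after subtracting the margin.
// An absolute margin in MB, if given, takes precedence over the percentage.
std::uint64_t usable_bytes(std::uint64_t free_bytes, const Backoff& b) {
    assert(b.percent <= 100);
    const std::uint64_t margin = b.megabytes > 0
        ? b.megabytes * MiB
        : free_bytes / 100 * b.percent;
    // Never report a negative budget: clamp to zero instead.
    return margin >= free_bytes ? 0 : free_bytes - margin;
}
```

With the defaults suggested above, a call like usable_bytes(free_bytes, Backoff{10, 0}) would be the Windows case and Backoff{2, 0} the Linux case.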
For a very quick fix in the current source code, increase the loop bound 2 in this for loop in salsa_kernel.cu to something higher, e.g. 2*LOOKUP_GAP. That should fix auto-tuning when single-memory allocation is not enabled.
for (int i=0; warp > 0 && i < 2; ++i) {
    warp--;
    checkCudaErrors(cudaFree(h_V[thr_id][warp]-h_V_extra[thr_id][warp]));
    h_V[thr_id][warp] = NULL; h_V_extra[thr_id][warp] = 0;
}
UPDATE: I also find that CUDA sometimes kills the autotuning process with the error message "the launch timed out and was terminated". This might be fixed by auto-tuning with a smaller batchsize (-b) parameter, e.g. 1024. CUDA has a watchdog timer that kills kernel calls that take longer than 5 seconds, to avoid a permanent display freeze when some computation gets stuck.
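Why a smaller batchsize helps can be sketched like this, assuming that -b caps the number of scrypt iterations done in one kernel launch. Launch and plan_launches are hypothetical names for illustration only. The same total work is simply cut into more, shorter launches, so each one has a better chance of finishing before the watchdog fires.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One kernel launch covering iterations [first, first + count).
struct Launch {
    std::uint64_t first;
    std::uint64_t count;
};

// Split `total` iterations into launches of at most `batchsize` iterations.
std::vector<Launch> plan_launches(std::uint64_t total, std::uint64_t batchsize) {
    assert(batchsize > 0);
    std::vector<Launch> plan;
    for (std::uint64_t first = 0; first < total; first += batchsize) {
        const std::uint64_t left = total - first;
        // The last launch may be shorter than a full batch.
        plan.push_back({first, left < batchsize ? left : batchsize});
    }
    return plan;
}
```

For example, 4096 iterations with a batchsize of 1024 would give four launches of 1024 iterations each instead of one long launch.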
I am also considering allowing devices to be specified by name, as in the following example, because whenever I swap cards around on my mainboards, CUDA shuffles all the device IDs, which is annoying. The name strings, however, would keep working as is unless you remove the card with the given name.
-d "GT 640, GTX 780 Ti, GTX 660 Ti, GTX 660 Ti#2"
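A minimal sketch of how such a name list could be resolved to device IDs, assuming that "#2" means "the second card with that name" and that names must match exactly. resolve_device and resolve_devices are hypothetical helpers, not existing cudaminer code. The list of names is passed in here; in a real build it would be filled from cudaGetDeviceProperties, whose name strings may carry a prefix such as "GeForce", so looser matching might be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Strip leading and trailing spaces or tabs.
static std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Resolve one entry such as "GTX 660 Ti#2" to an index into `names`.
// Returns -1 if there is no such card.
int resolve_device(const std::string& entry, const std::vector<std::string>& names) {
    std::string name = trim(entry);
    int wanted = 1;  // without a "#n" suffix, take the first card of that name
    const std::size_t hash = name.rfind('#');
    if (hash != std::string::npos) {
        const std::string suffix = name.substr(hash + 1);
        const bool digits = !suffix.empty() && suffix.size() <= 4 &&
            suffix.find_first_not_of("0123456789") == std::string::npos;
        if (digits) {
            wanted = std::stoi(suffix);
            name = trim(name.substr(0, hash));
        }
    }
    int seen = 0;
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i] == name && ++seen == wanted) return static_cast<int>(i);
    }
    return -1;
}

// Resolve a comma separated list of names to device IDs, in the order given.
std::vector<int> resolve_devices(const std::string& arg, const std::vector<std::string>& names) {
    std::vector<int> ids;
    std::size_t start = 0;
    while (start <= arg.size()) {
        std::size_t comma = arg.find(',', start);
        if (comma == std::string::npos) comma = arg.size();
        ids.push_back(resolve_device(arg.substr(start, comma - start), names));
        start = comma + 1;
    }
    assert(!ids.empty());  // even an empty argument yields one (unresolved) entry
    return ids;
}
```

The point of the design is that the result depends only on the names and their relative order, so swapping unrelated cards between slots would not change which physical card an entry refers to.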
Christian