Good, but this
> - added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()
may actually hurt performance, in theory. If the kernel uses more registers per thread, at least some GPUs won't be able to keep as many threads running concurrently as before, and that can cost more than the saved t1W() call gains.
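
To put rough numbers on that, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the register file size, the thread cap, the scheduling granularity and the per-thread register counts are made-up values, not figures for any particular GPU or for this kernel. The function name `resident_threads` is hypothetical too.

```python
def resident_threads(regs_per_thread, regfile_size=16384,
                     max_threads=1024, granularity=64):
    """Upper bound on threads one compute unit can keep resident.

    All defaults are hypothetical; real limits depend on the GPU.
    """
    # How many threads fit if registers are the only constraint.
    by_registers = regfile_size // regs_per_thread
    # Threads are scheduled in whole groups, so round down to a multiple.
    by_registers -= by_registers % granularity
    # The hardware thread cap applies regardless of register use.
    return min(max_threads, by_registers)


before = resident_threads(16)  # assumed register count without the extra variable
after = resident_threads(17)   # assumed register count with one more register
print(before, after)
```

With these made-up numbers, one extra register per thread is enough to drop below the thread cap. Whether that happens for real depends on whether the compiler actually allocates another register and on how close the kernel already is to the limit, so it's worth benchmarking both variants.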