Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt BIOS),
from 290.2 to 307 Kh/s at 800/1500 core/mem clocks on Ubuntu 12.04 with stock drivers.
I didn't try it on my R9 280X cards, so please post your results if you try this.
You will have to modify the kernel as shown in the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of ChaCha and Salsa in global memory.
The change below reduces stalls/latency by avoiding reads and writes to the same or adjacent memory banks.
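To make the difference between the two layouts concrete, here is a small host-side C sketch (not kernel code) of the byte offsets that the original and the modified indexing touch. The function names and the SLOT_BYTES constant are made up for this illustration; the only facts used are that an OpenCL ulong16 is 16 x 8 = 128 bytes and the index expressions from the code below.

```c
#include <stddef.h>

enum { SLOT_BYTES = 128 };  /* sizeof(ulong16) in OpenCL: 16 lanes x 8 bytes */

/* Original indexing: V[idx << 1] and V[(idx << 1) + 1].
   half is 0 for X[0] and 1 for X[1]. Hypothetical helper. */
size_t interleaved_offset(unsigned idx, unsigned half)
{
    return (size_t)((idx << 1) + half) * SLOT_BYTES;
}

/* Modified indexing: V[idx] and V[idx + 128].
   All first halves come first, then all second halves. Hypothetical helper. */
size_t split_offset(unsigned idx, unsigned half)
{
    return (size_t)(idx + 128u * half) * SLOT_BYTES;
}
```

With the original indexing the two halves of one entry are 128 bytes apart and consecutive entries are 256 bytes apart. With the modified indexing the two halves are 16 KiB apart and consecutive first halves are packed back to back. The sketch only shows the addresses; it cannot tell you which of these differences the hardware actually rewards.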
Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
    ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
    ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
    ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
    ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}
To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
    ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
    ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
    ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
    ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}
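If you want to convince yourself that the new indexing is safe before flashing it into your miner, a quick host-side C check like the one below may help. It is only a sketch under one assumption: idx stays in 0..127, which is what the 128 intermediate results imply. The function name and constants are made up for illustration. It confirms that idx and idx + 128 together hit each of the 256 ulong16 slots exactly once, the same set the original idx << 1 scheme uses, so the region addressed through V stays 256 x 128 = 32768 bytes.

```c
#include <stdbool.h>
#include <string.h>

enum { ENTRIES = 128, SLOTS = 2 * ENTRIES };  /* assumption: idx is 0..127 */

/* Returns true if the split layout (idx, idx + 128) touches every slot
   0..255 exactly once. Hypothetical helper, not part of the kernel. */
bool split_layout_covers_all_slots(void)
{
    unsigned char seen[SLOTS];
    memset(seen, 0, sizeof seen);

    for (unsigned n = 0; n < ENTRIES; ++n) {
        unsigned char idx = (unsigned char)n;  /* mirrors the kernel's uchar idx */
        int lo = idx;
        int hi = idx + 128;  /* uchar promotes to int, so this cannot wrap */

        if (hi >= SLOTS || seen[lo] || seen[hi])
            return false;    /* out of range or two entries collide */
        seen[lo] = 1;
        seen[hi] = 1;
    }

    for (int s = 0; s < SLOTS; ++s)
        if (!seen[s])
            return false;    /* a slot was never written */
    return true;
}
```

The same integer promotion applies in OpenCL C, so idx + 128 in the kernel is computed as an int even though idx is declared uchar.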
Thanks, this took me from 317 to 324 on a 290X.