Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307Kh/s at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didn't try it on my R9 280x cards, so please post your results if you try this.
You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
The change below reduces stalls and latency by avoiding back-to-back reads and writes to the same or adjacent memory banks.
Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
    ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
    ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
    ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
    ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}
To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
    ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
    ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
    ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
    ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}
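If you want to sanity-check the index change before rebuilding the kernel, here is a rough host-side sketch in plain C (not kernel code). It assumes 128 scratchpad entries per work-item, as in the code above, and 128 bytes per ulong16. The helper names (interleaved_slot, split_slot, layout_is_permutation, half_distance_bytes) are made up for illustration and are not part of any miner source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_BYTES 128u   /* sizeof(ulong16) in OpenCL: 16 x 8 bytes */
#define ENTRIES    128u   /* intermediate results per work-item (assumed from the post) */

/* Original layout: both halves of entry idx sit next to each other. */
static size_t interleaved_slot(unsigned idx, unsigned half)
{
    return ((size_t)idx << 1) + half;
}

/* Modified layout: all first halves, followed by all second halves. */
static size_t split_slot(unsigned idx, unsigned half)
{
    return (size_t)idx + (size_t)half * ENTRIES;
}

/* Returns 1 if the chosen layout hits each of the 256 slots exactly once. */
static int layout_is_permutation(int use_split)
{
    unsigned char seen[2 * ENTRIES];
    memset(seen, 0, sizeof seen);
    for (unsigned idx = 0; idx < ENTRIES; idx++) {
        for (unsigned half = 0; half < 2; half++) {
            size_t slot = use_split ? split_slot(idx, half)
                                    : interleaved_slot(idx, half);
            assert(slot < 2 * ENTRIES);   /* stays inside the scratchpad */
            if (seen[slot]++)
                return 0;                 /* two halves collided */
        }
    }
    return 1;
}

/* Byte distance between the two halves of a single entry. */
static size_t half_distance_bytes(int use_split)
{
    size_t a = use_split ? split_slot(0, 0) : interleaved_slot(0, 0);
    size_t b = use_split ? split_slot(0, 1) : interleaved_slot(0, 1);
    return (b - a) * SLOT_BYTES;
}
```

Both layouts use the same 256 slots, so the scratchpad size does not change; only the distance between the two halves of an entry grows from 128 bytes to 16 KB. Store and Mix must use the same layout, so change both functions together. Also, idx is a uchar, but idx + 128 is promoted to int in C, so the index does not wrap.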
Not working well with Wolf0's Hawaii mod. Hash rate dropped from 339 Kh/s to 320 Kh/s.