Re: [ANN] cudaMiner - a new litecoin mining application [Windows/Linux]

Quote from: cbuchner1 on December 22, 2013, 08:47:04 PM

Quote from: dga on December 22, 2013, 03:28:25 PM

Semi-idle question: Is there community interest in sponsoring some more optimization of the cudaminer code for Kepler GK104 and GK110-based cards?

In all honesty, I think I threw my best optimization ideas at the version most of you are already running. *grins* But there's probably another 5% here and there, which could translate into, say, at least a 10-20 kHash/s boost on faster cards.

-Dave

You could also optimize on the low-end GT 640 (GDDR5 version). It has Compute 3.5, does around 105 kHash/s with OC, and I have generally found its results to scale up pretty linearly with the number of SMX. I.e., scaling it up to 12 SMX (like the GTX 780) yields some 630 kHash/s, which people actually seem to be hitting when overclocking their devices.
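For anyone who wants to redo that extrapolation for their own card, here is a minimal sketch of the arithmetic. It assumes hash rate is strictly proportional to SMX count (clock and memory differences are ignored) and that the GT 640 has 2 SMX, as the "also 2 SMX" remark in the post implies. The function name `scale_hashrate` is made up for this illustration.

```python
def scale_hashrate(measured_khs, measured_smx, target_smx):
    """Linear extrapolation: assume hash rate is proportional to SMX count."""
    return measured_khs / measured_smx * target_smx

gt640_khs = 105.0   # overclocked GT 640 (GDDR5), figure quoted in the post
gt640_smx = 2       # assumption: 2 SMX, per the "also 2 SMX" remark
gtx780_smx = 12     # SMX count given in the post for the GTX 780

estimate = scale_hashrate(gt640_khs, gt640_smx, gtx780_smx)
print(f"Estimated GTX 780 rate: {estimate:.0f} kHash/s")
```

This gives the same 630 kHash/s figure quoted above. It is only a rough upper-level estimate, since real cards differ in clocks and memory bandwidth.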

EDIT: I did some profiling with the CUDA 5.5 Visual Profiler on a Compute 3.0 device recently (also 2 SMX, a laptop part). I found that the arithmetic units were pretty maxed out. It also showed 80% efficiency in the instruction scheduler, meaning that the dual-issue feature in each SMX's four warp schedulers was pretty nicely utilized. The occupancy on each SMX was 100%, which is perfect. Memory accesses were fully coalesced 128-byte transactions. Can't get any better than this.

Ooh. Good idea for the cheaper device - thank you.

Re "can't get any better" - that's the other reason I was thinking about grubbing for help. I'm guessing that the remainder of the optimization is going to be ugly. I've been staring at, e.g., the cuobjdump assembly output and the instruction throughput tables, trying to figure out if there are ways to improve it (nothing obvious). And, as you note, 80% instruction scheduling efficiency is already quite high. Doubling up keys in a clever way might get that to 90%, but probably at the cost of unacceptable register pressure. I tried it once and threw away the code, but there are a few other ways to imagine doing it.
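To put a rough number on the register-pressure worry, here is a small sketch, not taken from the cudaminer code. It assumes NVIDIA's published Kepler limits of 65,536 32-bit registers and 2,048 resident threads per SMX, treats registers as the only limiting resource, and ignores allocation granularity and block size. The register counts in the loop are hypothetical, not measured from the actual kernel.

```python
REGS_PER_SMX = 65536         # 32-bit registers per Kepler SMX (assumed from published specs)
MAX_THREADS_PER_SMX = 2048   # 64 resident warps of 32 threads each
WARP_SIZE = 32

def occupancy(regs_per_thread):
    """Upper bound on occupancy when registers are the only limit."""
    threads = min(MAX_THREADS_PER_SMX, REGS_PER_SMX // regs_per_thread)
    threads -= threads % WARP_SIZE   # only whole warps can be resident
    return threads / MAX_THREADS_PER_SMX

# Hypothetical per-thread register counts, for illustration only.
for regs in (32, 48, 64):
    print(f"{regs} regs/thread -> {occupancy(regs):.0%} occupancy")
```

Under these assumptions, 100% occupancy leaves at most 32 registers per thread, so a variant that needed twice as many registers would cap occupancy at about half. On Compute 3.0 parts the per-thread limit of 63 registers would likely force spills before that point.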

It's really hard to beat the raw number of ALUs those AMD devices have when the code is as trivially parallel as brute-force hashing.