Re: [ANN] ETHASH Miner - Eminer v0.6.1-RC2 Released (Windows, Linux, MacOS)

any update soon?

I'm working on a new update and will announce it soon.

You're doing better with keeping your IP hard(er) to get at - nice work. Might want to use C instead of Go so you can put in some anti-debugging traps, too.

Also... in your "Optimized 4 threads" OpenCL... I know this won't make a huge difference, but it still makes me sad:

Code:

bool update_share = thread_id == ((a >> 3) % THREADS);

Were that interpreted how you wrote it (verbatim), you would be eating... 4 clocks for the shift, and an untold number of clocks for the modulo, as AMD GCN doesn't even have modulo implemented - it's emulated. In reality, it's going to optimize that to an AND, making it look more like this:

Code:

bool update_share = thread_id == ((a >> 3) & (THREADS - 1));

Okay, that's 8 clocks - two full-rate instructions. Unless you use a local worksize that's not a power of two, and in that case, god help you. But we can do better.
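If you want to convince yourself the two forms really are the same thing, here's a minimal host-side sketch in plain C (not kernel code). THREADS is hard-coded to 4 on the assumption that it matches the 4-way kernel, and the function names are made up for illustration.

Code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for the kernel's THREADS constant (4-way kernel). */
#define THREADS 4u

/* The expression as written in the kernel: shift, then modulo. */
static uint32_t lane_by_modulo(uint32_t a) {
    return (a >> 3) % THREADS;
}

/* What it reduces to when THREADS is a power of two: shift, then AND. */
static uint32_t lane_by_mask(uint32_t a) {
    return (a >> 3) & (THREADS - 1u);
}

/* Returns true when both forms agree for every a in [0, limit). */
static bool forms_agree(uint32_t limit) {
    /* The mask trick is only valid for a non-zero power-of-two THREADS. */
    assert(THREADS != 0u && (THREADS & (THREADS - 1u)) == 0u);
    for (uint32_t a = 0; a < limit; ++a) {
        if (lane_by_modulo(a) != lane_by_mask(a)) {
            return false;
        }
    }
    return true;
}
```

Set THREADS to something like 3 and the precondition fails, which is exactly the case where the compiler can't swap the modulo for an AND.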

First, for AMD - you're using LDS. Any worksize above wavefront size (64 work-items) is going to make those barriers take a LOT longer. Reason being, an AMD GCN CU (Compute Unit) only executes wavefronts. Larger work groupings are cut into wavefront-sized pieces and executed that way. If you use an LDS barrier with a worksize of 64 (wavefront size) - big deal, the whole wavefront is handled by that CU, and as such, the barriers can actually be omitted! And they are, by the AMD OCL compiler, in that case. When you make your group larger - suddenly you need to sync with other execution units: this is going to make you very sad. So - at least for AMD - I would keep it to 64 work-items/local group.
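To put numbers on the "cut into wavefront-sized pieces" part, here's a rough host-side sketch in plain C. It only does the arithmetic: the helper names are hypothetical, and whether the barrier actually gets dropped is up to the AMD compiler, not this code.

Code:

```c
#include <stdbool.h>
#include <stdint.h>

/* GCN wavefront size in work-items, as stated above. */
#define GCN_WAVEFRONT_SIZE 64u

/* Number of wavefronts a work-group of local_size work-items is split into. */
static uint32_t wavefronts_per_group(uint32_t local_size) {
    return (local_size + GCN_WAVEFRONT_SIZE - 1u) / GCN_WAVEFRONT_SIZE;
}

/* True when the whole work-group fits in a single wavefront, i.e. the
   case where no cross-wavefront synchronisation is needed. */
static bool group_fits_one_wavefront(uint32_t local_size) {
    return local_size != 0u && wavefronts_per_group(local_size) == 1u;
}
```

A local size of 64 gives one wavefront. A local size of 256 gives four, and every barrier then has to wait on all four.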

Now - we know the value of THREADS - the kernel says it's a 4-way - so... get you one of these...

Code:

#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

... and suddenly you have access to the amd_bfe builtin, which is purpose-built for what you wanna do! It gives you direct access to the v_bfe_u32 (bit field extract) instruction. And this gives us:

Code:

bool update_share = thread_id == amd_bfe(a, 3U, 2U);

Shaving at LEAST four ticks off the best the compiler would likely come up with (and I am being generous.) As it's looped, this magnifies the effect of the optimization, although it's still not much. Now you have the operation done in a single full-rate instruction, though - which IS an improvement.
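If you don't have an AMD card handy to check against, here's a plain C model of the unsigned bit-field extract, written from my reading of the cl_amd_media_ops2 description (offset and width are each masked to 5 bits). Treat it as a sketch: bfe_model and the other names are made up, and it is not the builtin itself.

Code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of an unsigned bit-field extract: returns `width` bits
   of `src` starting at bit `offset`. Assumed to mirror amd_bfe's behaviour. */
static uint32_t bfe_model(uint32_t src, uint32_t offset, uint32_t width) {
    offset &= 31u;
    width &= 31u;
    if (width == 0u) {
        return 0u;
    }
    if (offset + width < 32u) {
        /* Push the field to the top of the word, then pull it back down. */
        return (src << (32u - offset - width)) >> (32u - width);
    }
    /* Field runs up to or past bit 31: a plain shift is enough. */
    return src >> offset;
}

/* The shift-and-AND form for a 4-way kernel (2-bit field at bit 3). */
static uint32_t shift_and_mask(uint32_t a) {
    return (a >> 3) & 3u;
}

/* Returns true when bfe_model(a, 3, 2) matches shift_and_mask(a)
   for every a in [0, limit). */
static bool bfe_matches_mask(uint32_t limit) {
    for (uint32_t a = 0; a < limit; ++a) {
        if (bfe_model(a, 3u, 2u) != shift_and_mask(a)) {
            return false;
        }
    }
    return true;
}
```

The offset of 3 is the shift amount, and the width of 2 is log2(THREADS) for a 4-way kernel.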

WOW! That looks like a ton of work for me. @Wolf, thanks for the pointers. I'll follow up on these and share the results here.