Search Posts

Post

Topic

Board Mining (Altcoins)

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 15/09/2023, 05:37:25 UTC

Quote from: JayDDee on September 14, 2023, 07:55:40 PM

It looks like you're comparing the original code with -ftree-vectorize to "hand coded" with -ftree-vectorize. That doesn't prove anything about -ftree-vectorize.
You need to test the same code with and without vectorization. Did your hand-coded version actually use parallel SIMD Salsa on the data "arranged in lanes"?

OK, the original driver code looks like this:

Code: ("main.c")

#include <stdio.h>   /* puts, printf */
#include <time.h>    /* clock, CLOCKS_PER_SEC */
#include "salsa.h"   /* not shown; assumed to provide uint, ROTL32 and abin2hex */
void salsa(uint *X, uint rounds);

int main(int argc, char **argv) {
    uint X[16];
    const int rounds = 1024*1024;
    clock_t start, end;
    double cpu_time_used;

    /* fill the state with a simple known pattern */
    for(int i=0; i<16; i++)
        X[i] = i;

    puts(abin2hex((unsigned char *) X, 4*16));
    start = clock();
    salsa(X,rounds);
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("cputime %g\n", cpu_time_used);

    return 0;
}

/* Salsa20, rounds must be a multiple of 2 */
void __attribute__ ((noinline)) salsa(uint *X, uint rounds) {
    uint x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, t;

    x0 = X[0]; x1 = X[1]; x2 = X[2]; x3 = X[3];
    x4 = X[4]; x5 = X[5]; x6 = X[6]; x7 = X[7];
    x8 = X[8]; x9 = X[9]; x10 = X[10]; x11 = X[11];
    x12 = X[12]; x13 = X[13]; x14 = X[14]; x15 = X[15];

/* one quarter round; a non-zero v prints every intermediate value */
#define quarter(a, b, c, d, v) \
    t = a + d; if(v) printf("t: %u\n",t); \
    t = ROTL32(t, 7); if(v) printf("t: %u\n",t); \
    b ^= t; if(v) printf("b: %u\n",b); \
    t = b + a; if(v) printf("t: %u\n",t); \
    t = ROTL32(t, 9); if(v) printf("t: %u\n",t); \
    c ^= t; if(v) printf("c: %u\n",c); \
    t = c + b; if(v) printf("t: %u\n",t); \
    t = ROTL32(t, 13); if(v) printf("t: %u\n",t); \
    d ^= t; if(v) printf("d: %u\n",d); \
    t = d + c; if(v) printf("t: %u\n",t); \
    t = ROTL32(t, 18); if(v) printf("t: %u\n",t); \
    a ^= t; if(v) printf("a: %u\n",a);

    int v = 0;
    for(; rounds; rounds -= 2) {
        /* column round */
        quarter( x0, x4, x8, x12, v);
        quarter( x5, x9, x13, x1, v);
        quarter(x10, x14, x2, x6, v);
        quarter(x15, x3, x7, x11, v);
        /* row round */
        quarter( x0, x1, x2, x3, v);
        quarter( x5, x6, x7, x4, v);
        quarter(x10, x11, x8, x9, v);
        quarter(x15, x12, x13, x14, v);
    }

    X[0] += x0; X[1] += x1; X[2] += x2; X[3] += x3;
    X[4] += x4; X[5] += x5; X[6] += x6; X[7] += x7;
    X[8] += x8; X[9] += x9; X[10] += x10; X[11] += x11;
    X[12] += x12; X[13] += x13; X[14] += x14; X[15] += x15;

#undef quarter
}
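
The driver includes salsa.h, which isn't shown in the post. A minimal stand-in for two of the helpers it uses might look like the sketch below. The definitions of ROTL32 and abin2hex here are assumptions inferred from how main.c calls them, not the actual header, and uint32_t is used in place of the uint typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed helper: rotate a 32-bit value left by n bits, 0 < n < 32. */
#define ROTL32(x, n) (((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n))))

/* Assumed helper: return a freshly malloc'd lowercase hex string for the
 * len bytes at p, or NULL if allocation fails. The caller frees it. */
char *abin2hex(const unsigned char *p, size_t len) {
    static const char digits[] = "0123456789abcdef";
    assert(p != NULL || len == 0);
    char *s = malloc(2 * len + 1);
    if (s == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        s[2 * i]     = digits[p[i] >> 4];    /* high nibble */
        s[2 * i + 1] = digits[p[i] & 0x0f];  /* low nibble */
    }
    s[2 * len] = '\0';
    return s;
}
```

Note that main.c passes the result straight to puts(), so with a malloc-based helper like this one the string is never freed; that is harmless in a one-shot benchmark.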

The driver runs 1024 x 1024 = 1048576 rounds.
Without optimization, i.e. no -O flags:

Code:

cputime 0.187971
cputime 0.231245
cputime 0.187873

~ 202.363 ms on average for those 1048576 rounds

With optimization -O2 but no -ftree-vectorize:

Code:

cputime 0.011749
cputime 0.011733
cputime 0.025701

~ 16.394 ms on average for those 1048576 rounds
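
Since individual runs vary quite a bit (the third -O2 run is about twice as slow as the other two), it may help to report the best run next to the mean. The sketch below is one way to do that with clock(); the names bench_stats, summarize and bench are made up for illustration and are not part of the driver above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAX_RUNS 64   /* arbitrary upper bound for this sketch */

typedef struct {
    double best;   /* fastest run, in seconds */
    double mean;   /* arithmetic mean, in seconds */
} bench_stats;

/* Reduce n timings (in seconds) to their minimum and mean. */
bench_stats summarize(const double *t, int n) {
    assert(t != NULL && n > 0);
    bench_stats s = { t[0], 0.0 };
    for (int i = 0; i < n; i++) {
        if (t[i] < s.best)
            s.best = t[i];
        s.mean += t[i];
    }
    s.mean /= n;
    return s;
}

/* Call fn(arg) `runs` times, timing each call with clock(). */
bench_stats bench(void (*fn)(void *), void *arg, int runs) {
    double t[MAX_RUNS];
    assert(fn != NULL && runs > 0 && runs <= MAX_RUNS);
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        fn(arg);
        t[i] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return summarize(t, runs);
}
```

Fed the three -O2 timings quoted above, summarize() gives a mean of about 16.39 ms but a best run of 11.73 ms, which shows how much a single slow run pulls the average up.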

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 14/09/2023, 16:02:23 UTC

I did one more experiment, implementing salsa20 with NEON intrinsics
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
'hand optimized' but naive.
I ran salsa20 for a million loops each way:

the original C code with gcc's -ftree-vectorize did 1 million loops in 11 ms

the rearranged arrays with -ftree-vectorize did 1 million loops in 80 ms (this varies depending on the cache, and the 2nd run tends to be faster)
this version writes the arrays out to memory during the permutation, so there are lots of wait states and stalls.

and the naive 'hand optimized' salsa20 with NEON intrinsics takes 59 ms for 1 million loops

This kind of means that it isn't true that -ftree-vectorize is slow, in that it sometimes beats 'naive' hand-optimized code. It is probably close to the best optimized code, but the generated assembly is practically unreadable: generated by machines for machines.
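
For readers who haven't used the intrinsics, here is a rough sketch of one Salsa20 quarter round applied to four lanes at once. It is not the code that was benchmarked above. The function name quarter4 and the lane arrays are made up for illustration, and a plain C fallback is included so that it also builds where __ARM_NEON is not defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Rotate each 32-bit lane left by the compile-time constant n. */
#define ROTLQ(v, n) vorrq_u32(vshlq_n_u32((v), (n)), vshrq_n_u32((v), 32 - (n)))

/* Four independent quarter rounds, one per lane. */
void quarter4(uint32_t a[4], uint32_t b[4], uint32_t c[4], uint32_t d[4]) {
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
    uint32x4_t va = vld1q_u32(a), vb = vld1q_u32(b);
    uint32x4_t vc = vld1q_u32(c), vd = vld1q_u32(d);
    uint32x4_t t;

    t = vaddq_u32(va, vd); vb = veorq_u32(vb, ROTLQ(t, 7));
    t = vaddq_u32(vb, va); vc = veorq_u32(vc, ROTLQ(t, 9));
    t = vaddq_u32(vc, vb); vd = veorq_u32(vd, ROTLQ(t, 13));
    t = vaddq_u32(vd, vc); va = veorq_u32(va, ROTLQ(t, 18));

    vst1q_u32(a, va); vst1q_u32(b, vb);
    vst1q_u32(c, vc); vst1q_u32(d, vd);
}
#else
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Portable fallback: the same four quarter rounds in a scalar loop. */
void quarter4(uint32_t a[4], uint32_t b[4], uint32_t c[4], uint32_t d[4]) {
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
    for (int i = 0; i < 4; i++) {
        b[i] ^= ROTL32(a[i] + d[i], 7);
        c[i] ^= ROTL32(b[i] + a[i], 9);
        d[i] ^= ROTL32(c[i] + b[i], 13);
        a[i] ^= ROTL32(d[i] + c[i], 18);
    }
}
#endif
```

Loading and storing all four vectors on every call is deliberately naive; a tuned version would keep them in registers across rounds to avoid that memory traffic.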

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 12/09/2023, 18:22:32 UTC

JayDDee,
Thanks for your comments. I'd leave it at that for now, as many others would be reading this thread.
Rather, it is correct to say that -ftree-vectorize is not a 'miracle pill', but in some cases one may observe an edge. It actually isn't that much more; I think that 20% gain was not rigorously measured.
i.e. expect smaller gains in practice from -ftree-vectorize. Given my experiments, it is likely there are 'some' gains, but not a lot.

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

ag1233

on 12/09/2023, 17:58:47 UTC

hi ghostlander,
are you still monitoring this thread?

hi all,

oops, I've posted my comment in the 'wrong' thread
https://bitcointalk.org/index.php?topic=55038.msg62832733#msg62832733
reposting here

Recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64).
I'm using somewhat older code (version 2.4) from
https://github.com/ghostlander/cpuminer-neoscrypt

I couldn't figure out how to get it to build with the ARM assembly code; apparently scrypt-arm.S was written for armhf (32-bit ARM processors).
Hence, there could be issues compiling it on aarch64 (64-bit ARM instructions and OS).

Among the things I tried, I added the following flags
Code:
-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked by compiling with the "-S" option, which makes GCC emit assembly; with the set of flags above, GCC generates assembly that uses NEON SIMD.
This is without any hand-optimized assembly. It may still produce NEON assembly with fewer flags (e.g. possibly without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags are missing, NEON assembly isn't generated.

I tried with and without the above flags (e.g. just -O2), and there is at least a slight difference in hash rates: from about 4 khash per sec on all 4 cores doing Neoscrypt (mining Feathercoin) to about 5+ khash per sec with the NEON code generated by GCC's -ftree-vectorize vectorizer, some 20-30% improvement. The cpu also runs hotter during mining along with the higher hash rates, which suggests the cores are being kept busier. This is probably a useful thing to have around, as manually writing hand-optimized assembly, e.g. for scrypt-arm.S, would likely take a lot of effort and is likely less portable. Granted, -ftree-vectorize won't make the fastest code, but the improvement is decent with much less manual effort than writing optimized assembly.

Note that NEON code may not work on ARM cpus that don't support NEON; I think I chanced upon some specs saying that on A53 cpus the SIMD extension is possibly *optional*.
e.g. it is quite possible that some A53s in the wild, e.g. the 'cheap' ones, may not have NEON, even though they are A53 cpus.
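
If that is a concern, a build could check for NEON before relying on it. Below is a minimal sketch, assuming Linux on aarch64 for the runtime check: __ARM_NEON is the ACLE compile-time macro, getauxval(AT_HWCAP) is the usual way to query CPU features at runtime, and the function and enum names are made up for illustration.

```c
#include <assert.h>
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP; HWCAP_ASIMD on aarch64 */
#endif

enum impl { IMPL_SCALAR, IMPL_NEON };

/* 1 if this file was compiled with NEON / Advanced SIMD enabled. */
int neon_compiled_in(void) {
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* 1 = kernel reports Advanced SIMD, 0 = not reported, -1 = cannot tell. */
int neon_at_runtime(void) {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ASIMD)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) ? 1 : 0;
#else
    return -1;
#endif
}

/* Pick the NEON path only when it was built in and confirmed at runtime. */
enum impl choose_impl(void) {
    int rt = neon_at_runtime();
    assert(rt >= -1 && rt <= 1);
    return (neon_compiled_in() && rt == 1) ? IMPL_NEON : IMPL_SCALAR;
}
```

Treating "cannot tell" as scalar is a conservative choice for this sketch; a miner that only targets aarch64 Linux could simply refuse to start when the check returns 0.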

It used to be that Raspberry Pis were deemed 'too slow' for mining, but the Raspberry Pi 4 with A72 ARM cores is just borderline and 'punches above its weight' to mine alongside the big Mhash-per-second gpus; the difference is easily 1:1000 though.

--
By just using the flags mentioned, gcc builds binaries with NEON SIMD via the -ftree-vectorize flag, along with the other flags, as otherwise it doesn't turn on SIMD code.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and it isn't too bad, as I've described.

off-topic:
Just to add a note, I tried to 'hand optimize' it by making a C source where I re-arranged the C arrays in salsa20 to fall into 'lanes', using the same -ftree-vectorize flags. However, instead of being faster, the original code is optimized better, even though it actually uses fewer NEON SIMD instructions. I looked closer at the generated NEON SIMD code, and I think the problem is that simply 're-arranging' the arrays won't cut it: between the iterations/loops the array is permuted, so it gets streamed out to memory (a bummer, I'd think lots of stalls) and then loaded back from memory into a differently permuted set of registers.
With the original code there is actually less SIMD. It seems -ftree-vectorize and the other optimizations simply used the normal registers for part of the code and passed them into SIMD registers for some sections, and that in itself is faster than the 'rearranged array' code.
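
To make the 'lanes' idea concrete, here is a rough plain-C sketch of a lane arrangement. It is not the rearranged source that was benchmarked; salsa_lanes, rotate_lanes and the other names are made up for illustration. The state is held as four diagonals of four lanes each, and the rotate_lanes calls are the permutation between the column and row rounds that causes the trouble described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Scalar reference, same round structure as the original listing. */
#define QR(a, b, c, d) do { \
    b ^= ROTL32(a + d, 7);  c ^= ROTL32(b + a, 9); \
    d ^= ROTL32(c + b, 13); a ^= ROTL32(d + c, 18); } while (0)

void salsa_scalar(uint32_t X[16], unsigned rounds) {
    uint32_t x[16];
    assert(rounds % 2 == 0);
    memcpy(x, X, sizeof x);
    for (; rounds; rounds -= 2) {
        QR(x[0], x[4], x[8], x[12]);  QR(x[5], x[9], x[13], x[1]);
        QR(x[10], x[14], x[2], x[6]); QR(x[15], x[3], x[7], x[11]);
        QR(x[0], x[1], x[2], x[3]);   QR(x[5], x[6], x[7], x[4]);
        QR(x[10], x[11], x[8], x[9]); QR(x[15], x[12], x[13], x[14]);
    }
    for (int i = 0; i < 16; i++) X[i] += x[i];
}

/* One quarter-round step applied to four independent lanes. */
#define STEP(dst, s1, s2, n) \
    for (int i = 0; i < 4; i++) { \
        uint32_t t = s1[i] + s2[i]; \
        dst[i] ^= ROTL32(t, n); \
    }

static void quarter_lanes(uint32_t a[4], uint32_t b[4],
                          uint32_t c[4], uint32_t d[4]) {
    STEP(b, a, d, 7);
    STEP(c, b, a, 9);
    STEP(d, c, b, 13);
    STEP(a, d, c, 18);
}

/* Rotate the four lanes left by k positions (the permutation step). */
static void rotate_lanes(uint32_t v[4], int k) {
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++) tmp[i] = v[(i + k) & 3];
    memcpy(v, tmp, sizeof tmp);
}

/* Which state words make up lane 0..3 of each diagonal. */
static const int IA[4] = {0, 5, 10, 15}, IB[4] = {4, 9, 14, 3};
static const int IC[4] = {8, 13, 2, 7},  ID[4] = {12, 1, 6, 11};

void salsa_lanes(uint32_t X[16], unsigned rounds) {
    uint32_t a[4], b[4], c[4], d[4];
    assert(rounds % 2 == 0);
    for (int i = 0; i < 4; i++) {
        a[i] = X[IA[i]]; b[i] = X[IB[i]];
        c[i] = X[IC[i]]; d[i] = X[ID[i]];
    }
    for (; rounds; rounds -= 2) {
        quarter_lanes(a, b, c, d);   /* column round */
        /* re-align the lanes so the same code can do the row round */
        rotate_lanes(b, 3); rotate_lanes(c, 2); rotate_lanes(d, 1);
        quarter_lanes(a, d, c, b);   /* row round */
        /* undo the rotation for the next column round */
        rotate_lanes(b, 1); rotate_lanes(c, 2); rotate_lanes(d, 3);
    }
    for (int i = 0; i < 4; i++) {
        X[IA[i]] += a[i]; X[IB[i]] += b[i];
        X[IC[i]] += c[i]; X[ID[i]] += d[i];
    }
}

/* Returns 1 when both versions agree on the state X[i] = i. */
int lanes_match_scalar(unsigned rounds) {
    uint32_t A[16], B[16];
    for (int i = 0; i < 16; i++) A[i] = B[i] = (uint32_t)i;
    salsa_scalar(A, rounds);
    salsa_lanes(B, rounds);
    return memcmp(A, B, sizeof A) == 0;
}
```

In this form the rotations go through a temporary array, i.e. through memory, unless the compiler manages to turn them into register shuffles. That is the kind of traffic a hand-written version would try to replace with in-register lane rotations.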
--
There is a minor gain with -ftree-vectorize for Neoscrypt, as it spends a large number of loops in salsa20, 1000 x some 200 rounds?
Hence, NEON SIMD could potentially speed that up significantly. The trouble is that salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply re-arranging the arrays in the C code so that they look like they fall into 'lanes' (common in SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out into memory (lots of wait states and cpu stalls on the small Raspberry Pi type boards and cpus).
But hand-optimized assembly won't be easy to write and would take quite a lot of effort.
And the thing is, this won't be the only thing that needs to be optimized.

Hence, for now the 'easy' way is to simply use -ftree-vectorize with the other flags in the bundle, so that at least some form of NEON SIMD is achieved.
There is a decent gain of around 20% (for Neoscrypt) with vs without the compiler-generated SIMD code.

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 12/09/2023, 16:04:22 UTC

Thanks JayDDee, there is a minor gain with -ftree-vectorize for Neoscrypt, as it spends a large number of loops in salsa20, 1000 x some 200 rounds?
Hence, NEON SIMD could potentially speed that up significantly. The trouble is that salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply re-arranging the arrays in the C code so that they look like they fall into 'lanes' (common in SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out into memory (lots of wait states and cpu stalls on the small Raspberry Pi type boards and cpus).
But hand-optimized assembly won't be easy to write and would take quite a lot of effort.
And the thing is, this won't be the only thing that needs to be optimized.

Hence, for now the 'easy' way is to simply use -ftree-vectorize with the other flags in the bundle, so that at least some form of NEON SIMD is achieved.

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 12/09/2023, 14:59:40 UTC

Thanks pooler. I'm thinking you may want to add an 'option' for, or document, the flags mentioned. I think we'd leave the challenge of hand-optimizing part of that code to some other time, or to someone who may want to take up the challenge.

By just using the flags mentioned, gcc builds binaries with NEON SIMD via the -ftree-vectorize flag, along with the other flags, as otherwise it doesn't turn on SIMD code.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and it isn't too bad, as I've described.

Re: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

ag1233

on 12/09/2023, 09:53:43 UTC

hi pooler,
are you still monitoring this thread?

hi all,
Recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64).
I'm using somewhat older code (version 2.4) from
https://github.com/pooler/cpuminer

I couldn't figure out how to get it to build with the ARM assembly code; apparently scrypt-arm.S was written for armhf (32-bit ARM processors).
Hence, there could be issues compiling it on aarch64 (64-bit ARM instructions and OS).

Among the things I tried, I added the following flags

Code:

-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked by compiling with the "-S" option, which makes GCC emit assembly; with the set of flags above, GCC generates assembly that uses NEON SIMD.
This is without any hand-optimized assembly. It may still produce NEON assembly with fewer flags (e.g. possibly without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags are missing, NEON assembly isn't generated.

I tried with and without the above flags (e.g. just -O2), and there is at least a slight difference in hash rates: from about 4 khash per sec on all 4 cores doing Neoscrypt (mining Feathercoin) to about 5+ khash per sec with the NEON code generated by GCC's -ftree-vectorize vectorizer, some 20-30% improvement. The cpu also runs hotter during mining, which suggests the cores are being kept busier. This is probably a useful thing to have around, as manually writing hand-optimized assembly, e.g. for scrypt-arm.S, would likely take a lot of effort and is likely less portable. Granted, -ftree-vectorize won't make the fastest code, but the improvement is decent with much less manual effort than writing optimized assembly.

It used to be that Raspberry Pis were deemed 'too slow' for mining, but the Raspberry Pi 4 with A72 ARM cores is just borderline and 'punches above its weight' to mine alongside the big Mhash-per-second gpus; the difference is easily 1:1000 though.