The claimed average of "expected 2·w^(1/2) group operations" is not achievable with m = w^(1/2)/2 (the mean jump size from Pollard's articles). Empirical tests have shown that 2·w^(1/2) is achievable at m × 8. Please check in your implementation what the speed is if the mean/max step size is increased 8 times. thx
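For concreteness, a minimal sketch of the two candidate mean jump sizes (the value w = 2^31 is illustrative, chosen to match the test runs below):

```python
# Sketch: compare Pollard's classic mean jump size m = sqrt(w)/2 with the
# empirically suggested m * 8 (illustrative values for w = 2^31).
w = 2**31
m_classic = int(round((w ** 0.5) / 2))   # Pollard's choice, ~23170 for w = 2^31
m_scaled  = m_classic * 8                # 8x larger, as suggested above
print(m_classic, m_scaled)
```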
all norm after add
# get JmaxofSp
def getJmaxofSp(optimalmeanjumpsize, dS):
    if flag_verbose > 0:
        print('[optimal_mean_jumpsize] %s' % optimalmeanjumpsize)

    sumjumpsize = 0
    for i in range(1, len(dS)):
        #sumjumpsize  = (2**i)-1
        #sumjumpsize += 2**(i-1)
        sumjumpsize += dS[i-1]
        now_meanjumpsize = int(round(1.0*(sumjumpsize)/(i)))

        #next_meanjumpsize = int(round(1.0*(sumjumpsize+2**i)/(i+1)))
        next_meanjumpsize = int(round(1.0*(sumjumpsize+dS[i])/(i+1)))

        if flag_verbose > 1:
            print('[meanjumpsize#Sp[%d]] %s(now) <= %s(optimal) <= %s(next)' % (i, now_meanjumpsize, optimalmeanjumpsize, next_meanjumpsize))

        # stop at the first i whose mean jump size is nearest to the optimum
        if optimalmeanjumpsize - now_meanjumpsize <= next_meanjumpsize - optimalmeanjumpsize:
            if flag_verbose > 0:
                print('[meanjumpsize#Sp[%d]] %s(now) <= %s(optimal) <= %s(next)' % (i, now_meanjumpsize, optimalmeanjumpsize, next_meanjumpsize))
            if flag_verbose > 0:
                print('[JmaxofSp] Sp[%s]=%s nearer to optimal mean jumpsize of Sp set' % (i, now_meanjumpsize))
            return i

    print("\n[FATAL_ERROR] JmaxofSp not defined!\n"); exit(-1)
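A small self-contained usage sketch of the same selection loop (inlined so it runs on its own; the powers-of-two jump set and w = 2^31 are hypothetical example parameters):

```python
# Inline equivalent of getJmaxofSp(): pick the first index i such that the
# mean of the first i jump sizes is nearest to the optimal mean jump size.
# (Hypothetical parameters: dS = powers of two, w = 2^31.)
dS = [2**i for i in range(64)]            # candidate jump sizes 1, 2, 4, ...
w = 2**31
optimal = int(round((w ** 0.5) / 2))      # optimal mean jump size ~ sqrt(w)/2

Jmax = None
sumjump = 0
for i in range(1, len(dS)):
    sumjump += dS[i - 1]
    now_mean = int(round(1.0 * sumjump / i))
    next_mean = int(round(1.0 * (sumjump + dS[i]) / (i + 1)))
    if optimal - now_mean <= next_mean - optimal:
        Jmax = i                          # mean of Sp[1..i] is nearest to optimal
        break
print(Jmax)
```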
#################
about speed
We have updated the OP with our latest results. Here is a summary of the results of runs for tests 0 through 49:...
That is already a good result!
no, absolutely not
August 26, 2019
pkey Standard OpenSSL library secp256k1 (S 8X32, F 10X26)
bits hops/second test run time hops/second test run time
==== =========================== ===========================
31 1942.378857 28.4496 secs 5520.634781 10.0240 secs
Standard OpenSSL library averaged 1897 hops per second
secp256k1 (S 8X32, F 10X26) averaged 5258 hops per second
September 09, 2019 (secp256k1)
test time hps hops
---- -------------- ------- ----------
31 9.2328 secs 11,884 109705
already better, but..
that is very slow for single-core C/C++ with openssl/secp256k1; it should be much faster!
(even if you built the early classic 1-Wild algorithm at 3.28·w^(1/2), that would not explain a 10x speed drop)
We were just trying to quickly get everything to work so we have not optimized anything yet.
ok, we will wait and hope
see for comparison:
Algo: 1 Tame + 1 Wild with distinguished points, expected of 2w^(1/2) group operations
avg stat after 10tests
1core i5-2540, python37x64, win7x64
[i] 8.6K j/s; ...
...
[################################################]
[averages] expected of 2w^(1/2) group operations
-------|--------/--------|---------------------------------/---------------------------------|
W |jump avg/2w^(1/2)| time avg/2w^(1/2) |
-------|--------/--------|---------------------------------/---------------------------------|
>2^031 | 84.0K/ 92.7K | 0y 0m 0d 00:00:10s 002ms / 0y 0m 0d 00:00:11s 042ms |
----------------------------------------------------------------------------------------------
(that is raw Python!)
1core i5-2540, python37x64 + gmpy2, win7x64
[i] 73.6K j/s; ...
...
[################################################]
[averages] expected of 2w^(1/2) group operations
-------|--------/--------|---------------------------------/---------------------------------|
W |jump avg/2w^(1/2)| time avg/2w^(1/2) |
-------|--------/--------|---------------------------------/---------------------------------|
>2^031 | 90.0K/ 92.7K | 0y 0m 0d 00:00:01s 150ms / 0y 0m 0d 00:00:01s 185ms |
----------------------------------------------------------------------------------------------
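To make the "1 Tame + 1 Wild" walk concrete, here is a toy sketch in the multiplicative group mod a small prime instead of secp256k1 (all parameters are hypothetical). Because the group is tiny, every tame point is stored; in a real run only "distinguished" points (e.g. x-values with many trailing zero bits) would go into the table:

```python
# Toy "1 Tame + 1 Wild" kangaroo: recover k with h = g^k mod p, k in [0, w).
p, g = 10007, 3                  # toy group: <g> inside Z_p^*
k_secret = 4321                  # the discrete log we pretend not to know
h = pow(g, k_secret, p)         # public element h = g^k
w = 8192                         # interval width: k is known to lie in [0, w)
m = 8                            # number of jump sizes (1, 2, 4, ..., 128)

def solve(salt):
    # Precompute the jump points g^(2^j) once: only multiplications by table
    # entries (the group "additions") happen inside the hot loops below.
    jump = [pow(g, 2**j, p) for j in range(m)]
    tame = {}                    # position -> total exponent travelled
    x, d = pow(g, w, p), 0       # tame starts at the top of the interval
    for _ in range(4000):
        tame[x] = w + d
        j = (x + salt) % m       # pseudo-random jump choice
        d += 2**j
        x = x * jump[j] % p
    x, d = h, 0                  # wild starts at the unknown element
    for _ in range(200000):
        if x in tame:            # collision: g^(tame[x]) = g^(k + d)
            return tame[x] - d
        j = (x + salt) % m
        d += 2**j
        x = x * jump[j] % p
    return None                  # the two walks never merged; retry

# retry with a different salt if the two paths land in separate components
k = next(k for k in (solve(s) for s in range(20)) if k is not None)
print(pow(g, k, p) == h)
```

Note the recovered exponent may differ from k_secret by a multiple of the group order; checking pow(g, k, p) == h is the honest success test.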
#################
about secp256k1 library
Legendary main question from legendary master

(the root of all ECC problems)
Are you using affine or jacobian coordinates for the points?
42: the Answer to the Ultimate Question of Life, the Universe, and Everything. secp256k1 has the functions J+J->J and J+A->J, but has no A+A->A function (is this really a precisely optimized library for bitcoin?)
(it is simple: in practice you more often need to multiply a random point than to add one, and Jacobian coordinates are better for multiplication)
J+J->J (12M, 4S) - bad choice
J+A->J (8M, 3S) - already better
(see eprint.iacr.org/2016/103.pdf)
you can take the code from break_short by Jochen Hoenicke, it fits perfectly.
this code uses J+A->J and converts only X back to affine (xJ -> xA); that is the most we get from secp256k1 as it is
where A is the S-table of pre-computed points G, 2G, 4G, 8G, 16G, ..
..you used pre-computed points, yes? Don't tell me you compute every Si on the fly with point multiplication, no-no-no, please, no! X % mask instead of pow2(X % k)?
(..by the way, that would explain your terrible speed..)
..and use the older alternative functions without side-channel-attack protection, for +10-20% speed:
secp256k1_gej_add_ge_var() instead of secp256k1_gej_add_ge()
(warning: _var() needs a normalize before use in some functions.. my early mistake, "leading zeros": github.com/bitcoin-core/secp256k1/issues/601)
ideally, new functions need to be written:
A+A->A (1I, 2M, 1S) - the best choice, because in the end we need X in affine form anyway, as the unique representation of each point for insertion into the storage hashtable
(..you used a hashtable, yes?)
after which there is no need to spend 8M,3S per addition - only roughly 20M for each 1I, about a +35% speed-up
like that
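To illustrate the A+A->A formula (1I, 2M, 1S, ignoring the cheap additions and subtractions), here is a minimal pure-Python sketch over the real secp256k1 field; the inversion is done naively with pow(., p-2, p), and the batched-inversion trick that amortizes the 1I is not shown:

```python
# Affine point addition / doubling on secp256k1: y^2 = x^3 + 7 over F_p.
# add() is the A+A->A formula: 1 inversion, 2 multiplications, 1 squaring.
p = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(P, Q):                   # assumes P != Q and P != -Q
    lam = (Q[1] - P[1]) * pow(Q[0] - P[0], p - 2, p) % p   # 1I + 1M
    x3 = (lam * lam - P[0] - Q[0]) % p                     # 1S
    return (x3, (lam * (P[0] - x3) - P[1]) % p)            # 1M

def dbl(P):                      # doubling, for building test points
    lam = 3 * P[0] * P[0] * pow(2 * P[1], p - 2, p) % p
    x3 = (lam * lam - 2 * P[0]) % p
    return (x3, (lam * (P[0] - x3) - P[1]) % p)

G2 = dbl(G)
G3 = add(G, G2)
# sanity check: 3G must still satisfy the curve equation
print(G3[1] * G3[1] % p == (G3[0]**3 + 7) % p)
```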
########
I think I know what the problem is.
It is a very obvious fact to me; I should have started with it.
(I should have guessed back when ECMULT_WINDOW_SIZE came up)
Do not multiply points inside the loop! I warn you, point multiplication is very expensive - more expensive than the slowest point addition.
You should not use it at all in multi-million-iteration loops.
You must calculate all possible displacement (jump) points before the loop starts.
Then, inside the loop, just add them.
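A sketch of the point above, in modular arithmetic for brevity: precompute the m possible jump points once, so the hot loop does one multiplication per hop instead of one full exponentiation (toy parameters and a hypothetical x % m jump rule):

```python
# Precomputed jump table vs. exponentiation-in-the-loop: identical walks.
p, g, m = 10007, 3, 8
start = 1234

# slow: the jump point g^(2^j) is recomputed inside every iteration
x = start
for _ in range(1000):
    j = x % m
    x = x * pow(g, 2**j, p) % p
slow = x

# fast: the same walk, but all m jump points are precomputed up front
jump = [pow(g, 2**j, p) for j in range(m)]
x = start
for _ in range(1000):
    x = x * jump[x % m] % p
print(x == slow)
```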
I think you are using a mask
https://bitcointalk.org/index.php?topic=5166284.msg52068840#msg52068840
..so you cannot store several million pre-calculated points in memory; that explains your choice.
x = PRF(X) = 1 << (X % m)
...
ec_point_mul_and_add(r,a,n)
not a mask, I see.. then it is simple: you tried to build the classic a*b+c pattern
..rewrote the following functions..
I tell you: take the loop from break_short as it is (with the standard functions), rewrite it for kangaroos, and benchmark it.
if after that you still want to rewrite something, then ok (but I think you will not)
I do not think a "secret sauce" will come of it; it will turn out to be an exotic, slow implementation built on multiplication.
I would like to be wrong, but my experience says you are losing time.
..How do I know about multiplication? I checked it myself,.. and from the topic about why brainflayer is slow and why it cannot be made faster,.. and it came up on one of the pages of the LBC topic.
It is not a widely known fact, so it is not surprising that you did not know it; no fault.
It is not random, it is not brainflayer; it is a pseudo-random walk, so we can pre-compute our jump sizes (offset points) and use point addition instead of point multiplication.