Some takeaways so far...
- DER integers are signed, so a positive value whose top bit is set gets an extra leading zero byte (otherwise it would be read as negative); this means the PoW effort is double (for example, I found the 3-byte puzzle after more than 33 million hashes, not after 16.7 million)
- two possible nonces (-1/2 and 1/2) = twice as many S candidates (not just (z+r)*2)
- 2 SHA256 + 1 RIPEMD160 for every output address candidate.
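
To make the DER point concrete, here is a minimal Python sketch of how a non-negative integer is encoded as a DER INTEGER. The helper name `der_int` is my own, not from any library, and it only handles the short length form (up to 127 content bytes), which is enough for signature values. It shows why only values below 2^23 fit in 3 content bytes, which is half of the 2^24 range you might naively expect.

```python
def der_int(value: int) -> bytes:
    """DER-encode a non-negative integer (hypothetical helper, short-form length only)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # Minimal big-endian representation; zero still takes one byte.
    body = value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")
    # DER integers are two's complement: if the top bit is set, the value
    # would be read as negative, so a 0x00 byte is prepended.
    if body[0] & 0x80:
        body = b"\x00" + body
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    # 0x02 is the INTEGER tag, followed by the length and the content bytes.
    return bytes([0x02, len(body)]) + body


print(der_int(0x7FFFFF).hex())  # 02037fffff   -> 3 content bytes
print(der_int(0x800000).hex())  # 020400800000 -> 4 content bytes (padding added)
```

So the largest value that still encodes in 3 content bytes is 0x7FFFFF, not 0xFFFFFF.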
I'm now rewriting the search in CUDA, aiming for a speedup of about 1,000,000x over the shitty pure Python version.