Board: Altcoin Discussion
Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin?
by TPTB_need_war on 30/01/2016, 08:42:50 UTC
I decided to publish a section of my research on memory-hard PoW hash functions from 2013. This is only the first section, where I analyzed Scrypt. I may have published this section before, when Monero was first announced and I was publicly debating with one of the Monero dudes about CryptoNote's PoW hash CryptoNight (yet another discussion that turned condescending and foul-mouthed). Note there are 23 references cited in the complete version of this paper. This explains why CryptoNight employs AES-NI instructions to defeat the FLOPS and superior memory bandwidth advantage of the GPU.

I claim my analysis of tromp's Cuckoo Cycle below predates the following review, as I added it to my paper immediately after tromp posted about his new paper:

David Andersen. A public review of cuckoo cycle. http://www.cs.cmu.edu/dga/crypto/cuckoo/analysis.pdf, 2014
http://da-data.blogspot.com/2014/03/a-public-review-of-cuckoo-cycle.html



Scrypt ROMix

   1: X <- B
   2: for i = 0 to N-1 do
   3:    V[i] <- X
   4:    X <- H(X)
   5: end for
   6: for i = 0 to N-1 do
   7:    j <- Integerify(X) mod N
   8:    X <- H(X ^ V[j])
   9: end for
  10: B' <- X

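A minimal runnable sketch of ROMix follows, purely for illustration; SHA-256 stands in for Scrypt's actual H (which is BlockMix over Salsa20/8) and the parameters are the sketch's assumptions rather than Scrypt's.

   import hashlib

   def H(x: bytes) -> bytes:
       # Stand-in for Scrypt's H (really BlockMix/Salsa20/8); SHA-256 keeps the sketch self-contained.
       return hashlib.sha256(x).digest()

   def integerify(x: bytes, n: int) -> int:
       # Interpret the block as a little-endian integer, as Scrypt's Integerify does, then reduce mod N.
       return int.from_bytes(x, "little") % n

   def romix(b: bytes, n: int) -> bytes:
       x = b
       v = []
       for _ in range(n):          # first loop: sequential writes of V[i]
           v.append(x)
           x = H(x)
       for _ in range(n):          # second loop: random reads of V[j]
           j = integerify(x, n)
           x = H(bytes(a ^ c for a, c in zip(x, v[j])))
       return x

   print(romix(b"\x00" * 32, 1024).hex())
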
Without parallelism, the execution time of ROMix is bounded by the greater of the execution time of the hash function H and the random access memory latency to read V[j].

Without parallelism, the memory bandwidth to write V[i] cannot be a significant factor. Even if the execution time of the first loop were limited by memory bandwidth rather than by H (random access latency does not apply there, since the writes are sequential), the random access latency to read V[j] in the second loop is still slower than the sequential memory bandwidth of the first loop.

Percival's sequential memory-hard proof [1] states that redundant recomputation of H in exchange for a reduced memory footprint will not be asymptotically faster. For example, even if H is asymptotically (as N goes to infinity) perfectly distributed in the Random Oracle model so that the second loop accesses each V[j] at most once, the H for each second element will be computed twice when the memory footprint is reduced by ⅔ by recomputing the H for each second and third element in the second loop instead of retrieving it from memory.

Percival's proof fails if the random access latency is not insignificant compared to the execution time of H, because the execution time of H in a parallel thread is effectively free: it is masked by the latency during which the other thread is stalled. This is why BlockMix is employed to increase the execution time of H.

Consider the example where the execution of H is twice as fast as the random access memory latency, i.e. H executes in ½ the delay of each random access. Analogous to cpuminer's "lookup gap" of 3, the computation of H for each second and third element of V[j] can be repeated in the second loop instead of retrieved from memory. This gives ⅓ the memory requirements, an average execution time equal to the latency (indicated by the computed value of 1 below), and only a ½ × ⅓ = 1/6 average increase in computational cost for accessing the third elements, which is masked by the ½ of the latency not offset by H in line 8. Each first, second, and third element of V[j] has a ⅓ probability of being accessed, so the relative execution time is computed as follows.

   ½ × ⅓ + (½ + ½) × ⅓ + (½ + ½ + ½) × ⅓ = 1

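The trade-off can be sketched in runnable form as follows; the gap of 3 and the SHA-256 stand-in for H are assumptions chosen to match the worked numbers above, not cpuminer's actual implementation. With a gap of g the memory footprint shrinks to 1/g while the second loop performs on average (g-1)/2 extra computations of H per access.

   import hashlib

   def H(x: bytes) -> bytes:
       return hashlib.sha256(x).digest()   # stand-in for Scrypt's H

   def romix_lookup_gap(b: bytes, n: int, gap: int = 3) -> bytes:
       x = b
       v = {}
       for i in range(n):                  # first loop: store only every gap-th V[i]
           if i % gap == 0:
               v[i] = x
           x = H(x)
       for _ in range(n):                  # second loop: recompute the dropped elements on demand
           j = int.from_bytes(x, "little") % n
           vj = v[j - j % gap]
           for _ in range(j % gap):        # j % gap extra H computations replace the memory read
               vj = H(vj)
           x = H(bytes(a ^ c for a, c in zip(x, vj)))
       return x
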
Since GPU GDDR main memory has nearly an order-of-magnitude higher random access latency than the CPU's DRAM, the GPU employs a "lookup gap" to reduce the memory footprint, allowing more parallel instances of Scrypt to execute simultaneously, up to the available number of threads and memory bandwidth. The GPU's order-of-magnitude higher memory bandwidth allows running more parallel instances of the first loop. Thus the superior FLOPS of the GPU are fully utilized, making it faster than the CPU.

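Rough capacity arithmetic illustrates why memory stops being the binding constraint; every number below is an assumption for illustration, not a measurement.

   scrypt_footprint_bytes = 1024 * 128        # Litecoin-style scrypt: N = 1024 blocks of 128 bytes
   lookup_gap = 3                             # footprint divided by the gap, paid for in recomputation
   gpu_memory_bytes = 4 * 1024**3             # a hypothetical 4 GB card
   per_instance = scrypt_footprint_bytes // lookup_gap
   print(gpu_memory_bytes // per_instance)    # ~98k instances fit by memory alone, so the real
                                              # limits are thread count and memory bandwidth
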
L3crypt

[...]

Even without employing "lookup gap", the GPU could potentially execute more than 200 concurrent instances of L3crypt, leveraging its superior FLOPS to offset both its 25x slower main memory latency and the CPU's 8 hyperthreads. To defeat this, the output of L3crypt should be hashed with a cryptographic hash that leverages the CPU's AES-NI instructions, with enough rounds to roughly equal the computation time of L3crypt. GPUs are roughly at parity with AES-NI in hashes per watt [24].
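
A hedged sketch of that finishing step follows, assuming PyCryptodome's AES (which the CPU accelerates with AES-NI) and a simple Davies-Meyer-style chaining with a tunable round count; the construction and the round count are illustrative, not the paper's actual specification.

   from Crypto.Cipher import AES   # PyCryptodome; AES-NI accelerates this on the CPU

   def aesni_finisher(l3crypt_output: bytes, rounds: int = 100000) -> bytes:
       # Davies-Meyer-style chaining: the running 16-byte state keys AES over a fixed block
       # and is XORed back in; rounds would be tuned so the chain takes roughly as long as L3crypt.
       state = l3crypt_output[:16].ljust(16, b"\x00")
       block = l3crypt_output[16:32].ljust(16, b"\x00")
       for _ in range(rounds):
           ct = AES.new(state, AES.MODE_ECB).encrypt(block)
           state = bytes(a ^ b for a, b in zip(ct, state))
       return state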

[...]

Asymmetric Validation

Verification—that a hash output is valid for a given input—should be orders-of-magnitude more efficient than computing the hash.

The validation ratio, when the hash is used in a proof-of-work system, also depends on how fast the hash is, because validation requires only one hash whereas solving the proof-of-work requires as many hashes as the collective difficulty, or more importantly the pool's minimum share difficulty, dictates.
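
As a rough illustration (the numbers below are assumptions, not measurements): the share difficulty alone already supplies a solve-versus-verify asymmetry, while a slower hash raises the absolute cost of each verification, and a hash whose verification is far cheaper than a solve attempt (as Cuckoo Cycle aims for) multiplies the ratio further.

   def share_asymmetry(hash_time_s: float, verify_time_s: float, share_difficulty: float) -> float:
       # Expected work a miner expends per share versus the pool's work to verify that share.
       solve_s = share_difficulty * hash_time_s     # expected hashes to meet the share target
       return solve_s / verify_time_s               # one verification per submitted share

   # A 10 ms memory-hard hash whose verification also costs one full hash, at share difficulty 65536:
   print(share_asymmetry(0.010, 0.010, 65536))      # 65536.0
   # The same hash with verification 1000x cheaper than a solve attempt:
   print(share_asymmetry(0.010, 0.00001, 65536))    # 65536000.0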

The Cuckoo Cycle hash [19] significantly increases the asymmetric validation ratio, but has unprovable security because its algorithm is based on permuted orderings which do not incorporate diffusion and confusion [20] and is thus not provably secure in the Random Oracle model, i.e. we can't prove there aren't algorithms that speed up finding solutions. Although its entropy is large, i.e. the factorial permutation of all buckets in the hash space, it doesn't require that its entire memory space be accessed, thus bit flags could possibly be employed to reduce the memory used and make it faster.

The Cuckoo Cycle hash isn't a personal-computer-only hash, because it requires only a minimal CPU and it is non-sequentially latency bound: it can be parallelized over the same memory space, masking much of the latency. Thus, for example, Tilera server CPUs will outperform Intel's personal computer CPUs, because Tilera has 3X to 12X more hardware threads per watt and isn't plagued by the GPU's slow main memory latency. For L3crypt, in contrast, no further parallelization is possible, so even though, compared to the 8-core Xeon E5 or Haswell-E, the Tilera has the same L3 cache size [21] and 50% to 67% of the power consumption, its latency is 2X greater [22] and each clock cycle is 2X to 3X slower. Although parallelization can be applied to computing H to try to match Intel's AVX2 acceleration, L3crypt remains sequentially memory-latency bound.

[...]

Future Proof


CPU memory bandwidth is doubling approximately every four years [7], with up to a 50% improvement expected by 2015 [6], and memory size is doubling approximately every two years, which is roughly consistent with Moore's Law's expected doubling of performance every 18 months [8], computed as follows.

   2^(years/2) × 2^(years/4) = 2^(3 × years/4) ≈ 2^(years/1.33)

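Under the assumed rates above, the combined doubling period works out to roughly 16 months, in the same ballpark as Moore's 18-month figure:

   bandwidth_rate = 1 / 4                 # doublings per year (bandwidth doubles every four years)
   size_rate = 1 / 2                      # doublings per year (size doubles every two years)
   combined_period_years = 1 / (bandwidth_rate + size_rate)
   print(combined_period_years)           # 1.333... years, i.e. roughly 16 months
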
However Percival noted that memory latency is not following Moore's Law [1].

References

[1] Percival, Colin. Stronger key derivation via sequential memory-hard functions.
    BSDCan'09, May 2009. http://www.tarsnap.com/scrypt/scrypt.pdf

[...]

[6] http://www.pcworld.com/article/2050260/hefty-price-premium-awaits-early-ddr4-memory-adopters.html

[7] http://www.timeline-help.com/computer-memory-timeline-4.html

[8] http://en.wikipedia.org/wiki/Moore's_law#cite_note-IntelInterview-2

[...]

[19] https://github.com/tromp/cuckoo/blob/master/cuckoo.pdf

[20] http://en.wikipedia.org/wiki/Confusion_and_diffusion
     http://www.theamazingking.com/crypto-block.php

[21] http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8072_PB041-03_WEB.pdf

[22] http://www.tilera.com/scm/docs/UG101-User-Architecture-Reference.pdf#page=369

[...]

[24] https://www.iacr.org/workshops/ches/ches2010/presentations/CHES2010_Session06_Talk03.pdf#page=16