Post
Topic
Board Altcoin Discussion
Re: The Ethereum Paradox
by
finitemaz
on 09/06/2016, 17:35:51 UTC
Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,
and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.
Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed.
I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle.

Did he consider  (especially parallelized) multi-collision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning?

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)
to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversion
resistance can make finding cycles of length 42 in a graph defined by millions or billions of siphash
outputs easier is just... utterly inconceivable. That's the gist of it.

Please distinguish preimage attacks from multi-collision attacks. If we can find certain characteristics in the hash function which are not perfectly random and are repeatable relationship with certain changes of bits in the key from hash-to-hash, then we can perhaps impact the cycling in the Cuckoo. Cycle forming or the way we optimize detecting them, may be impacting by certain patterning away from a perfect Random Oracle.

Think of a differential cryptanalysis where we look for the way a hash changes over differentials from one hash to the next. And also think of the parallelized rho attack. The weakness in Cuckcoo w.r.t. to hashing it is isn't doing one hash, but millions of them and so the dynamic randomness of correlation over hash differentials also has to be considered.

I am not comfortable with cheating margins of security on the hash function. Sorry. I think Zooko @ Zcash was correct to be concerned even though he didn't express what sort of attack might concern him. And remember he worked on the Blake2s cryptographic hash function but I understand he isn't a cryptographer per se (and neither am I). Crypto-currency is also about being able to trust the cryptography is strong and sound. I don't see the advantage of cheating on the hash function used in the Cuckoo Cycle, especially given I don't think you attain ASIC resistance any way, which was the entire motivation for replacing SHA256 with Siphash in order to lower the computation to memory consumption ratio.

You are apparently thinking of inversion and not about what matters. I doubt Dan really thought deeply about what he was commenting on off-the-cuff.

And Btw, you write about in our past discussions and also in your white paper (or blog) about 1/3 computation and 2/3 memory accesses, but this is execution time, not energy cost. The power cost of loading a row buffer on DRAM is very cheap because the circuits are much more dense and the page is localized. The cost of computation on a CPU is very costly because the circuits are spread out all over the chip and not as dense as 1 transistor per bit of DRAM. I think it is a losing strategy to try to say 1 hash per 1 memory access (meaning you expect 1 memory access to load a new memory row) is going to only result in a 1/3 speedup for the ASIC. And besides below I will explain how I think the assumption of 1 hash per memory row buffer load can I think be avoided by the ASIC running many h/w threads.

I think you are referring to something entirely unrelated to what I was referring to. I see int here in the code I am referring to:

https://github.com/tromp/cuckoo/blob/gh-pages/cuckoo.c#L10

That's the basic algorithm, not the reference miner, which is in
https://github.com/tromp/cuckoo/blob/master/src/cuckoo_miner.h
for CPUs and
https://github.com/tromp/cuckoo/blob/master/src/cuda_miner.cu
for GPUs.
These spend most time doing edge trimming,
so that's what this discussion should focus on.

I was referring to the "simple array in the basic algorithm" which was from the February 1, 2014 version of your white paper. I note you mentioned the edge trimming in the December 31, 2014 update.

I was avoiding studying the edge trimming code and was focusing on the "basic algorithm".

Any way, if the edge trimming is done after running enumerating all possible nonces, then it is not a heuristic and is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.

So it is reducing memory consumption and I presume eliminating the wasted code paths for walking leaf edges in the final basic algorithm.

Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.

In any case, it seems it doesn't change my theory about the lack of ASIC resistance. Rather it just reduces the size of the data being randomly read/write from 32-bits to 2-bits for the majority of running time.

Also you do note that the edge trimming is applicable for the case where many/most of the edges will leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now.  Wink

Okay so now we are on the same page, so I can proceed to reply to what you had written before...

In one round of edge trimming, Cuckoo Cycle typically updates 2^29 counters
spread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,
only 2^14 of which typically fit in a single row. This is where the latency comes in.
To avoid latency, you could increase the number of memory banks by a factor of 2^7,
but this similarly increases die area, hardware costs, and energy usage.
Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row.

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I don't see why you multiply with the number of rows that Cuckoo uses per memory bank.
Your 2^15 threads will each hash some nonce and then most will want to access a unique row.
Then what? There's little to be synchronized.

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number
of counters in use at the same time, which is what your proposal essentially amounts to.
Setting this to 21 will require only 2^8 counters,
one per memory bank. But now your hash computations have increased by a factor 2^21,
over 2 million.
No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.


In my code a thread is something that processes a sequence of nonces, hashing each one, and
then atomically updating a corresponding counter, without any coordination.

Are you proposing a thread for each nonce? Half a billion threads?
I really don't understand what you are proposing and how this syncing is supposed to be implemented.

It doesn't matter that the thread proceeds to do something else after read a nonce. It only matters that enough h/w threads are running to enable syncing many reads on the same row buffer load without stalling all the threads and gridlock.

The ASIC will need some custom timing and syncing logic.

I think you really didn't understand what Bill Cox was trying to explain.

Wasting time! I want to see the release! Wink