Post
Topic
Board Altcoin Discussion
Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin?
by
TPTB_need_war
on 07/02/2016, 16:28:22 UTC
However, what a GPU (which starts with 4 - 10X worse main memory latency than CPUs)

Where do you get those numbers? What I can measure is that a GPU has a 5x higher throughput
of random memory accesses. I don't know to what extent that is due to more memory banks in the GPU,
but either way it makes your numbers hard to believe.

From my old rough draft:

Quote
The random access latency of Intel's L3 cache [13] is 4 times faster than DRAM main memory [2] and 25 times faster than GPU DDR main memory [14].

[14] http://www.sisoftware.co.uk/?d=qa&f=gpu_mem_latency&l=en&a=9
     GPU Computing Gems, Volume 2, Table 1.1, Section 1.2 Memory Performance

Unfortunately that cited page has disappeared since 2013, but you can use their software to measure it yourself. Note that figure refers to a single sequential process.

You are referring to the latency when the GPU is running multiple instances or (in Cuckoo's case) otherwise exploiting parallelism in the PoW proving function. Of course the effective latency drops then, because the GPU is able to schedule simultaneous accesses to the same memory bank (or can it schedule accesses to more than one memory bank simultaneously? I have read that DRAM gets faster because of increasing parallelism).

Edit: Try these:

http://courses.cms.caltech.edu/cs101gpu/2015_lectures/cs179_2015_lec05.pdf#page=11
http://stackoverflow.com/questions/13888749/what-are-the-latencies-of-gpu

Edit#2: http://arxiv.org/pdf/1509.02308.pdf#page=11
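Since the disagreement above is partly raw latency versus amortized latency, here is a toy model of the amortization effect (my own sketch; the 400 ns figure and the `max_parallel_banks` cap are hypothetical illustrations, not measurements from either post):

```python
# Toy model (illustrative only): effective per-access latency when many
# independent memory requests are kept in flight. All numbers hypothetical.

def effective_latency_ns(raw_latency_ns, in_flight, max_parallel_banks):
    """Per-access latency once `in_flight` independent requests overlap.

    Overlap is capped by how many banks can serve requests concurrently;
    beyond that cap, extra requests only queue up.
    """
    overlap = min(in_flight, max_parallel_banks)
    return raw_latency_ns / overlap

# A single sequential process sees the full raw latency:
print(effective_latency_ns(400, 1, 16))    # 400.0 ns
# Dozens of interleaved instances amortize it, until banks saturate:
print(effective_latency_ns(400, 32, 16))   # 25.0 ns (bank-limited)
```

This is consistent with both observations: a single sequential prover sees the full random-access latency, while a massively parallel workload measures something much lower.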

and especially an ASIC will do to get better DRAM amortization (if not also lower electricity consumption due to less latency) is run dozens or hundreds of instances of the proving algorithm with their memory spaces interleaved, such that the latencies are combined and amortized over all instances. The effective latency drops because reading from the same memory bank of DRAM is latency-free when multiple accesses within the same bank are combined into the same transaction.

This makes no sense to me. When all your memory banks are already busy switching rows on every
(random) memory access, then every additional PoW instance you run will just slow things down.
You cannot combine multiple random accesses, because the odds of them being in the same row
are around 2^-14 (the number of rows).

If the odds are high enough then I agree, and that is why I said increasing the size of the memory space helps. For example, with a 128 KB memory space and 32 KB memory banks, the odds will only be roughly 1/4 (the actual computation is more complex than that), not 2^-14.
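The back-of-the-envelope odds in question can be sketched as follows (my own illustration, not either poster's exact model; the 2 KB row size in the second example is a hypothetical placeholder). With U equally likely units (banks or rows), the chance that a second uniformly random access lands in the same unit as the first is simply 1/U:

```python
# Sketch of same-bank / same-row collision odds for two random accesses.

def collision_odds(space_bytes, unit_bytes):
    """Odds that two uniform random accesses fall in the same bank/row."""
    units = space_bytes // unit_bytes
    return 1.0 / units

# 128 KB memory space with 32 KB banks -> 4 banks -> odds roughly 1/4:
print(collision_odds(128 * 1024, 32 * 1024))       # 0.25
# Whereas with ~2^14 rows, a same-row collision is negligible:
print(collision_odds(2**14 * 2 * 1024, 2 * 1024))  # ~6.1e-05 (i.e. 2^-14)
```

So whether combining accesses pays off depends entirely on the ratio of memory-space size to bank size, which is the point of contention here.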

I am not an expert on the sizing of memory banks or the implications of increasing them.