Board Altcoin Discussion
Re: Thread about GPU-mining and Litecoin
by
tacotime
on 06/03/2012, 16:18:16 UTC
Quote
i have heard on btc-e chat that mtrlt made a half-speed version of the reaper13 DEMO...

the full version should work twice as fast...

so LTC GPU mining is profitable down to ~0.0007 BTC/LTC for him, and not for the public with ~0.0014...

i think there must be a new LTC algo to save the public from the grifter(s)...

Edit: the half-speed reaper13-DEMO release seems to be the compromise between making the release or not

The speeds mtrlt reported for the demo were the same as for the release; the rumor, as I understood it, was that his new SC2 miner was supposed to be twice as fast on GPUs.

I've wondered if the reason mtrlt has not released the source code is that it may borrow heavily from ssvb's code, as ssvb was the one who solved the in-place hashing of LTC.

Anyway, this outlines the difficulty of actually creating an algorithm that cannot easily be parallelized, so that it will run faster on a CPU than on a GPU.  I've been thinking about it some, and I've wondered whether using multiple algorithms and randomizing a number of the settings for each (and then possibly the order in which they are used) could generate tens or hundreds of thousands of possible algorithms, only one of which could possibly decrypt the next block.  The difficulty in parallelizing the work would come from having to assemble a large number of algorithms sequentially and then assess them.  It's hard to think of something a CPU can do better than a GPU when it's a repetitive task based on the same operations over a dataset...  Having to brute-force the actual construction of the algorithm may be something a GPU would struggle with, though.
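To make the idea concrete, here's a purely hypothetical toy sketch in Python — not any real coin's design, and the primitive names and parameters are my own invention: the previous block hash seeds a deterministic RNG that picks which hash primitives run, in what order, and with what round counts, so every node derives the same chain but the chain changes with every block.

```python
import hashlib
import random

# Illustrative primitives only; a real design would need far more variety.
def rounds_sha256(data: bytes, n: int) -> bytes:
    for _ in range(n):
        data = hashlib.sha256(data).digest()
    return data

def rounds_sha512(data: bytes, n: int) -> bytes:
    for _ in range(n):
        data = hashlib.sha512(data).digest()
    return data

def rounds_blake2b(data: bytes, n: int) -> bytes:
    for _ in range(n):
        data = hashlib.blake2b(data).digest()
    return data

def derive_chain(prev_block_hash: bytes, steps: int = 4):
    """Seed a deterministic RNG with the previous block hash, then pick
    the primitives, their order, and their round counts.  The chain is
    only knowable once the previous block is fixed."""
    rng = random.Random(prev_block_hash)
    primitives = [rounds_sha256, rounds_sha512, rounds_blake2b]
    return [(rng.choice(primitives), rng.randint(1, 8)) for _ in range(steps)]

def chained_hash(prev_block_hash: bytes, header: bytes) -> bytes:
    """Run the block header through the derived chain of primitives."""
    data = header
    for fn, n in derive_chain(prev_block_hash):
        data = fn(data, n)
    return data
```

With 3 primitives, 8 possible round counts, and 4 steps, this already gives (3 × 8)^4 ≈ 330,000 distinct chains; more primitives and steps would push it into the millions.  Whether that actually hurts GPUs is an open question, since a GPU could still just JIT-compile the one chain that matters for the current block.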

edit: There's a pretty neat paper comparing algorithms and their runtimes on GPUs versus CPUs.  There are sorting and physics algorithms that perform significantly slower on a CPU as compared to a GPU.

www.cs.utexas.edu/users/ckkim/papers/isca10_ckkim.pdf

More info on the mentioned sorting algorithm:
Quote
Our 4-core implementation is competitive with the performance on any of the other modern architectures, even though Cell/GPU architectures have at least 2X more compute power and bandwidth. Our performance is 1.6X–4X faster than Cell architecture and 1.7X–2X faster than the latest Nvidia GPUs (8800 GTX and Quadro FX 5600).
Also note that as the set size becomes larger, GPUs run out of memory and are unable to compute the set at all; extrapolating from the data, even if they could, they would still be slower than the CPU.
http://pcl.intel-research.net/publications/sorting_vldb08.pdf

The algorithm was subsequently surpassed by this GPU one; however, for small data sets Intel's TBB parallel sort is still faster.
http://mgarland.org/files/papers/gpusort-ipdps09.pdf
Apparently, over the past few years there has been a battle between Intel and Nvidia to find things that CPUs and GPUs each do better than the other, and there is a wealth of well-cited literature out there.

The algorithm was again overhauled, and CPU-based radix/merge sort still manages to beat out GPUs in a number of cases:
Quote
Comparing CPUs and GPUs: In terms of absolute performance of CPU versus GPU, we find that the best radix sort, the CPU radix sort, outperforms the best GPU sort by about 20%. The primary reason is that scalar buffer code performs badly on the GPU. This necessitates a move to the split code that has many more instructions than the buffer code. This is enough to overcome the 3X higher compute flops available on the GPU. On the other hand, the GPU merge sort does perform slightly better than the CPU merge sort, but the difference is still small. The difference is due to the absence of a single-instruction scatter, and the additional overheads, such as index computations, affecting GPU performance.
http://dl.acm.org/citation.cfm?id=1807207

So, it seems reasonable that an algorithm which depends heavily on radix sort will run faster on CPUs than on GPUs.  It's important that only non-multifield data be used, because a very fast GPU algorithm for that was implemented recently:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5713164
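For reference, here's a minimal single-threaded LSD radix sort in Python, just to make the bucket/scatter structure concrete — the papers above benchmark heavily tuned SIMD/multicore C implementations, not anything like this.  The scatter into buckets is exactly the irregular, data-dependent memory access the quoted paper says GPUs lack a single-instruction scatter for.

```python
def radix_sort(values, base=256):
    """LSD radix sort for non-negative integers.

    `base` must be a power of two so digits can be extracted with
    shifts and masks.  Each pass scatters values into `base` buckets
    by the current digit, then concatenates the buckets in order.
    """
    if not values:
        return []
    digit_bits = base.bit_length() - 1   # 8 bits per pass for base=256
    data = list(values)
    max_val = max(data)
    shift = 0
    while (max_val >> shift) > 0:
        buckets = [[] for _ in range(base)]
        for v in data:
            # The data-dependent bucket index makes this a scatter,
            # which is cheap on a CPU but awkward on a GPU.
            buckets[(v >> shift) & (base - 1)].append(v)
        data = [v for bucket in buckets for v in bucket]
        shift += digit_bits
    return data
```

Each pass is stable, so sorting from the least significant digit upward yields a fully sorted result after ceil(bits/8) passes — a handful of linear scans rather than the O(n log n) comparisons of merge sort.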

There's also a tree search algorithm here which runs significantly faster on CPUs/MICA than on GPUs for smaller data sets; if it could be integrated into an encryption algorithm, it might destroy GPU performance.

www.webislands.net/pubs/FAST__SIGMOD10.pdf
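One speculative way such a search could be wired into a hashing step — again a toy Python sketch of my own, not anything from the paper: derive a lookup table from a seed, then chain searches so each lookup depends on the previous result.  The chain is inherently sequential and latency-bound, which is the kind of workload where a cache-resident CPU search tends to beat a GPU.

```python
import bisect
import hashlib

def make_table(seed: bytes, size: int = 4096):
    """Derive a pseudorandom sorted lookup table from a seed by
    iterating SHA-256 and keeping 32-bit slices of each digest."""
    vals = []
    h = seed
    for _ in range(size):
        h = hashlib.sha256(h).digest()
        vals.append(int.from_bytes(h[:4], "big"))
    return sorted(vals)

def search_chain(table, start: int, steps: int = 64) -> int:
    """Run `steps` dependent binary searches over the table.

    Each search key is derived from the previous search's result, so
    the lookups cannot be done in parallel: the whole chain is one
    long dependency of cache/memory latencies.
    """
    x = start
    limit = table[-1] + 1
    for _ in range(steps):
        i = bisect.bisect_left(table, x % limit)
        i = min(i, len(table) - 1)
        # Mix the looked-up value back into the state (32-bit arithmetic).
        x = (x * 2654435761 + table[i]) & 0xFFFFFFFF
    return x
```

Whether this actually resists GPUs would depend on table size (it has to overflow GPU-friendly fast memory but fit in CPU cache) and on how many independent nonces a GPU could batch side by side, so treat it as a thought experiment rather than a proposal.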