
Showing 20 of 39 results by ssvb
Board: Altcoin Discussion
Topic: Re: OpenCL CPU miner for Scrypt
Posted by ssvb on 26/09/2013, 12:49:09 UTC
Good luck trying to beat cpuminer when you are hindered by an extra abstraction layer and can't use assembly instructions directly :)
Topic: Re: LTC miner optimizations for PowerPC (Power Mac) and Cell/BE (PlayStation 3)
Posted by ssvb on 12/09/2013, 09:44:38 UTC
In Debian you can try "apt-get install build-essential libtool libltdl-dev automake" (I hope I did not forget something in this list). The autogen.sh script needs autotools installed to generate the configure script. This stuff is often already installed on computers used for software development, but may indeed be missing on some systems.

edit: hmm, actually based on the error message, the autotools might already be fine. Try "apt-get install libcurl-dev" to see if this resolves the curl-related issues.
Topic: Re: A RAM based fpga LTC miner
Posted by ssvb on 01/09/2013, 03:22:02 UTC
You used a lot of double speak.
Nah, it's just you still having some trouble understanding :)

Quote
First, I am aware of the space-time tradeoff, however rather than explain it in every single post it is useful to look at the max scratchpad size. A 128KB scratchpad is going to require less memory and less bandwidth than a 16MB scratchpad if everything else is the same.
Let's have a look at the definition of what is "memory hard" in the scrypt paper: "A memory-hard algorithm is thus an algorithm which asymptotically uses almost as many memory locations as it uses operations; it can also be thought of as an algorithm which comes close to using the most memory possible for a given number of operations, since by treating memory addresses as keys to a hash table it is trivial to limit a Random Access Machine to an address space proportional to its running time", "Theorem 2. The function SMixr(B, N) can be computed in 4 * N * r applications of the Salsa20/8 core using 1024 * N * r + O(r) bits of storage"

You can see that scrypt is just equally memory hard for all scratchpad sizes. The ratio between the number of scratchpad access operations and the number of salsa20/8 calculations remains the same.
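To make the "equally memory hard" point concrete, here is a tiny numeric check of Theorem 2 as quoted above (a sketch that just restates the paper's formulas; it is not an scrypt implementation):

```python
# Theorem 2: SMix_r(B, N) takes 4*N*r Salsa20/8 applications and
# 1024*N*r bits of storage (ignoring the O(r) term).  The ratio of
# computation to scratchpad size is therefore independent of N.
r = 1
for n in (2**10, 2**14, 2**20):
    ops = 4 * n * r        # Salsa20/8 core applications
    bits = 1024 * n * r    # scratchpad size in bits
    print(f"N=2^{n.bit_length() - 1}: ops/bits = {ops / bits}")  # 1/256 every time
```

Growing N scales the memory and the work together, which is exactly what the "memory hard" definition asks for.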

Quote
As for a higher parameter value having no effect on the relative performance of CPU, GPU, and FPGA/ASICs, that is just false.
What I said was "Increasing the size of the scratchpad is not going to bring any improvements (if by improvements you mean making CPU mining more competitive)". How did it turn into "no effect on the relative performance of CPU, GPU, and FPGA/ASICs"? CPU miners are at a serious disadvantage right now, so the effect on the relative performance would have to be really significant in favour of the CPU in order to turn the tables.

In practice, increasing the size of the scratchpad will make it harder to fit in CPU caches. To mitigate the unwanted latency of random accesses, scrypt uses the parameter 'r'. Basically, if r=1 (the default for LTC), then the scratchpad is accessed as 128-byte chunks at random locations. If r=8, then the memory accesses are done as 1024-byte chunks at random locations. In the former case, the cache miss penalty is paid once per 128 bytes. In the latter case, the cache miss penalty is paid once per 1024 bytes (the sequential accesses after the first cache miss are automatically prefetched, at least in theory). A high 'r' value reduces the effect of the memory access latency penalty for the CPU. And latency is not an issue for the GPU in the first place.

Additionally, if the CPU has to access the memory, then the memory controller must have enough bandwidth. For example, my Core i7 860 processor currently gets ~29 kHash/s in cpuminer. And the STREAM benchmark (built as multithreaded with OpenMP support) shows ~10GB/s of practically available memory bandwidth. These ~10GB/s would translate to a theoretical hashing speed hard limit of ~38 kHash/s if the CPU caches were not helping. There is not much headroom as far as I can see, and my processor does not even have AVX2.
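The ~38 kHash/s figure follows from simple arithmetic (a sketch; the 10 GB/s input is the STREAM result quoted above, and the 256 KB per hash assumes the caches do not help at all):

```python
# Each LTC scrypt hash (N=2^10, r=1) moves ~256 KB through memory when nothing
# is cached: 128 KB of sequential scratchpad writes plus 128 KB of random reads.
bytes_per_hash = 2 * 128 * 1024
stream_bandwidth = 10e9                       # ~10 GB/s measured with STREAM
ceiling = stream_bandwidth / bytes_per_hash
print(f"bandwidth-limited ceiling: ~{ceiling / 1000:.0f} kHash/s")  # ~38
```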

Quote
Scrypt was designed to be GPU and specialized device resistant. This is important in password hashing as most servers are using CPUs and an attacker will likely choose the most effective component for brute forcing. By making CPU performance superior it prevents attackers from gaining an advantage. You can test this yourself. Modify the cgminer OpenCL kernel to use a higher p value. Around 2^14, GPU relative performance is essentially gone. It is comparable to CPU throughput. At 2^16, GPU relative performance is falling far behind.
This is confusing; did you actually mean the 'N' value? Please just provide the patches for your changes to cgminer and cpuminer that you used for this comparison.

But in general, GPU tuning is not easy because there are many parameters to tweak. A poorly selected configuration can result in poor hashing performance even for the LTC scrypt. You can find many requests for help with the configuration in the forum. So your poor performance report does not mean anything.

Quote
At 2^20 the GPU never completes.
And surely you can raise the memory requirements so high that they would make mining problematic on the current generation of video cards, purely thanks to an insufficient amount of GDDR5 memory. But guess what? In a year or so, the next generation of video cards will have more memory and suddenly GPU mining will again become seriously better than CPU mining. Designing the algorithm around some magic limits which may become ineffective at any time is not the best idea. The current "small" scratchpad size for scrypt focuses on memory bandwidth instead of relying on artificial limits such as memory size (which can be easily increased, especially in custom built devices).

Quote
You say on one hand that the memory requirement doesn't matter and on the other hand that FPGAs are hard because they need lots of memory and wide buses. Well guess what, the higher the p value, the MORE memory and wider buses are needed. At 2^14, roughly 128x the max scratchpad size is going to mean 128x as much bandwidth is necessary.
Yes, but only if also backed by roughly 128x more computational power. And likewise, the enormous computational power of FPGA/ASIC must be backed by a lot of memory bandwidth, otherwise it will be wasted.

Quote
So the lower the p value, the EASIER the job is for FPGA and ASIC builders. They can use less memory and narrower buses, which means less cost, less complexity, and higher ROI%. Sure, one isn't required to use the max scratchpad size because one can compute on the fly, but once again the whole point of the space-time tradeoff is that the advantage of doing so is reduced.
They can't use slower external memory, because it already needs to be damn fast.

Quote
Lastly, yes, the 128KB is per core, but so is the 16MB using the default parameters. If 128KB per core increases memory, bandwidth, and/or die size per core, then a 16MB requirement would make it even harder.
Yes, the absolute hashing speed would just drop significantly with the 16MB scratchpad. But it would drop on CPU, GPU, FPGA or any other kind of mining device.

Quote
So yes the parameters chosen by LTC makes it 128x less memory hard than the default.
Sigh. Please just read the definition of "memory hard" in the scrypt paper.

Quote
You use circular logic to say the max scratchpad size is irrelevant because one can optimize the size of the scratchpad to available resources. This doesn't change the fact that due to the space-time tradeoff you aren't gaining relative performance. Using a higher max scratchpad requires either more memory and bandwidth OR more computation. The throughput on the FPGA, GPU, and CPU is going to be reduced. Now if they were all reduced equally it wouldn't matter, as all that matters is relative, not nominal, performance.
Wait a second, where does this "reduced equally" come from? The space-time tradeoff just means that if you have a system with excessive computational power but slow memory, then you can still tweak lookup-gap to trade one for the other. That is, instead of being at a huge disadvantage compared to a more optimally balanced system. This kinda "equalizes" the systems with vastly different specs, which is the total opposite of "reduces equally".

Quote
However the LTC parameters chosen are horrible for CPU usage. CPUs have a limited ability for parallel execution. Usually 4 or 8 independent cores. 128KB per core * 8 = 1MB.
This just means that you don't know much about CPU mining. The point is that modern superscalar processors can execute more than one instruction per cycle; this is called instruction-level parallelism. Also there are instruction latencies to take care of. In order to fully utilize the CPU pipeline, each thread has to calculate multiple independent hashes in parallel. Right now cpuminer calculates 3 hashes at once per thread (or even 6 with AVX2). Now do the math.
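Doing that math explicitly (the 8 cores and the 3 hashes per thread are the figures from the discussion above):

```python
threads = 8                # independent CPU cores, as in the quoted example
hashes_per_thread = 3      # cpuminer interleaves 3 hashes per thread (6 with AVX2)
scratchpad = 128 * 1024    # bytes per hash for LTC scrypt (N=2^10, r=1)

working_set = threads * hashes_per_thread * scratchpad
print(working_set // 1024, "KB")  # 3072 KB, i.e. 3 MB rather than the claimed 1 MB
```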

Quote
That's right, today with systems that can install multiple GB for a very cheap cost, the scrypt parameters chosen bottleneck performance on a CPU. GPUs on the other hand are highly parallel execution engines, but they have limited memory and that memory is at a higher cost than CPUs have access to.
The memory must be also fast, not just large.

TL;DR

For the external memory, I'm assuming that sufficient size is available for as many cores as is practically useful (in this case only the memory bandwidth is an important factor). For the on-chip SRAM memory, the bandwidth should not be a problem as the memory can be tightly coupled with each scrypt core, but the size does matter and can never be large enough (the CPU caches are really small compared with DDR memory modules for a reason). The current best performing scrypt mining devices (AMD video cards) rely on external memory bandwidth. This FPGA design seems to be essentially a GPU clone.
Topic: Re: A RAM based fpga LTC miner
Posted by ssvb on 31/08/2013, 22:01:27 UTC
LTC uses the parameters (2^10, 1, 1) which results in a token 128KB max scratchpad size. That isn't a typo, it is kilobytes.
You are just forgetting to multiply this scratchpad size by the number of "cores", "threads" or other such entities (what you call them depends on the underlying technology) in the mining device. All these "cores" are simultaneously doing hash calculations, each with its own scratchpad. The reason why FPGAs and ASICs work so great for SHA-256 is that the number of gates needed for a single SHA-256 "core" is really small, so one can fit an enormous number of such cores on a single chip. But each scrypt "core" needs a scratchpad for storing intermediate data, and if the scratchpad is implemented as SRAM, then the number of gates per scrypt "core" just skyrockets. You can fit significantly fewer scrypt "cores" on a single chip than SHA-256 "cores".

There are some tricks for scratchpad size reduction (LOOKUP_GAP is the right keyword, you can search for it in the forum), but this reduction is not free and results in more computations. That's why you can see some people mentioning the space-time tradeoff. The optimal lookup-gap setup depends on the balance between the memory size/performance and the computational power available for arithmetic operations. It is also possible to use external memory instead of on-chip SRAM, but the external memory must naturally have wide buses and a lot of bandwidth (memory latency is not critical for scrypt though). The scrypt GPU miners rely on GDDR5 speed, with a popular scratchpad size configuration being 64KB (lookup-gap=2), which indicates that the memory speed is the bottleneck and the excessive computational power is already being traded off to reduce the burden on the memory.
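For illustration, here is a toy Python sketch of the lookup-gap idea. The hash H is a stand-in for the real Salsa20/8 block mix (so the numbers are not scrypt), but the space-time tradeoff mechanics are the same: only every gap-th scratchpad entry is stored, and the skipped entries are recomputed on demand.

```python
import hashlib

def H(x: bytes) -> bytes:
    # Toy substitute for the Salsa20/8 block mix, for illustration only.
    return hashlib.sha256(x).digest()

def integerify(x: bytes, n: int) -> int:
    return int.from_bytes(x[:4], "little") % n

def romix(block: bytes, n: int, gap: int = 1) -> bytes:
    # First loop: fill the scratchpad, keeping only every gap-th entry.
    v, x = [], block
    for i in range(n):
        if i % gap == 0:
            v.append(x)
        x = H(x)
    # Second loop: random lookups; recompute the skipped entries on the fly.
    for _ in range(n):
        j = integerify(x, n)
        vj = v[j // gap]
        for _ in range(j % gap):   # extra computation traded for saved memory
            vj = H(vj)
        x = H(bytes(a ^ b for a, b in zip(x, vj)))
    return x

seed = hashlib.sha256(b"demo").digest()
# Same result with half the scratchpad, paid for with extra hashing:
assert romix(seed, 1024, gap=1) == romix(seed, 1024, gap=2)
```

With gap=2 the stored table is half the size, and on average every second lookup costs one extra H call, which is exactly the compute-for-memory trade described above.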

I also suggest checking https://github.com/ckolivas/cgminer/blob/master/SCRYPT-README for a lot of information, which is intended to be user-comprehensible:
"--lookup-gap
This tunes a compromise between ram usage and performance. Performance peaks at a gap of 2, but increasing the gap can save you some GPU ram, but almost always at the cost of significant loss of hashrate. Setting lookup gap overrides the default of 2, but cgminer will use the --shaders value to choose a thread-concurrency if you haven't chosen one.
SUMMARY: Don't touch this"


Quote
The default Scrypt parameters (2^14, 8, 1) result in a 16MB max scratchpad size roughly 128x as "memory hard".
The LTC scrypt parameters are sufficient for making sure that GPUs are required to have a lot of high bandwidth memory for decent hashing speed. That's all that matters. Increasing the size of the scratchpad is not going to bring any improvements (if by improvements you mean making CPU mining more competitive). Actually, some scrypt based cryptocurrencies tried to make it more "memory hard" and failed to really fend off the GPUs. Also the 128x claim is just silly, because you are forgetting that bigger scratchpads also inevitably mean more arithmetic operations involved in a single hash calculation. As I mentioned earlier, it is the balance between the memory speed and the arithmetic calculations speed that is important. And the LTC scrypt somehow managed to get it right, even if this actually happened unintentionally.

Regarding the FPGA device in the picture at the start of this topic: it looks like it is going to have external memory bandwidth roughly similar to what is available to triple-channel DDR3 systems. This is still less than the memory bandwidth of a mid-range GDDR5 equipped GPU. I doubt that this FPGA device is capable of demonstrating any mind blowing hashing speed. Still, if it manages to scale well with the lookup-gap increase, and has low power consumption and/or low device cost, then it might possibly be competitive.

BTW, the appearance of competitive FPGA devices might make people more motivated to try better optimizing scrypt for AMD GPUs (squeeze every last bit of performance and/or reduce power consumption). Bring it on, this stuff may become fun again ;)
Topic: Re: Can the asic miners mine scrypt currencies ?
Posted by ssvb on 18/08/2013, 16:48:33 UTC
Interesting. So the GPU threads stall until the memory read is completed (given that for the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix).
Yes, and the GPU implements something like hyperthreading, but significantly beefed up (not just 2 virtual threads per core as in the CPU, but a lot more). A stalled GPU thread does not mean that the GPU ALU resources are idle, they are just allocated to executing some other threads.

Regarding bandwidth vs. latency: fortunately, reads done as 128-byte chunks are just perfect for SDRAM memory. SDRAM is generally optimized for large burst reads/writes to do cache line fills and evictions. And the size of cache lines in processors is roughly in the same ballpark (typically even smaller than 128 bytes). Using such large bursts means that the memory bandwidth can be fully utilized without any problems. And the latency can be hidden.
Quote
It makes sense for the huge number of threads available on GPU, but I wonder if this approach works with FPGA too (using external SDRAM). Using internal block RAM to hold the thread state (B/Bo) and switch threads while waiting for the SDRAM. Not sure that works actually. Food for thought, thanks.
There is a software optimization technique called pipelining, which is rather widely used. It makes it possible to fully hide the memory access latency for scrypt. In the Cell/BE miner (which was developed long before mtrlt's GPU miner) I was calculating 8 hashes at once per SPU core. These hashes were split into two groups of 4 hashes for pipelining purposes. So the second loop, where the addresses depend on previous calculations, looks like this:
Code:
dma request the initial four 128 byte chunks for the first group
dma request the initial four 128 byte chunks for the second group
loop {
    check dma transfer completion and do calculations for the first group
    dma request the next needed four 128 byte chunks for the first group
    check dma transfer completion and do calculations for the second group
    dma request the next needed four 128 byte chunks for the second group
}
The idea is that while the DMA transfer from the external memory to the local memory is in progress, we just do calculations for another group of hashes without blocking. The actual code for this loop is here: https://github.com/ssvb/cpuminer/blob/058795da62ba45f4/scrypt-cell-spu.c#L331. Cell in Playstation3 has enough memory bandwidth headroom (with its total ~25GB/s memory bandwidth) and is only limited by the performance of ALU computations done by 6 SPU cores (or 7 SPU cores with a hacked firmware). So there was no need to implement the scratchpad lookup gap compression for that particular hardware.

I believe that similar pipelining for hiding the latency of external DRAM accesses can also be easily implemented with an FPGA or ASIC. But the FPGA or ASIC still must have a lot of memory bandwidth even after the scratchpad size reduction, otherwise the external memory will become a performance bottleneck. Beating the GPUs equipped with fast GDDR5 is going to be a tough challenge.
Quote
PS ssvb, you have some very interesting threads linked in your post history. Thank you for posting here, I'm late to this party and this helps enormously.
Well, I have already been away from the party for a long time :)
Topic: Re: Can the asic miners mine scrypt currencies ?
Posted by ssvb on 16/08/2013, 20:09:41 UTC
However "LTC Scrypt" uses a mere 128KB of RAM.  It all occurs on the GPU die (which has more than enough register space and L2 cache to hold the scratch pad).  GPU memory latency to main memory (i.e. the 2GB of RAM on a graphics card) is incredibly long and the memory latency from GPU die off card to main memory is measured in fractional seconds.  Utterly useless for Scrypt.   If LTC required that to be used, a GPU would be far inferior to CPU with their 2MB+ of L2 and 6MB+ of L3 low latency cache.  "Luckily" the modified parameters selected for LTC use a tiny fraction (~1%) of what is recommended by the Scrypt author for memory hardness even in low security applications and roughly 1/6000th of what is recommended for high security applications.  It makes the scratchpad just small enough to fit inside a GPU and allow significant acceleration relative to a CPU.  

Try bumping the parameters up just a little: GPU performance falls off a cliff while CPU performance degrades far more gradually. It doesn't matter if you attempt this on a system with 16GB (or even 32GB) of main memory. You can even try using a 1GB vs 2GB graphics card with negligible change in performance. The small memory scratchpad ensures neither the GPU main memory nor the computer's main memory is used. The cache, inside the CPU die for CPU mining, or inside the GPU die for GPU mining, is what is used. Ever wonder why GPU accelerated password cracking programs don't include scrypt? The default parameters make the average GPU execution time <1 hash per second. Not a typo. Not 1 MH/s or 1 KH/s but <1 hash per second.

That is why "reaper" was so revolutionary, but only for the weakened version of Scrypt used by LTC. It requires much less memory, but still too much memory for a single SIMD unit, and GPU main memory has far too much latency. That makes LTC impossible to mine on a GPU, right? Well, people thought so for a year. Reaper used a workaround: by slaving multiple SIMD units together, it stores the scratchpad across the cache and registers of multiple SIMD units. Now this reduces the parallelism of the GPU (which is why a GPU is only up to 10x better than a CPU vs 100x better on SHA-256). The combined register/cache across multiple SIMD units is large enough to contain the Scrypt scratchpad. This wouldn't be possible at the default parameters (~20MB of low latency memory), but it certainly is possible at the reduced parameters used by LTC.
That's not how scrypt GPU mining works. You are implying that the GPU memory is not used at all, but this is bullshit (just try to downclock the GPU memory and see the effect yourself). You are implying that the memory latency is somehow important, but this is also bullshit. The memory bandwidth is the limiting factor. You are implying that only a single 128K scratchpad is used per whole GPU (or per SIMD unit), but this is also wrong. In fact thousands of hashes are calculated simultaneously and each one of them needs its own scratchpad (of configurable size and not necessarily 128K). You really have no idea what you are talking about.

About password hashing: that's a totally different application of the scrypt algorithm and has different requirements. To prevent password bruteforcing, you want the calculation of a single hash to be as slow as possible (within reasonable limits, so that verifying passwords does not become too slow). That's why the recommended scrypt parameters are set so high. Just to give you an example, let's imagine that the LTC scrypt parameters are used for hashing passwords. With a GPU you can easily have ~1000 kHash/s LTC scrypt performance, which means you can try 1000000 different passwords per second for bruteforcing purposes. And, for example, when using only lowercase letters and not really long passwords, it's a matter of just seconds or minutes to bruteforce it with such hashing speed. That's why the parameters used for LTC scrypt are not fit for password hashing. Check http://en.wikipedia.org/wiki/Password_strength for more information.
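The bruteforce arithmetic behind that example (assuming the ~1000 kHash/s GPU rate mentioned above and lowercase-only passwords):

```python
rate = 1_000_000                 # ~1000 kHash/s on a GPU, as per the example above
for length in (6, 7, 8):
    keyspace = 26 ** length      # lowercase letters only
    minutes = keyspace / rate / 60
    print(f"{length} chars: ~{minutes:.0f} minutes to exhaust the keyspace")
```

Even an 8-character lowercase password falls in a couple of days at that rate, which is why the recommended scrypt parameters for password hashing are orders of magnitude heavier.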

However, for mining purposes, making a single hash calculation as slow as possible is not a requirement. The absolute hashing speed is irrelevant. The difficulty is adjusted anyway, based on the total cryptocurrency network hashing speed. We just kinda care about the fairness between CPU/GPU/FPGA/ASIC, so that none of them gets a really huge advantage (normalized per device cost or transistor budget). And scrypt performance nicely depends both on the memory speed and on the speed of arithmetic calculations, doing a better job of levelling the difference than bitcoin's sha256.
Topic: Re: LTC miner optimizations for PowerPC (Power Mac) and Cell/BE (PlayStation 3)
Posted by ssvb on 21/02/2012, 00:10:57 UTC
I look forward to the performance numbers!  The previously much talked about miner (if it even exists) showed only marginal per-watt performance over CPUs.
There are better forum threads for speculating about the alleged performance numbers. The GPU miner is going to be released with full sources included when/if I get something ready. No other comments for now.
Topic: Re: LTC miner optimizations for PowerPC (Power Mac) and Cell/BE (PlayStation 3)
Posted by ssvb on 20/02/2012, 23:07:35 UTC
Finally I can post here :) Nice work ssvb making a miner for PS3, I hope we can tweak it together to make it even better :)
Sure, more improvements for the PS3 miner are definitely possible. I have pushed some of my old unfinished code, which implements parts of scrypt in SPU assembly (~5.4 khash/s -> 5.9 khash/s improvement per SPE core), to github. This is only an optimization for the first big loop; handling the second big loop in a similar way is expected to provide about the same speedup. This is just better instruction scheduling and keeping all data in registers, avoiding unnecessary spills. Your improvements focus on a different aspect: better data layout for less scattered writes and more unrolling to completely eliminate any stray stalls waiting for DMA completion (up to ~6.1 khash/s per SPE core). Combining both optimizations should provide quite good results; maybe even changing to handle 10 hashes at once (5+5) would be beneficial for the second loop. The theoretical peak performance per SPU core, based on counting 128-bit vector ADD/ROL/XOR operations, is ~7.35 khash/s (an optimistic estimate, not even taking the negligible SHA256 part and other overhead into account). Which means that we are already at >80% of the theoretical peak performance.

It's just that GPU miners are a bit more of a hot topic at the moment. During the last weekend I was busy installing a new graphics card and then doing some OpenCL coding with it. But I'm going to revisit the Cell/BE code after I get my GPU miner up and running ;)
Topic: Re: Thread about GPU-mining and Litecoin
Posted by ssvb on 17/02/2012, 09:35:28 UTC
If you want to believe me, then I can vouch for mtrlt's gpu miner being significantly more efficient than any current cpu miner for scrypt.

From what I know of the gpu miner, option 3 of modifying the scrypt parameters will have minimal impact. The pad size did not seem to matter much, and can be compressed, for lack of a better word, with on-the-fly value reconstruction. So any increase in pad size will have a relatively equal impact on cpu miners until you exceed their cache size, at which point gpus may become even more efficient.
Right now salsa20/8 is used as a core part of scrypt (8 rounds for the inner loop). Replacing it with something like salsa20/2 (just two rounds in the inner loop) would significantly improve performance on the CPU, because 4x fewer calculations would be involved. And the memory access pattern would remain the same, resulting in almost no improvement for the miners which depend on memory performance (GPU miners and also, to some extent, the Cell/BE miner). So what's the problem? The variants of salsa20 with a lower number of rounds are supposedly significantly less secure, at least in some cases:
http://fse2008.epfl.ch/docs/papers/day_3_sess_3/29_Lausanne_FSE08_camera_ready.pdf
http://www.ecrypt.eu.org/stream/papersdir/2007/010.pdf

But I don't know how exactly this all applies to scrypt, because cryptography is definitely not my forte. That's why I think that bringing up the issue to the scrypt author can make some sense. That is, after we get a better idea of realistic GPU performance.

Quote
I think you will be stuck with option 2, finding a completely different hashing algorithm.
Not in an attempt to troll the thread, but if you look at solidcoin's hash code, you will see it has random reads and writes that are of varying size, spread out over a large memory range, and are randomly aligned. These are key techniques in creating havoc with a gpu's memory access methods. I would suggest looking for code that has similar traits if you really want to defeat gpu's or at least keep them on a level playing field with cpus.
This is all nice. But can we be sure that these convoluted hash calculations can't be algorithmically optimized and reduced to something that can run orders of magnitude faster?
Topic: Re: Thread about GPU-mining and Litecoin
Posted by ssvb on 17/02/2012, 02:07:24 UTC
1) Figure out if GPU mining litecoins is indeed more efficient. And if so how much better is it.
I guess the only option is to just implement GPU miner and check how fast it runs. It is really surprising that so few people have actually tried this so far.

Quote
3) If we do want to switch, there are a ton of other questions. Can we modify scrypt params or do we need something totally different.
Maybe the scrypt author can be contacted and asked for his opinion?

Quote
How far away do we do the algorithm switch? How do we get miners/pools/clients ready for the switch so that there's no downtime?
This is actually interesting. If I understand it correctly, bitcoin itself does not rule out a possible change of hashing algorithm in the future (if the need arises). Attempting this for litecoin now could be treated as some kind of rehearsal and provide valuable experience.
Topic: Re: LTC miner optimizations for PowerPC (Power Mac) and Cell/BE (PlayStation 3)
Posted by ssvb on 16/02/2012, 23:34:06 UTC
The first feedback, and actually a patch for the Cell miner, sent to me as a private message: https://github.com/shakt1/cpuminer :)
But one problem is that shakti is a newly registered user and has no permissions to post anything here yet.

edit: here is his thread in the Newbies section: https://bitcointalk.org/index.php?topic=64222.msg753270#msg753270
If there are more people "on probation" (maybe newcomers from the playstation3 linux community), then it could be a good place for them to post some comments.
Topic: Re: MICROSOFT RUNS A LITECOIN NODE?
Posted by ssvb on 13/02/2012, 23:43:48 UTC
It's probably just some random Microsoft employee illegally abusing his office computer to dig some LTC :)
Topic: Re: artforz and coblee gpu mining litecoin since the start?
Posted by ssvb on 13/02/2012, 06:53:41 UTC
I've just approved a bitcoin bet on the existence of such a miner:

mtrlt has a fast LTC GPU miner

You are welcome to place your bets.
You can also PM me if any evidence surfaces either way.
If you change the bet to a generic "will a LTC miner capable of running 250 kh/s on a 6990 be proven to exist by March 1st 2012?" and allow anyone (not just CoinHunter or mtrlt) to provide the proof, then it could be a bit more fun.
Topic: Re: artforz and coblee gpu mining litecoin since the start?
Posted by ssvb on 11/02/2012, 14:12:01 UTC
Well, from talking with Ahimoth on btc-e, sounds like mtrlt managed to find a workaround for the issues I was having with scrypt miner kernels on GPU.
Short version... any of my kernels that got speeds like that also got 100% invalids. Yet the same kernel worked fine on CPU or if I dropped global and/or local worksizes to silly small levels... which of course made it dog slow again...
Do you have your scrypt GPU code available in some public repository? I could probably give it a try.

And I think it might be better to create a separate thread for scrypt GPU mining. This thread just does not feel appropriate; too many conspiracy theories and too much fanboyism.
Topic: Re: artforz and coblee gpu mining litecoin since the start?
Posted by ssvb on 10/02/2012, 22:31:58 UTC
Bandwidth has nothing to do w/ scrypt.  LATENCY does.  Which is why the amount of L1 cache is so important.
The L1 cache is just less important than you think :) For example, my scrypt miner optimizations for Cell do not use the 256KB of fast local memory at all. It is insufficient for the 4x unrolling which is needed in order to eliminate pipeline stalls, and at least half of the performance would be lost otherwise. But scrypt is not memory-heavy enough, so I can easily get away with working with the main memory and still have a lot of memory bandwidth headroom. LATENCY is not important in my case, because memory accesses are pipelined, get executed asynchronously and do not block execution. But you can check the scrypt_spu_core8 function in the code yourself.

If GPUs have excessive computational resources, then even waiting for memory a lot of time (80% or so per each execution core) is likely not a problem as long as all of them are competing for the precious memory bandwidth and fully saturating it. I did not think about GPU mining earlier just because I did not have any experience with GPU programming and honestly did not expect them to have that much memory bandwidth (more than 10x advantage over Cell).
Topic: Re: artforz and coblee gpu mining litecoin since the start?
Posted by ssvb on 10/02/2012, 21:48:41 UTC
I am no GPGPU expert, but I think ArtForz made some very good points in the following thread:
https://bitcointalk.org/index.php?topic=45849.0
CoinHunter could make his claims convincing by simply explaining how to address the GPU limitations outlined by ArtForz.
At least ArtForz was mistaken about Cell earlier Smiley

Let's just do some simple math. The Playstation 3 has 6 SPE cores, each clocked at 3.2GHz, and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance-critical part of salsa20/8, and these are executed in the even pipe (shuffles and the other instructions are executed in the odd pipe). Calculating one hash also needs 256KB of memory bandwidth (128KB is written sequentially, 128KB is read in scattered 128-byte chunks). So, taking into account that an SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB / 256KB ~= 95.4 khash/s. There is a lot of headroom in memory bandwidth, and arithmetic calculations are the bottleneck. Cell also has precise control over memory operations by scheduling DMA transfers, and can overlap DMA transfers with calculations, which allows the memory bandwidth to be utilized very efficiently for the scrypt algorithm.

This page seems to say that the HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% of peak BW on a GPU. Doing some math again, we get 320GB * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.
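The arithmetic above can be reproduced with a small back-of-envelope sketch. This is just an illustration using the numbers quoted in the post (434176 vector operations and 256KB of memory traffic per hash); the helper names are mine:

```c
/* Back-of-envelope hash rate limits from the numbers in the post. */

/* 6 SPEs at 3.2GHz, one even-pipe vector instruction per cycle,
 * ~434176 ADD/ROL/XOR vector operations per hash. */
static double cell_compute_khash(void)
{
    return 6.0 * 3.2e9 / 434176.0 / 1000.0;   /* ~44.2 khash/s */
}

/* 256KB of memory traffic per hash: 128KB written + 128KB read. */
static double khash_from_bandwidth(double bytes_per_sec)
{
    return bytes_per_sec / (256.0 * 1024.0) / 1000.0;
}

/* khash_from_bandwidth(25e9)        ~= 95.4 khash/s (Cell, 25GB/s)
 * khash_from_bandwidth(320e9 * 0.2) ~= 244  khash/s (HD 6990 at 20% of peak) */
```

Whichever of the two limits is lower is the expected bottleneck: computation on Cell, memory bandwidth efficiency on the GPU.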

edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)
Post
Topic
Board Altcoin Discussion
Re: Mining on gaming consoles
by
ssvb
on 23/01/2012, 01:35:37 UTC
I'm thinking of getting a PS3 just for litecoin mining.
For LTC mining, an OtherOS-capable PS3 is needed. And there is no chance of finding one of those in the shops nowadays. But some people may still have PS3 consoles with the older version of the firmware.
Post
Topic
Board Altcoin Discussion
Re: Scrypt based coins: Any proof that this algorithm is realy good?
by
ssvb
on 20/01/2012, 08:31:15 UTC
Coz I posted the code which could be simplified using NAND/NOR-optimizations.
Could it really be simplified? That's the main question.

Quote
Assembler edition is very hard to read. I wasn't going to show efficient SIMD implementation.
Yes, yes. We already know that you decided not to show it and to "close researches in this direction" instead. Assuming that the assembler edition was actually efficient and even existed in the first place...

Quote
Look at my function named Salsa(). Each cell from 16-vector uiA is XORed 4*2=8 times. My phrase was

"Imagine what will happen if someone get rid of redundant XORs of 16 temporary memory cells in SALSA..."

It's not about the posted code, it's about NAND/NOR-optimization.
So here we are again. Have you managed to do the alleged NAND/NOR-optimizations? If yes, then how many XORs do they actually eliminate?
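For reference, the XOR count being discussed comes straight from the standard Salsa20 round structure. A plain C sketch of the rounds (feed-forward addition of the input, which scrypt's Salsa20/8 performs at the end, is omitted for brevity) shows that each of the 16 state words is the XOR target exactly once per round, i.e. 8 times over 8 rounds:

```c
#include <stdint.h>

#define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Salsa20/8 rounds: 4 double rounds (column round + row round).
 * Every one of the 16 words is the XOR destination once per round. */
static void salsa20_8_rounds(uint32_t x[16])
{
    for (int round = 0; round < 8; round += 2) {
        /* column round */
        x[ 4] ^= ROTL(x[ 0] + x[12],  7); x[ 8] ^= ROTL(x[ 4] + x[ 0],  9);
        x[12] ^= ROTL(x[ 8] + x[ 4], 13); x[ 0] ^= ROTL(x[12] + x[ 8], 18);
        x[ 9] ^= ROTL(x[ 5] + x[ 1],  7); x[13] ^= ROTL(x[ 9] + x[ 5],  9);
        x[ 1] ^= ROTL(x[13] + x[ 9], 13); x[ 5] ^= ROTL(x[ 1] + x[13], 18);
        x[14] ^= ROTL(x[10] + x[ 6],  7); x[ 2] ^= ROTL(x[14] + x[10],  9);
        x[ 6] ^= ROTL(x[ 2] + x[14], 13); x[10] ^= ROTL(x[ 6] + x[ 2], 18);
        x[ 3] ^= ROTL(x[15] + x[11],  7); x[ 7] ^= ROTL(x[ 3] + x[15],  9);
        x[11] ^= ROTL(x[ 7] + x[ 3], 13); x[15] ^= ROTL(x[11] + x[ 7], 18);
        /* row round */
        x[ 1] ^= ROTL(x[ 0] + x[ 3],  7); x[ 2] ^= ROTL(x[ 1] + x[ 0],  9);
        x[ 3] ^= ROTL(x[ 2] + x[ 1], 13); x[ 0] ^= ROTL(x[ 3] + x[ 2], 18);
        x[ 6] ^= ROTL(x[ 5] + x[ 4],  7); x[ 7] ^= ROTL(x[ 6] + x[ 5],  9);
        x[ 4] ^= ROTL(x[ 7] + x[ 6], 13); x[ 5] ^= ROTL(x[ 4] + x[ 7], 18);
        x[11] ^= ROTL(x[10] + x[ 9],  7); x[ 8] ^= ROTL(x[11] + x[10],  9);
        x[ 9] ^= ROTL(x[ 8] + x[11], 13); x[10] ^= ROTL(x[ 9] + x[ 8], 18);
        x[12] ^= ROTL(x[15] + x[14],  7); x[13] ^= ROTL(x[12] + x[15],  9);
        x[14] ^= ROTL(x[13] + x[12], 13); x[15] ^= ROTL(x[14] + x[13], 18);
    }
}
```

Every XOR here feeds the next ADD in a dependency chain, which is exactly why "just removing redundant XORs" is a claim that needs to be demonstrated, not asserted.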

Quote
I didn't say that posted code eliminates any XORs.
You lost me here. What are you trying to prove by posting some piece of useless code?

And what makes you think that "Scrypt is good, but [1024, 1, 1] was a bad choice"?
Post
Topic
Board Altcoin Discussion
Re: Scrypt based coins: Any proof that this algorithm is realy good?
by
ssvb
on 19/01/2012, 21:38:31 UTC
Code I posted here is C++ edition of the algo (for better reading). CPU runs assembler code which looks completely different
So what was the purpose of dumping this garbage C++ edition here then? Why not use some pseudocode or intrinsics which would show the idea behind an efficient SIMD implementation of scrypt?

Quote
My implementation uses less calculations coz CPU process 3 instructions at once using a trick to avoid register stalls. So extra cycle between 2 others is necessary for "less calculations" and "sequential memory access".
Better instruction scheduling on superscalar processors to make use of triple issue is not doing "less calculations"; it is more like doing "the same calculations faster". That's an ordinary performance optimization, has nothing to do with the "quality of Scrypt" and does not help to "get rid of redundant XORs". I had hoped that you had something more impressive to show and was somewhat disappointed.

Quote
Btw, a CPU with 128K L1 per core is quite... uncommon and ur statement about 3 passes is wrong.
You have 3 loops looking like "for (i=0; i<32768; i+=16) { ... }" in your code. The first two of them walk over the buffer sequentially, one after another. And because the buffer is larger than the L1 cache ("a CPU with 128K L1 per core is quite... uncommon", as you have noticed), you get L1 cache misses in each of these loops. Merging the first two loops into one would reduce the number of L1 cache misses and also save some load/store instructions. This would actually make the code look more like the reference implementation.
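To illustrate the loop merging, here is a hedged sketch (block_mix is a hypothetical stand-in for the real Salsa20/8-based mixing, and the XOR with a constant stands in for whatever the second pass does to the buffer):

```c
#include <stdint.h>
#include <string.h>

#define N 1024
#define BLOCK_WORDS 32  /* 128-byte block as 32-bit words; N blocks = 128KB */

/* Hypothetical stand-in for the real Salsa20/8-based block mixing. */
static void block_mix(uint32_t x[BLOCK_WORDS])
{
    for (int w = 0; w < BLOCK_WORDS; w++)
        x[w] = x[w] * 2654435761u + (uint32_t)w;
}

/* Two sequential passes over the 128KB buffer V: a fill pass, then a
 * separate transform pass.  Each pass misses L1 on every block, because
 * the buffer is much larger than the L1 cache. */
static void two_pass(uint32_t V[N][BLOCK_WORDS], uint32_t X[BLOCK_WORDS])
{
    for (int i = 0; i < N; i++) {
        memcpy(V[i], X, sizeof V[i]);
        block_mix(X);
    }
    for (int i = 0; i < N; i++)
        for (int w = 0; w < BLOCK_WORDS; w++)
            V[i][w] ^= 0xdeadbeefu;  /* stand-in for the second pass */
}

/* Merged: the transform is applied while V[i] is still hot in L1,
 * removing one full cache-missing pass plus some loads and stores. */
static void one_pass(uint32_t V[N][BLOCK_WORDS], uint32_t X[BLOCK_WORDS])
{
    for (int i = 0; i < N; i++) {
        for (int w = 0; w < BLOCK_WORDS; w++)
            V[i][w] = X[w] ^ 0xdeadbeefu;
        block_mix(X);
    }
}
```

Both versions produce identical results; only the number of trips through the 128KB working set differs.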

The thing that actually *might* make some sense is to have the data buffers for 4 hashes interleaved. Then, in the first loop, you just run the SIMD code in exactly the same way as the scalar code, without the need for any shuffle instructions. However, this data layout becomes problematic in the last loop, because you need to access different memory areas for different hashes (the 'j' indexes are different for each of the 4 simultaneously processed hashes) and gather the scattered data into different lanes of SIMD registers. Introducing an intermediate loop which deinterleaves the data in addition to XORing it would fix the problem, but then the data needs to be interleaved again in the last loop. The disadvantage is the extra interleaving/deinterleaving overhead; the advantage is no need for shuffles and possibly better instruction scheduling. Is this what your SSE2 code is actually trying to do?
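A minimal sketch of that interleaved layout (the names interleave4/deinterleave4 are mine, not from the discussed code): word w of all four hashes sits in adjacent memory, so one 128-bit SIMD load picks up one lane per hash.

```c
#include <stdint.h>

/* interleaved[w*4 + lane] = separate[lane][w]
 * With this layout the first scrypt loop can run the scalar code pattern
 * unchanged on 4-wide vectors, with no shuffle instructions. */
static void interleave4(uint32_t dst[32 * 4], const uint32_t src[4][32])
{
    for (int w = 0; w < 32; w++)
        for (int lane = 0; lane < 4; lane++)
            dst[w * 4 + lane] = src[lane][w];
}

/* The intermediate pass would deinterleave back to per-hash buffers,
 * since the last loop needs an independent 'j' index per hash. */
static void deinterleave4(uint32_t dst[4][32], const uint32_t src[32 * 4])
{
    for (int w = 0; w < 32; w++)
        for (int lane = 0; lane < 4; lane++)
            dst[lane][w] = src[w * 4 + lane];
}
```

The round trip is lossless; the cost is the extra copy passes, which is exactly the overhead/benefit trade-off described above.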
Post
Topic
Board Altcoin Discussion
Re: Scrypt based coins: Any proof that this algorithm is realy good?
by
ssvb
on 19/01/2012, 19:13:34 UTC
OK, thanks for the clarifications. Now we can see that it was just scaremongering and/or handwaving. So much for the "do less calculations" and "sequential memory access" promises. In reality, the demonstrated code snippet has more XOR operations and makes three separate passes over the 128K buffer instead of two (which means more L1 cache misses, unless the hardware prefetcher saves the day). The pointless replacement of left rotations with right rotations also looks amusing.
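On the rotation point: a 32-bit left rotation by n is identical to a right rotation by 32-n, so swapping one for the other cannot change the operation count or the result. A quick sketch:

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));  /* valid for n in 1..31 */
}

static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));  /* valid for n in 1..31 */
}

/* rotl32(x, n) == rotr32(x, 32 - n) for all n in 1..31, so rewriting
 * Salsa20's left rotations as right rotations is a cosmetic change:
 * same single rotate per operation, same dependency chain. */
```

On hardware with a native rotate instruction both directions compile to one instruction anyway.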