Showing 20 of 52 results by Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 21/01/2012, 20:44:43 UTC
I'm checking back in after being gone for so long...  I just downloaded the 2.6 SDK and it destroys my optimization...  I will see if there is anything I can do without completely rewriting.  Stay tuned and I should have more info later this week.

P.S. Thanks to everyone who has donated to me in the past. I have been busy lately, but I have not forgotten.

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 01/10/2011, 22:29:26 UTC
first you shocked me with TWO double hashes

but ~3375 integer operations per hash is just perfect

edit: did you mean 3385??

It's actually closer to 3375 because some VLIW5 instructions only have 4 operations in them.  I can get a more exact number if needed, but it's kind of a PITA because AMD's software won't actually tell you outright.
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 29/09/2011, 15:21:15 UTC
are the 1354 ALU OPs for a single SHA256 or for double? as in SHA256(SHA256(Block_Header))

http://bitcoin.stackexchange.com/questions/1293/how-many-integer-operations-on-a-gpu-are-necessary-for-one-hash

network speed guinness world record
https://bitcointalk.org/index.php?topic=38064.0

1354 OPs are for two double hashes.

SHA256(SHA256(Block_Header1)), SHA256(SHA256(Block_Header2))

so, 677 per double hash.

Although these aren't completely full hashes, since the first and last few rounds (a few %) have been optimized out.

Also, each ALU OP is a VLIW5 (very long instruction word) instruction which contains 5 integer operations that run simultaneously, so, depending on how you think about it, it could be ~3375 integer operations or 677 VLIW5 instructions.

Hope this helps, let me know if you need any more help with this.  I am interested in how this turns out.
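
The arithmetic above can be written out as a small C sketch. The constant names are made up for illustration; the figures are the ones quoted in this post. Note that 677 x 5 is 3385, which is only an upper bound: the ~3375 figure is lower because not every VLIW5 bundle has all five slots filled.

```c
#include <stdint.h>

/* Figures quoted in the post; the names are illustrative only. */
enum { ALU_OPS_TWO_DOUBLE_HASHES = 1354, VLIW5_SLOTS = 5 };

/* VLIW5 instructions needed for one double hash. */
uint32_t vliw_per_double_hash(void) {
    return ALU_OPS_TWO_DOUBLE_HASHES / 2;
}

/* Upper bound on integer operations per double hash,
 * assuming every VLIW5 bundle fills all five slots. */
uint32_t max_int_ops_per_double_hash(void) {
    return vliw_per_double_hash() * VLIW5_SLOTS;
}
```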
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 18/08/2011, 17:07:53 UTC
I'm getting hardware errors on phatk 2.2, didn't get them on diapolo's or 2.1

the three are about indistinguishable in terms of speed for me

Are you using BFI_INT?  If not, there is a bug in the 2.2 kernel: Vince found that in the kernel.cl file, you have to replace

Code:
#define Ch(x, y, z) bitselect(x,y,z)

on line 78 with

Code:
#define Ch(x, y, z) bitselect(z, y, x)

I haven't gotten around to releasing a new version, but if you make the change yourself, it should fix it.
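
For anyone wondering why the operand order matters, here is a minimal host-side C sketch. bitselect32 is a hypothetical stand-in that emulates OpenCL's bitselect(a, b, c), which takes each bit from b where c has a 1 and from a otherwise; ch_reference is the usual SHA-256 definition of Ch. It is only meant to show that the swapped order matches Ch, not to replace anything in kernel.cl.

```c
#include <stdint.h>

/* Emulates OpenCL bitselect(a, b, c): b's bit where c is 1, else a's bit. */
uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* SHA-256 Ch: each bit of x chooses between y (x = 1) and z (x = 0). */
uint32_t ch_reference(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (~x & z);
}

/* Corrected macro body: x must be the selector, so it goes last. */
uint32_t ch_fixed(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect32(z, y, x);
}

/* Original 2.2 macro body: z ends up as the selector, which is not Ch. */
uint32_t ch_buggy(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect32(x, y, z);
}
```

With x = 0xFFFF0000, y = 0x12345678, z = 0x9ABCDEF0 the reference and the fixed version both give 0x1234DEF0, while the buggy order gives a different value.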

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 17/08/2011, 05:51:08 UTC
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However I don't want multiple phatk kernels so just replace the current one in-situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
Very good work. Nice of you to figure out how to do git and all as well. Don't worry about the merge, I've taken care of everything and cherry picked your changes as I needed to. I've modified a few things too to be consistent with cgminer's code and there is definitely a significant speed advantage thanks to your changes. Note that if you're ever working on git doing your own changes, do them to a branch that's not called master as you may end up making it impossible to pull back my changes since I won't necessarily take all your code. Thanks again, and I'm sure the cgminer users will be most grateful.


Ah, that's how that works... good to know.  This whole git seems really useful for working together.  Thanks

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 17/08/2011, 04:13:55 UTC
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However I don't want multiple phatk kernels so just replace the current one in-situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 16/08/2011, 17:49:36 UTC
Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass..


Yeah, mingw is most certainly a giant PITA.

To compile cgminer with mingw, the trick is to use msys and get pkg-config and libcurl installed properly

For pkg-config, the best is to install this: http://ftp.gnome.org/pub/gnome/binaries/win32/gtk+/2.22/gtk+-bundle_2.22.1-20101227_win32.zip

Once you have that, libcurl is rather easy.

Quote
trying a full cygwin install next...

Mmmh. Not sure this'll get you very far.

If your main dev box is windows and your goal is to integrate
phatk into cgminer, your best bet is probably to install a small
virtual machine (qemu or vmplayer) running ubuntu inside your
windows box and work on cgminer directly on Linux in there.

That's exactly what I do (the other way round) when I have to
try windows-specific things or a piece of code.


Yeah, I think I will stay away from using the mingw environment from now on... Cygwin was easy as pie.  No issues, and I think I can cross-compile from cygwin using mingw if I want native Win32 support.  Apparently, getting pkg-config (I think) working without POSIX support is terrible.  I got my kernel working around 5am last night linking against the cygwin dlls.. so tonight I will release the changes when I get home.

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
You could just tell me what to do to interface it with cgminer (i.e. what new variables you want) and I'd copy most of your kernel across. Only the return code and define macros are actually different in cgminer in the kernel itself.

Yeah, if you want, I can send you the changes tonight so you can put it in your release.  The only modifications I had to make to the kernel were changing VECTORS to VECTORS2, hardcoding OUTPUT_SIZE = 4095 and hardcoding WORKSIZE=256 (I really do need this passed to the kernel though).  Also, my kernel only uses WORKSIZE+1 entries in the buffer, so it would be better if you made the buffer that size.

As for the changes in the miner, I think I only had to change the precalc_hash() function, the kernel input and output file names, and the queue_phatk_kernel() function.
What I will do tonight is add KL_PHATK_2_2 to the cl_kernel enum, copy the function code, add the corresponding command line argument (right now I have just replaced PHATK with mine) and add -DWORKSIZE= arguments for the kernel.

Anyway, I will give you more details tonight when I am in front of my code.
My fork is https://github.com/Phateus/cgminer, I will upload the changes tonight (as soon as I figure out git... never used that before)
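
As a rough idea of what passing WORKSIZE to the kernel could look like on the host side, here is a small C sketch. format_kernel_options is a hypothetical helper, not a cgminer function, and the exact set of defines is an assumption based on the values mentioned above; the resulting string would be handed to the OpenCL compiler as its build options.

```c
#include <stdio.h>

/* Hypothetical helper (not part of cgminer): formats the preprocessor
 * defines for the kernel build. Returns the length written, or -1 if
 * the buffer is too small or formatting failed. */
int format_kernel_options(char *buf, size_t len,
                          unsigned worksize, unsigned output_size) {
    int n = snprintf(buf, len,
                     "-D WORKSIZE=%u -D VECTORS2 -D OUTPUT_SIZE=%u",
                     worksize, output_size);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Called with 256 and 4095, this would produce "-D WORKSIZE=256 -D VECTORS2 -D OUTPUT_SIZE=4095", so the kernel no longer needs those two values hardcoded.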

-Phateus

P.S. thanks for the easy to read code
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 16/08/2011, 02:35:42 UTC
Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 15/08/2011, 17:54:59 UTC
It seems like your latest kernel and mine have problems if BFI_INT gets forced off via (BFI_INT=false) ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit and solved, non BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Awesome, thank you!  I was under the assumption that BFI_INT and bitselect were the same operation; apparently, the operand order is different.  I will fix it in my next release.

Thank you everyone for your support (both in BTC and discussion).

I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.

@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance.  2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said... there is some weird issue which makes openCL slower behind the scenes.  My best guess is that it has to do with register allocation.

The GPU has a total of 256x32x4 registers (8192 UINT4).  At the most, there are 256 threads per workgroup (8192/256 = 32 registers per thread).  Using VECTORS, the number of registers is far below this number, therefore the hardware can operate on the maximum allowable threads at a time.  However, when you compile with VECTORS4, there are more than 32 registers per thread, so OpenCL must determine how to allocate the threads, and the utilization of the video card is sub-optimal.  Below is a diagram of what I think is going on.


4 thread groups running simultaneously VECTORS (2 running at a time)
[1111111122222222]
[3333333344444444]

using an optimal version of VECTORS4, it would look much like this (double the work is done per thread)
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]

now making it use slightly less resources will make it slower because the threads are out of sync and there will be overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]

Now, I may be waaaaay off here, but something like this is what makes sense to me.  Especially, since this would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).
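
The register budget behind this guess can be sketched in C. This treats the register file as one simple pool of 8192 UINT4 registers, which is a simplification of the guess above and not a description of how the hardware or the OpenCL runtime actually schedules threads; the function names are made up for illustration.

```c
#include <stdint.h>

/* 256 x 32 x 4 scalar registers, counted as UINT4 registers. */
enum { TOTAL_UINT4_REGS = (256 * 32 * 4) / 4 };

/* Registers available to each thread if `threads` run at once. */
uint32_t regs_per_thread(uint32_t threads) {
    return TOTAL_UINT4_REGS / threads;
}

/* Threads that fit at once if each one needs `regs_needed` registers. */
uint32_t resident_threads(uint32_t regs_needed) {
    return TOTAL_UINT4_REGS / regs_needed;
}
```

With 256 threads each one gets 32 registers. If a kernel needed, say, 40 registers per thread (a made-up number), only 204 threads would fit, which is no longer a clean multiple of the workgroup size.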

Anyway, enough of my off-topic analysis...


I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).


Looking forward to this !!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Out of curiosity, have you looked into trying to code a version
directly in AMD's assembly language and bypassing OpenCL entirely ?
(I'm thinking: since we're already patching the ELF output, this seems
like the logical next step)

Also, have you looked at AMD CAL ? I know this is what ufasoft's miner
uses (https://bitcointalk.org/index.php?topic=3486.500), and also what
zorinaq considers the most efficient way to access AMD hardware (somewhere
on http://blog.zorinaq.com)



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done the GPU is already at >99% of the theoretical maximum throughput. (ALU packing) And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Agreed, the kernel itself is pretty optimal.  I might look into calling lower level CAL functions to manage the (OpenCL compiled) GPU threads (instead of using openCL), but I doubt this will give any speedup (although, I might be able to reduce the CPU overhead).
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 12/08/2011, 20:01:13 UTC
I took a look at the comparison between version 2.2 and version 2.1
could it be the __constant uint ConstW[128] change that broke VECTORS4?

That change is inconsequential (I was trying some things that required the change but did not keep them).. the compiler doesn't use those values, so the code should be exactly the same doing it either way (you can try and replace the code with the old code if you want to check).

You keep saying that it is broken.. if it does not run, post the errors.

I have found that on my card, VECTORS4 is much slower in version 2.2 than 2.1, but this is not a bug... it seems to be because openCL does not like allocating that many registers... Version 2.1 uses around 99.7% of instruction slots with VECTORS4 and I have tried many many ways to make it faster and more reliable (in 2.1), but I have given up on it.  It is still in the release because I don't see any point in taking it out...  but getting 2.2 to run as fast as 2.1 with VECTORS4 is not going to happen.  Also, the differences between 2.1 and 2.2 with VECTORS are very tiny anyway (less than .5%)...

Getting into more detail about it: If you look at the graph on the main page of the thread, you can see the graph of VECTORS4 in version 2.1... in version 2.2 for some reason, the spike (and corresponding valley) is located higher (somewhere around 500), this could mean that it would be just as fast if you had 1500 Mhz memory, but I have no idea why openCL reacts this way to changing the memory speed.  There are way too many GPU architecture/GPU bios/PCIe bus/CPU-GPU transfer/driver/openCL implementation unknowns to try to predict this behavior.


-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 12/08/2011, 18:30:17 UTC
Sent another donation your way.  Look forward to your work on cgminer.

Thanks
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 12/08/2011, 17:53:22 UTC


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

Would definitely be interested in a cgminer fork.  Don't get me wrong, phoenix is great and has always given me the best performance overall but it does lack some of the more refined features, which the other poster listed above.  Failover and nice static but updated command line "UI".  Seems like you and diapolo are hitting the ceiling with phoenix anyway.

I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).

We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Now that doesn't mean that there is NO room for improvement, just that any other improvement will probably have to be faster hardware, a more efficient implementation of openCL by AMD or figuring out a better way to finagle the current openCL implementation to reduce the implementation overhead.  But, unless there is a problem with pyopenCL, c and python should give equivalent speeds as long as they are just calling the openCL interface (the actual miner uses negligible resources).  I suppose it could be possible to access the hardware drivers directly and run the kernel that way... but I don't see that as being feasible.

But, with all of that said, I have looked through some of his code, and it is some really clean code.  Part of the reason I want to add these features is to learn more python (this is the first thing I have programmed in python), but it probably will just be easier modifying the cgminer code.  Thanks for pointing out cgminer to me
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 11/08/2011, 16:50:32 UTC
As of version 2.1, phatk now has command line option "VECTORS4" which can be used instead of "VECTORS".
This option works on 4 nonces per thread instead of 2 and may increase speed mainly if you do not underclock your memory, but feel free to try it out.  Note that if you use this, you will more than likely have to decrease your WORKSIZE to 128 or 64.

I'm using a 6770 @ 1.01Ghz with phatk 2.2.  When I run the memory clock at 300Mhz with the VECTORS option, I get 234.5Mhps.  However, I can't seem to reap the benefits of VECTORS2 or VECTORS4 at a higher memory clock (i.e. 1.2Ghz).  I've reduced the WORKSIZE from 256 to 128 and 64 and can only seem to peak at 213Mhps.  With these options, I can only achieve between 204 and 213 Mhps.

I have found that VECTORS4 is extremely unreliable... even tiny changes in the kernel and other factors affect the hashrate tremendously...  OpenCL gets really weird when you use a lot of registers.  I added it in 2.1 because it was comparable to VECTORS in some situations, but changing the kernel slightly in 2.2 seems to have broken it (even though KernelAnalyzer says it uses fewer registers and fewer ALU ops... *sigh*)

Anyone wondering about any new kernel improvements, I seem to be at a standstill... I have tried the following:
  • Removing all control flow operations (about 1MH/s slower)
  • Sending all kernel arguments in a buffer (about 1MH/s slower)
  • Using an atomic counter for the output so that the output buffer is written sequentially (about the same speed and only works on ATI xxx cards and newer)
  • Using an internal loop in the kernel to process multiple nonces (Either significantly slower or massive desktop lag)
  • Calling set_arg only once per getwork instead of once per kernel call (only faster when using very low aggression and FASTLOOP, I will add this to my next kernel release)

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 11/08/2011, 16:33:14 UTC
Just did a test:

Rig setup:
  Linuxcoin v0.2b (Linux version 2.6.38-2-amd64)
  Dual HD5970 (4 GPU cores in the rig)
  Mem clock @ 300Mhz
  Core clock @ 800Mhz
  VCore @ 1.125v
  AMD SDK 2.5
  Phoenix r100
  Phatk v2.2
  -v -k phatk BFI_INT VECTORS WORKSIZE=256 AGGRESSION=11 FASTLOOP=false

Result:
  Overall Rig rate: 1484 MH/s
  Rate per core: 371 MH/s

This is ~4MH/s faster than Diapolo's latest.

On 5970, phatk 2.2 is current king of the hill.

For the world to be perfect, this kernel needs to be integrated into cgminer



The last kernel releases show that it is a bit of trial and error to find THE perfect kernel for a specific setup. Phateus and I try to use the KernelAnalyzer and our setups as a first measurement of whether a new kernel got "faster". But there are many different factors that come into play, like OS, driver, SDK, miner software and so on.

I would suggest that we should try to create a kernel which is based on the same kernel parameters for phatk and phatk-Diapolo so that the users are free to choose which kernel is used. One thing is that the CGMINER kernel uses the switch VECTORS2, where Phoenix used only VECTORS (which I changed to VECTORS2 in my last kernel releases). It doesn't even matter whether the same variable names are used in the kernel (in fact they are different sometimes) as long as the main miner software passes the expected values in a defined sequence to the kernel.

Dia

A good idea.

A further improvement: I'd like to have an option in my miner that spends ~2mn
benchmarking all the kernels available in the current directory (without talking to
a pool, i.e. doing pure SHA256 on bogus nonces), and picking the fastest for the
current rig.

For people with lots of different rigs/setups, that would save them the headache
of having to hand-tune each instance.


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 10/08/2011, 16:21:30 UTC
Why not make two separate kernels then?

VECTORS4 might one day be the better alternative, instead of doing all that work then why not start now and keep pace?



Because I have literally put in over 100 hours on the main kernel and have gotten almost nothing in donations.  I just don't have the time to keep up with two kernels.  If anyone feels like making a VECTORS4 branch, go for it... the source code is in the public domain and you can use it however you'd like.

Also, from what I've gathered, there may be only 1 or 2 people interested in it... If you can lower your memory speed, I think VECTORS will always be faster than VECTORS4.

Now, I do like hearing feedback from everyone. I am just letting you know that it is not feasible to optimize the kernel for every possible configuration (SDK 2.1, 2.4, slow memory).  Right now, the kernel is optimized for SDK 2.5 and the 68xx and 5xxx cards and assuming you pick the best memory clock speed for your card (somewhere around 1/3 of your core clock).

-Phateus
the thing is, VECTORS4 worked perfectly for me in version 2.1
in version 2.2 it's broken

As in it doesn't work at all, or that it is much slower?... Just use version 2.1 then
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 09/08/2011, 22:43:16 UTC
Why not make two separate kernels then?

VECTORS4 might one day be the better alternative, instead of doing all that work then why not start now and keep pace?



Because I have literally put in over 100 hours on the main kernel and have gotten almost nothing in donations.  I just don't have the time to keep up with two kernels.  If anyone feels like making a VECTORS4 branch, go for it... the source code is in the public domain and you can use it however you'd like.

Also, from what I've gathered, there may be only 1 or 2 people interested in it... If you can lower your memory speed, I think VECTORS will always be faster than VECTORS4.

Now, I do like hearing feedback from everyone. I am just letting you know that it is not feasible to optimize the kernel for every possible configuration (SDK 2.1, 2.4, slow memory).  Right now, the kernel is optimized for SDK 2.5 and the 68xx and 5xxx cards and assuming you pick the best memory clock speed for your card (somewhere around 1/3 of your core clock).

-Phateus
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 09/08/2011, 20:33:43 UTC
I found that VECTORS4 option does not work for version 2.2



I optimize the code for VECTORS, so probably making it faster in 2.2 made VECTORS4 slower.  I can't really optimize the kernel for both, so I would just stick with version 2.1 if that is faster for you.

And everyone, thanks for your support, every little bit helps
Post
Topic
Board Mining software (miners)
Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04
by
Phateus
on 09/08/2011, 20:20:46 UTC


1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)


I don't claim to understand this, but step (1) should be an OR, not an AND.


Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel only crashes the display driver THAT hard, I get a Bluescreen everytime ... weird.

Code:
// Round 124
Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

...

// lo 16 Bits OR hi 16 Bits
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive & 0xFF00U);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive & 0xF0U);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive & 0xCU);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive & 0x2U));

// nonce AND positive
uint position = W_3.x & positive;
// lo 16 Bits XOR hi 16 Bits
position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
position = (position & 0x00FFU) | (position & 0xFF00U);

output[position] = W_3.x;

Dia

You need to shift the bits for each stage:

For example, ORing the top bits into the bottom bits should be:

Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);
or just:
Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
because the upper 16 bits will already be 0 after the shift.

Otherwise, you will just get the original Vals[7] value.
If you want to do it that way, the code would be:
Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive >> 8);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive >> 4);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive >> 2);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive >> 1));

However, similar to what I said earlier, the following code does the same thing:
Code:
uint positive = 0xFFFFFFFFU + min(Vals[7].x, 1u);
if Vals[7].x == 0, then min(Vals[7].x, 1u) == 0, otherwise it equals 1
0xFFFFFFFF + 0 = 0xFFFFFFFF
0xFFFFFFFF + 1 = 0


oh yeah...  you are getting blue screens because your address would be a random 32 bit number and it was probably trying to access memory that your video card doesn't have
Post
Topic
Board Mining software (miners)
Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04
by
Phateus
on 08/08/2011, 22:43:00 UTC
The steps:

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or fewer nonces are found per execution all of them should be returned in most cases.


Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps  with:
Code:
Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
if you are trying to remove all control flow?  Any invalid nonce will be written into OUTPUT[0] and the valid nonces will be randomly distributed through the rest of the array.

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...
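
To make the indexing concrete, here is a small C sketch of the branch-free index calculation, run on the host rather than in the kernel. OUTPUT_MASK is set to 0xFF purely as an assumed example value, and output_index is a hypothetical name; with that mask the buffer would need OUTPUT_MASK + 2 entries, since slot 0 is reserved for the invalid nonces.

```c
#include <stdint.h>

#define OUTPUT_MASK 0xFFU  /* assumed example value, not from the kernel */

/* Slot 0 for an invalid nonce (h != 0), slots 1..OUTPUT_MASK+1 for a
 * valid one (h == 0), computed without any branch in the index math. */
uint32_t output_index(uint32_t h, uint32_t nonce) {
    uint32_t valid = 1U - (h < 1U ? h : 1U);  /* 1 - min(h, 1u) */
    return ((nonce & OUTPUT_MASK) + 1U) * valid;
}
```

For example, a valid nonce 0x12345678 lands in slot 0x79, while the same nonce with a non-zero H lands in slot 0.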

Also, it is easy enough to make it work with VECTORS:

Code:
Valid = 1 - min(min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x + (1 - min(H.y, 1u));
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
(or you could just double the code for .x and .y)

OR
Code:
Valid = 1 - min(min(H.x, H.y), 1u);
//Always store the .x nonce.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
and have the __init__ file check both Nonce and Nonce+1


another way of doing it would be (the compiler should replace the if statement with a set conditional):
Code:
Nonce = W[3];
Position = W[3] & OUTPUT_MASK;
if(H)
   Position = OUTPUT_MASK + 1;
//Invalid nonces are at the last position of the array, valid are distributed at the front
OUTPUT[Position] = Nonce;

Slightly faster would be to set Position to the local thread ID (since you save an &) and make sure that the size of the output array is WORKSIZE + 1:
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position = WORKSIZE;   //last slot of the WORKSIZE + 1 array
OUTPUT[Position] = Nonce;

EDIT:  Ooh, just thought of something else: 

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position += WORKSIZE;
OUTPUT[Position] = Nonce;
Then all of the threads in a workgroup will write to a different address.  The valid nonces will be in the first half, and the invalid will be in the second.
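
A host-side C sketch of this split-buffer layout, assuming WORKSIZE is 256 and a buffer of 2*WORKSIZE entries. split_position and store_nonce are hypothetical names, and local_id stands in for get_local_id(0).

```c
#include <stdint.h>

#define WORKSIZE 256U  /* assumed workgroup size */

/* Valid nonces (h == 0) go to slot local_id in the first half,
 * invalid ones to the matching slot in the second half. */
uint32_t split_position(uint32_t h, uint32_t local_id) {
    uint32_t position = local_id;
    if (h)
        position += WORKSIZE;
    return position;
}

/* Simulates one thread's write; output must hold 2 * WORKSIZE entries. */
void store_nonce(uint32_t *output, uint32_t h,
                 uint32_t local_id, uint32_t nonce) {
    output[split_position(h, local_id)] = nonce;
}
```

Since no two threads in a workgroup share a local ID, no two of them ever write to the same slot, and the host only has to look at the first WORKSIZE entries.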

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...
Post
Topic
Board Mining software (miners)
Re: Modified Kernel for Phoenix 1.5
by
Phateus
on 08/08/2011, 18:17:18 UTC
No update yet? It's aug 8 now <.<

Just posted the new version, enjoy.