If you didn't build the whole thing yourself, that is: check out the line that has, I believe, a "224u" in it, added onto the hash memory size. That 224 is padding toward a 256-byte chunk, so just comment it out and add + 1 instead. How stable is it? I've only run it for an hour... but if it gains ~2-5 hashes for me, imagine what it may do for multiple threads not working on the basis of bits.
const size_t perThread = hashMemSize + 1u; // was + 224u; seems to be a speed improvement over the padding, possibly a bit less stable / more shares.
Try a proper test on a speedier video card with that: before-fix and after-fix results, and of course, will it stay running for an hour, or 300?
It's in the auto config... source. (Rough sketch of that kind of code below.)
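To give a rough idea of where that constant sits, here's a minimal sketch of an auto-config style calculation. The function and variable names (autoIntensity, freeMem, reserved) and the 128 MB reserve are my own illustrations, not the miner's actual source, but the shape is the same: the per-thread size divides into free memory, so a smaller perThread can let one more thread/intensity unit fit on a memory-tight card.

#include <cstddef>

// Illustrative sketch only: names and the reserve value are assumptions, not real miner code.
size_t autoIntensity(size_t freeMem, size_t hashMemSize)
{
    const size_t reserved  = 128u * 1024u * 1024u; // leave some VRAM for the driver/display
    const size_t perThread = hashMemSize + 224u;   // scratchpad plus per-thread metadata
    if (freeMem <= reserved) {
        return 0;
    }
    // Each unit of intensity costs one perThread block; a smaller perThread lets more fit.
    return (freeMem - reserved) / perThread;
}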
Are you referring to xmrig?
Yeah, sorry, I didn't specify; it's pretty late here. And if it's not added as + 1 at all, it would fit in a 32-byte address, so nothing would be wasted; that should be an 8-times increase, at least in memory speed, if it fits without the + 1. And if there is some need to keep 256 bytes for a video card's memory, I'll understand, but I'd like to know why.
But if you made your own, and it isn't forked from them or anything, is that roughly how you have it? Maybe 32 or 64?
That is not alignment; that is metadata size being added. You probably got better results because adding 1 instead of 224 leaves you more free mem.

An unsigned int can be at most 0xffffffff, so let's take, for example, normal CN memory: it's 2 MB (0x200000, or 2097152 in decimal). Adding 224 gives 2097376, and that is far away from the unsigned int max (quick check below).

Oh, and you are probably talking about bits (32-bit, 256-bit, etc.), not bytes.
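Quick check of that arithmetic, just hard-coding the numbers from above (the exact values are only there to illustrate; 2 MB is the standard CryptoNight scratchpad):

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t cnMem  = 0x200000u;    // 2 MB CryptoNight scratchpad
    const uint32_t padded = cnMem + 224u; // 2097376
    std::printf("padded = %u, headroom to UINT32_MAX = %u\n",
                padded, UINT32_MAX - padded);
    std::printf("256 bits = %u bytes\n", 256u / 8u); // 32 bytes, not 256
    return 0;
}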
I tried +1, and nothing changes on an RX 580 8GB; I get the same hashrate.

Of course, I may be wrong.

It is bits in machine code, but not for the memory chunks being grabbed. Anyway, I did not know it took exactly 2 MB for CN; I've only seen in readme files how much memory a thread takes to add on. I'm not really talking about how big it can go; it should be how small it can go, and how quickly it can get generated/solved across the whole board. I'm pretty honest here: I do not know that much about the actual algorithm, but turning things into more workable chunks, whatever the size, can cut the time for things to complete, depending on their assignment, and that isn't in just one place in the code.

But after changing that, did you get any errors in the results, or notice any rise in memory availability? And yeah, even far from a normal int's max, I guess I'm just used to asm and C, and to using the smallest forms of memory possible for any code improvement. I always use int8_t or uint16_t from the stdint.h header; C is very unforgiving with memory, but kind of rewarding as well.
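And since I mentioned it, a tiny example of what I mean by sticking to the fixed-width types from stdint.h; the structs and their fields are made up purely for illustration, nothing to do with the miner itself:

#include <cstdint>
#include <cstdio>

// Hypothetical records, only to show the size difference fixed-width types can make.
struct PackedRecord {
    uint16_t length; // 2 bytes, enough for 0..65535
    uint8_t  flags;  // 1 byte instead of an int-sized flag word
    int8_t   delta;  // signed single-byte value
};

struct LooseRecord {
    int length;
    int flags;
    int delta;
};

int main()
{
    std::printf("packed: %zu bytes, loose: %zu bytes\n",
                sizeof(PackedRecord), sizeof(LooseRecord)); // typically 4 vs 12
    return 0;
}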