I'm still green when it comes to implementing these brute force cores, but I'm picking up.
I was browsing through your code and one thing struck me; you are using the 256 bit "data" input both for setting the internal state of the first sha round and as the end part of whats hashed in the first round.
In the Icarus and Open-Source-FPGA-Bitcoin-Miner code they have separate 256 bit init state and 96 bits of "data" that is appended to the nonce.
I can't quite work out what those 96 bits are. Bit 64 to 127 of the header. Reversed or not, it doesnt make much sense to me. It should at least not be the same as the init of the hash round.