I decided to do some more optimization before posting this, and it turns out I was right. I removed all the code I could determine to be unneeded with 2^20 bins, and now I'm getting a ~10% speed improvement. That's about 160 sol/s with 3x R9 380 4GB and 2x R7 370 2GB:
Total 160.7 sol/s [dev0 34.8, dev1 29.0, dev2 28.8, dev3 32.8, dev4 36.5] 5 shares
Total 160.9 sol/s [dev0 34.9, dev1 29.5, dev2 28.9, dev3 32.1, dev4 36.3] 5 shares
edit: After more testing I'd say the speed improvement is more like 15%.
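For anyone who wants to sanity-check the numbers, here is a rough Python sketch that pulls the total out of the two status lines above and works out what the pre-optimization rate would have been for a 10% and a 15% gain. The helper names (`parse_total`, `implied_baseline`) are made up for illustration and are not part of the miner, and the baseline figures are only implied by the percentages, not measured.

```python
import re

# The two status lines pasted above, copied verbatim.
LOG_LINES = [
    "Total 160.7 sol/s [dev0 34.8, dev1 29.0, dev2 28.8, dev3 32.8, dev4 36.5] 5 shares",
    "Total 160.9 sol/s [dev0 34.9, dev1 29.5, dev2 28.9, dev3 32.1, dev4 36.3] 5 shares",
]


def parse_total(line):
    """Return the reported total sol/s from one status line (hypothetical helper)."""
    match = re.match(r"Total\s+([\d.]+)\s+sol/s", line)
    if match is None:
        raise ValueError("not a status line: " + line)
    return float(match.group(1))


def implied_baseline(current, improvement):
    """Rate before the speedup, assuming current = baseline * (1 + improvement)."""
    return current / (1.0 + improvement)


totals = [parse_total(line) for line in LOG_LINES]
mean_total = sum(totals) / len(totals)

print(f"mean total: {mean_total:.1f} sol/s")
for improvement in (0.10, 0.15):
    baseline = implied_baseline(mean_total, improvement)
    print(f"{improvement:.0%} faster implies a baseline of about {baseline:.1f} sol/s")
```

This only uses the reported totals. The per-device figures in brackets don't add up exactly to the total on either line, so I wouldn't read too much into differences of a sol/s or so.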
Any chance this kernel will run on my 5970s? They have 1GB each...