Guys, I would appreciate it if someone with a 5xxx or 6xxx Radeon could test my modified kernel and report how their hps changed. That would help me understand whether the optimal code differs between the VLIW and GCN architectures, so I can implement it accordingly.