I'm experimenting with a very different implementation,
JCE Hybrid Miner, with the AES work done by the CPU in a split second (a CPU with AES instructions is basically a one-cycle AES ASIC) and only the scratchpad done by the GPU. The requirements are that the CPU supports AES instructions and that the GPU memory is relocated above 4G (an option in most BIOSes).
If it works, performance should skyrocket.
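
For anyone wanting to check the CPU requirement first, here is a minimal sketch (not part of JCE, just an illustration) that looks for the `aes` flag in Linux's `/proc/cpuinfo`. The function name `cpu_has_aes` is made up for this example, and it assumes a Linux system; on Windows you'd need another tool.

```python
def cpu_has_aes(cpuinfo_text):
    """Return True if a 'flags' (x86) or 'Features' (ARM) line lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            # Flags are space-separated; match the whole word only.
            if "aes" in value.split():
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AES instructions:", cpu_has_aes(f.read()))
    except OSError:
        # Assumption: non-Linux systems do not expose this file.
        print("/proc/cpuinfo not available on this system")
```

This only covers the CPU side; the above-4G setting still has to be checked in the BIOS.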

The bottleneck is the synchronization, which is fine on OpenCL for FPGAs (like Intel or Xilinx) but not on AMD GPUs.
You (or someone else with the proper skills) need to bake a proper x16r miner for AMD. Everything so far has been grossly underwhelming.