I didn't have much time for testing because of the urgent DCR fix, but I did notice a slowdown for Polaris on Linux. The cause is that OpenCL reports 14 compute units (CUs) instead of 36. You can work around it with an undocumented option: specify "-eqlim 288", where 288 = CUs * 8 (36 * 8 for this card).
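If your card has a different CU count, the same arithmetic applies. Here is a rough sketch of it in Python; the helper name `eqlim_for` is made up for illustration and is not part of the miner, and the factor of 8 is just the rule of thumb given above.

```python
def eqlim_for(compute_units: int, per_cu: int = 8) -> int:
    """Return the value to pass to -eqlim for a GPU with this many CUs.

    The factor of 8 per CU is taken from the post above; treat it as a
    rule of thumb, not a documented constant.
    """
    if compute_units <= 0:
        raise ValueError("compute_units must be positive")
    return compute_units * per_cu


# Real CU count of the Polaris card discussed above, not the 14 that
# OpenCL reports on Linux.
REAL_CUS = 36

print(f"-eqlim {eqlim_for(REAL_CUS)}")  # prints "-eqlim 288"
```

Use the card's real CU count from its spec sheet, not the number OpenCL reports, since the reported number is exactly what is wrong here.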
that does the trick for now. thx