With
-D 1 -v 1 -w 256 -aa
I get ~539 MHash/s ... this should be the preferred command line, right?
I'm currently testing other kernels for GCN performance.

Dia
-D is useless unless you're turning off other cards. -w 256 is the default, and so is -v 1. -aa does _absolutely nothing_, and I've already renamed it to something else in a local branch.
For the first time in history, running with no arguments is the best option. I have no clue how the hell that happened.
Well, it seems the GCN architecture is pretty straightforward to optimise for. You don't have to fiddle around much with the ordering of instructions and that sort of thing.
By the way, AMD has made amd_bitalign() obsolete for rotations, so it's safe to use rotate() instead. It also seems BFI_INT patching is no longer needed on 7970 cards. Did you observe that, too?
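To illustrate why swapping amd_bitalign() for rotate() is safe, here is a rough host-side sketch in Python that emulates both built-ins on 32-bit values and checks that they give the same right rotation. It also emulates bitselect(), the operation that BFI_INT patching used to target. This is not kernel code: the helper names rotr_bitalign and rotr_rotate are made up for the example, and the emulations assume the documented semantics of the OpenCL built-ins.

```python
MASK32 = 0xFFFFFFFF  # keep every value inside 32 bits, like an OpenCL uint


def amd_bitalign(hi, lo, shift):
    """Emulate amd_bitalign: low 32 bits of the 64-bit value hi:lo shifted right."""
    combined = ((hi & MASK32) << 32) | (lo & MASK32)
    return (combined >> (shift & 31)) & MASK32


def rotate(x, n):
    """Emulate OpenCL rotate() on a uint: LEFT rotation by n modulo 32."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def bitselect(a, b, c):
    """Emulate OpenCL bitselect(): take the bit from b where c is 1, else from a."""
    return ((a & ~c) | (b & c)) & MASK32


def rotr_bitalign(x, n):
    """Right rotation written the old way (hypothetical helper name)."""
    return amd_bitalign(x, x, n)


def rotr_rotate(x, n):
    """Right rotation written with plain rotate() (hypothetical helper name)."""
    return rotate(x, 32 - n)


if __name__ == "__main__":
    x = 0x12345678
    for n in range(32):
        # Both spellings must agree for every rotation amount.
        assert rotr_bitalign(x, n) == rotr_rotate(x, n)
    print(hex(rotr_rotate(x, 8)))
```

Since both forms produce identical results, which one is faster comes down to what the compiler emits for each, and that is the part that apparently changed on GCN.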
Dia