My current setup is the 2 GB R7 250s I've mentioned numerous times before. We've found that the 4 GB R7 240s actually outperform them by about 15%, even with a lower clock rate and fewer shaders. We attribute this to the larger memory allowing a more aggressive workload. I don't think we've even found their best performance settings yet, as we're still trying various combinations. That's ~2 accepted shares per minute per card for $100, vs. ~1.7 for the 250s.
The fact that my scrypt-jane kernels use 4 threads per hash is actually the key to the decent performance on NVIDIA cards. It lets me get 4 times the occupancy of the shader hardware, given the tight memory constraints. That, and the lookup_gap implementation I recently added.
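For anyone unfamiliar with the lookup_gap idea, here is a rough Python sketch of the general technique, not the actual kernel code. It assumes a simplified scrypt-style loop where only every lookup_gap-th scratchpad entry is stored and the skipped ones are recomputed on demand. The names `mix` and `romix` are hypothetical, and SHA-256 is only a stand-in for the real block mixing function.

```python
import hashlib


def mix(block: bytes) -> bytes:
    # Stand-in for the real block mixing function (assumption: SHA-256 is
    # used here only to keep the sketch short; real kernels use Salsa/ChaCha).
    return hashlib.sha256(block).digest()


def romix(seed: bytes, n: int, lookup_gap: int = 1) -> bytes:
    """Simplified scrypt-style memory-hard loop with a lookup gap."""
    x = mix(seed)

    # Phase 1: walk the chain of n entries, storing only every
    # lookup_gap-th one. Memory use drops by roughly a factor of lookup_gap.
    stored = []
    for i in range(n):
        if i % lookup_gap == 0:
            stored.append(x)
        x = mix(x)

    # Phase 2: n data-dependent reads. A skipped entry is rebuilt from the
    # nearest stored entry below it, costing up to lookup_gap - 1 extra mixes.
    for _ in range(n):
        j = int.from_bytes(x[:4], "little") % n
        v = stored[j // lookup_gap]
        for _ in range(j % lookup_gap):
            v = mix(v)
        x = mix(bytes(a ^ b for a, b in zip(x, v)))
    return x


if __name__ == "__main__":
    full = romix(b"example", 64, lookup_gap=1)
    gapped = romix(b"example", 64, lookup_gap=4)
    print(full == gapped)  # True: same result, about a quarter of the memory
```

The trade is extra computation for less memory per hash, which is what lets more hashes run concurrently when memory is the limiting factor.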
In contrast, the currently fastest scrypt (non-jane) kernels run 1 thread per hash: there is no overhead for inter-thread communication, and shader occupancy is not an issue given the 128 KB memory requirement per hash.
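To put the 128 KB figure in context, here is a small back-of-the-envelope sketch in Python. It assumes the standard scrypt scratchpad size of 128 * r * N bytes with the usual N=1024, r=1 parameters, and it ignores every other allocation on the card, so the counts are upper bounds only. The function names are made up for illustration.

```python
def scratchpad_bytes(n: int, r: int = 1, lookup_gap: int = 1) -> int:
    # One scratchpad entry is 128 * r bytes; with a lookup gap only every
    # lookup_gap-th entry is kept (rounded up).
    entries = -(-n // lookup_gap)  # ceiling division
    return entries * 128 * r


def max_concurrent_hashes(mem_bytes: int, n: int, r: int = 1,
                          lookup_gap: int = 1) -> int:
    # Upper bound only: assumes the whole card memory holds scratchpads.
    return mem_bytes // scratchpad_bytes(n, r, lookup_gap)


if __name__ == "__main__":
    GIB = 1024 ** 3
    print(scratchpad_bytes(1024))                               # 131072 (128 KiB)
    print(max_concurrent_hashes(2 * GIB, 1024))                 # 16384
    print(max_concurrent_hashes(2 * GIB, 1024, lookup_gap=2))   # 32768
```

With a scratchpad this small, memory does not cap the number of threads in flight. Larger N values change the picture quickly, since the scratchpad grows linearly with N.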
Because you are working on the AMD miner code, I think this information might be useful to you.
Christian