Polaris and Tonga both have 4 memory channels, with 2 GDDR5 chips per channel, for a total of 8 chips. Each chip does 32-byte burst transfers, so the 2 chips on a channel together provide a single 64-byte cache line. The memory layout switches channels every 256 bytes (4 cache lines).
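To make the layout above concrete, here is a rough sketch of how a byte address would map to a channel. It assumes a plain round-robin interleave with no address hashing or swizzling, which the real memory controller may well apply. The function and constant names are made up for illustration.

```python
# Figures taken from the post above; the mapping itself is an ASSUMPTION
# (simple round-robin interleave, no address-bit hashing).
NUM_CHANNELS = 4            # memory channels on Polaris / Tonga
CHIPS_PER_CHANNEL = 2       # GDDR5 chips per channel
BURST_BYTES = 32            # bytes per chip per burst transfer
CACHE_LINE_BYTES = CHIPS_PER_CHANNEL * BURST_BYTES   # 64 bytes
CHANNEL_STRIDE_BYTES = 256  # layout switches channel every 256 bytes


def cache_line_of(addr: int) -> int:
    """Index of the 64-byte cache line containing byte address addr."""
    return addr // CACHE_LINE_BYTES


def channel_of(addr: int) -> int:
    """Hypothetical channel index for addr under a round-robin interleave."""
    return (addr // CHANNEL_STRIDE_BYTES) % NUM_CHANNELS


if __name__ == "__main__":
    # Walk the first 1280 bytes one cache line at a time.
    for addr in range(0, 1280, CACHE_LINE_BYTES):
        print(f"addr {addr:5d}  line {cache_line_of(addr):2d}  channel {channel_of(addr)}")
```

Under this assumption, 4 consecutive cache lines land on the same channel, and the pattern repeats every 1024 bytes (4 channels x 256 bytes).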
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173
AMD docs say the cards use a direct-mapped cache, which means TLB thrashing can't be the problem since there is no TLB. It sounds a lot like the Pitcairn performance issues that show up as the memory working set grows beyond 1GB (except that the issue starts at 2GB with GCN3 devices). I haven't had much time for coding over the past few months, but hopefully I'll have some time over the summer to figure out what's really going on here.
Could this have anything to do with the memory straps? Maybe with stock straps it doesn't slow down with every new DAG.