An Avalon mining on P2Pool with even 3 modules does indeed take its CPU to the limit. The only real difference between P2Pool and other pools is that P2Pool's generation transaction is much larger. I'm guessing that the bottleneck is hashing the generation transaction ~16 times per second (though that isn't much...). If that's true, perhaps cgminer could be optimized to compress everything before the Stratum nonce to a SHA-256 midstate?