Private fees are one of the reasons I think Stratum v2 alone will not be enough. In theory, pools could be skipped entirely if all nodes reached some kind of consensus on how to split the block reward between all miners. I can imagine a system where, for example, 5,000 miners share their block headers. At 80 bytes per header, that would mean 400 kB per block, but once the rewards are distributed on-chain, those headers could be discarded. More data would probably have to be revealed, for example the whole coinbase transaction and an SPV proof for it. Assuming a more detailed proof takes 1 kB per share, that is 5 MB per block. Assuming all rewards mature after 100 blocks, that means 500 MB per node. As far as I know, the Stratum protocol allows shares up to 2^16 times easier than the network difficulty. If so, that means 64k shares instead of 5k, so around 6.4 GB per node.
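The numbers above can be rechecked with a few lines of Python. This is only a back-of-the-envelope sketch: the 1 kB proof size and the share counts come from the estimate above, and the function name storage_per_node is made up for illustration.

```python
HEADER_BYTES = 80        # size of a Bitcoin block header
PROOF_BYTES = 1_000      # assumed size of a more detailed share proof (1 kB)
MATURITY_BLOCKS = 100    # blocks before a coinbase reward matures


def storage_per_node(shares_per_block: int, bytes_per_share: int,
                     blocks: int = MATURITY_BLOCKS) -> int:
    """Bytes a node keeps if it stores every share until rewards mature."""
    return shares_per_block * bytes_per_share * blocks


if __name__ == "__main__":
    print("headers only, per block:", 5_000 * HEADER_BYTES, "bytes")
    print("5k shares, 1 kB proofs: ", storage_per_node(5_000, PROOF_BYTES), "bytes")
    print("2^16 shares, 1 kB proofs:", storage_per_node(2**16, PROOF_BYTES), "bytes")
```

With exactly 2^16 = 65,536 shares the last figure comes out slightly above the rounded 6.4 GB quoted above, which treats 64k shares as 64 MB per block.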
Hashes can be joined by simply calculating (firstHash*secondHash)/(firstHash+secondHash). The multiplication of two 256-bit values takes up to 512 bits; that product is then divided by some 256-bit value, and the result fits in no more than 256 bits. Assuming all hashes have at least 16 leading zero bits and there are no more than 2^16 shares, nothing will ever overflow. Division always rounds down, so joining two hashes gives the same value whichever one comes first. With more than two hashes, rounding down at each step can make the lowest bits depend on the order of joining, so nodes should join them in an agreed order (for example sorted).
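A minimal sketch of that joining rule, using Python integers in place of fixed-width 256-bit and 512-bit registers. The names join_hashes, join_all and MAX_SHARE_HASH are hypothetical, and sorting the inputs in join_all is my own assumption: it just makes every node fold the shares in the same order, because rounding down at each step can otherwise change the low bits of the result.

```python
from functools import reduce

# A 256-bit hash with at least 16 leading zero bits is below 2^240.
MAX_SHARE_HASH = 1 << 240


def join_hashes(first_hash: int, second_hash: int) -> int:
    """Combine two share hashes into one hash representing their joint work."""
    for h in (first_hash, second_hash):
        if not 0 < h < MAX_SHARE_HASH:
            raise ValueError("hash must be positive and have 16 leading zero bits")
    product = first_hash * second_hash            # below 2^480, fits in 512 bits
    return product // (first_hash + second_hash)  # floor division, rounds down


def join_all(hashes: list[int]) -> int:
    """Join a non-empty list of hashes in one canonical (sorted) order."""
    return reduce(join_hashes, sorted(hashes))


if __name__ == "__main__":
    a, b = 1 << 200, 3 << 198
    print(join_hashes(a, b) == join_hashes(b, a))  # pairwise order is irrelevant
    print(hex(join_all([b, a, a])))
```

Two equal hashes join into a hash half as large, which matches the intuition that two shares of the same difficulty are worth twice the work of one.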
In such a decentralized system, even if someone receives fees in some private way, that miner will simply receive a lower reward. Each miner receives a fraction of its own coinbase transaction, so the more fees that miner includes, the more coins it gets after that coinbase reward is divided. To calculate a miner's income, the value of the coinbase transaction in satoshis could be multiplied by the target, giving the easiest possible target that earns a single satoshi. That target could then be divided by the miner's hash to get the number of satoshis (if done inside the Lightning Network, smaller units like millisatoshis could be used).
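The reward calculation can be sketched like this. I am assuming "the target" means the network target, and the concrete values (a target of 2^200 and a 6.25 BTC coinbase) are invented purely for illustration; the function names are hypothetical as well.

```python
def satoshi_target(coinbase_value_sat: int, network_target: int) -> int:
    """Easiest target that still earns a single satoshi."""
    return coinbase_value_sat * network_target


def share_reward(coinbase_value_sat: int, network_target: int,
                 share_hash: int) -> int:
    """Satoshis earned by a share, rounded down."""
    if share_hash <= 0:
        raise ValueError("share hash must be positive")
    return satoshi_target(coinbase_value_sat, network_target) // share_hash


if __name__ == "__main__":
    target = 1 << 200            # made-up network target
    coinbase = 625_000_000       # made-up coinbase value: 6.25 BTC in satoshis

    # A hash exactly at the network target earns the whole coinbase value.
    print(share_reward(coinbase, target, target))
    # A share 2^16 times easier earns 1/65536 of it, rounded down.
    print(share_reward(coinbase, target, target << 16))
    # Same share in millisatoshis: pass the coinbase value in msat instead.
    print(share_reward(coinbase * 1_000, target, target << 16))
```

A miner who leaves fees out of its own coinbase lowers coinbase_value_sat and therefore its own payout, which is the incentive described above.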
Because interacting with all miners may take a lot of resources when proving that all miners are honest and their shares are valid, such a system could be implemented on top of the Lightning Network. That way, mining nodes would have to validate only two blocks per coinbase transaction, reducing space requirements to 8 MB per block (two blocks fully filled with Segwit transactions take up to 4 MB each). Over 100 blocks, that means around 800 MB per node. The previously mentioned 1 kB proof would work only if transactions could be joined, so that the whole block would not have to be revealed when calculating the coinbase reward.