What thread concurrency were you running? It's absolutely possible to fire up a miner at any value of N with 1-2 GB per thread, but if your cards have 4 GB of RAM each and you are only using 1 GB on each, then it's really not a great test. I'd be happy to put together a command line for you that replicates how I run my 4x4GB card system. When I had only 4 GB of system RAM, I couldn't even fire up a single thread allocating 3.5 GB of memory per card. Let me know if you want to test it out.
Just to clarify: how exactly does the buffer size relate to the memory size on the card and to N? If you have a link to something that explains this for noobs I'd appreciate it. Thanks.
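
One rough way to picture that relationship is sketched below. It is not a definitive answer for any particular miner. It assumes a scrypt-style kernel where each concurrent hash needs a scratchpad of 128 * r * N bytes, shrunk by a lookup-gap factor, and where the buffer on the card is that scratchpad size multiplied by the thread concurrency. The function names and the `lookup_gap` parameter are made up for illustration, and your miner may size its buffers differently.

```python
# Sketch only: assumes a scrypt-style memory model, not any specific miner's code.

def scratchpad_bytes(n, r=1, lookup_gap=1):
    """Bytes of scratchpad one hash needs under the assumed model."""
    # Only every lookup_gap-th entry is stored, so round N / lookup_gap up.
    entries = -(-n // lookup_gap)
    # Each stored entry is 128 * r bytes in scrypt.
    return 128 * r * entries


def buffer_bytes(n, thread_concurrency, r=1, lookup_gap=1):
    """Total buffer size for thread_concurrency hashes in flight."""
    return scratchpad_bytes(n, r, lookup_gap) * thread_concurrency


def max_thread_concurrency(n, budget_bytes, r=1, lookup_gap=1):
    """Largest thread concurrency whose buffer fits in budget_bytes."""
    return budget_bytes // scratchpad_bytes(n, r, lookup_gap)


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical settings: N = 2048, lookup gap 2, thread concurrency 8192.
    print(buffer_bytes(2048, 8192, lookup_gap=2) / gib, "GiB")
    # How much concurrency a 3.5 GiB allocation could hold at the same N.
    print(max_thread_concurrency(2048, int(3.5 * gib), lookup_gap=2))
```

Under this model, doubling N doubles the memory each hash needs, so for a fixed amount of card memory the usable thread concurrency is cut in half.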