You used a lot of doublespeak. First, I am aware of the space-time tradeoff; rather than explain it in every single post, it is useful to look at the max scratchpad size. A 128 KB scratchpad is going to require less memory and less bandwidth than a 16 MB scratchpad regardless of what space-time tradeoff is employed. A device only has a finite amount of computing power, and while you can trade time for space, needing less space to start with always helps.
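For concreteness, scrypt's scratchpad is N blocks of 128·r bytes, so its size follows directly from the parameters. A quick sketch comparing LTC's parameters (N = 1024, r = 1) against scrypt's original interactive-login defaults (N = 16384, r = 8), which is where the 128 KB and 16 MB figures come from:

```python
def scrypt_scratchpad_bytes(n, r):
    # ROMix keeps N blocks of 128*r bytes each in memory.
    return 128 * n * r

ltc = scrypt_scratchpad_bytes(n=1024, r=1)       # LTC parameters
default = scrypt_scratchpad_bytes(n=16384, r=8)  # scrypt interactive defaults
print(ltc // 1024, "KiB vs", default // 2**20, "MiB ->", default // ltc, "x")
# → 128 KiB vs 16 MiB -> 128 x
```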
As for a higher parameter value having no effect on the relative performance of CPUs, GPUs, and FPGAs/ASICs, that is just false. Scrypt was designed to be resistant to GPUs and specialized devices. This matters in password hashing because most servers use CPUs, while an attacker will choose whichever component is most effective for brute forcing; by keeping CPU performance superior, scrypt prevents attackers from gaining an advantage. You can test this yourself: modify the cgminer OpenCL kernel to use a higher N value. Around N = 2^14 the GPU's relative advantage is essentially gone and its throughput is comparable to a CPU's. At 2^16 the GPU falls far behind. At 2^20 the GPU never completes.
You say on one hand that the memory requirement doesn't matter, and on the other hand that FPGAs are hard because they need lots of memory and wide buses. Well, guess what: the higher the N value, the MORE memory and the wider the buses needed. At N = 2^14 (with r = 8), roughly 128x the max scratchpad size is going to mean roughly 128x as much bandwidth is necessary. So the lower the N value, the EASIER the job is for FPGA and ASIC builders: they can use less memory and narrower buses, which means less cost, less complexity, and a higher ROI. Sure, one isn't required to use the max scratchpad size, because values can be recomputed on the fly, but once again the whole point of the space-time tradeoff is that the advantage of doing so is reduced.
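To put rough numbers on the bandwidth claim (my own back-of-the-envelope model, stated as an assumption): a full-scratchpad ROMix writes all N blocks once to fill the scratchpad, then reads N blocks back in a data-dependent order, so memory traffic per hash is at least about 2·N·128·r bytes. At a fixed memory bandwidth, the achievable hashrate scales inversely with that:

```python
def min_traffic_bytes(n, r):
    # Fill pass (N writes) + mixing pass (N random reads), 128*r bytes per block.
    return 2 * n * (128 * r)

def bandwidth_bound_hashrate(bandwidth_bytes_per_s, n, r):
    # Upper bound on hashes/sec if memory bandwidth is the bottleneck.
    return bandwidth_bytes_per_s / min_traffic_bytes(n, r)

bw = 100e9  # hypothetical device with 100 GB/s of memory bandwidth
print(f"LTC params:     {bandwidth_bound_hashrate(bw, 1024, 1):,.0f} H/s")
print(f"default params: {bandwidth_bound_hashrate(bw, 16384, 8):,.0f} H/s")
```

The same 128x ratio between the scratchpad sizes shows up directly as a 128x gap in the bandwidth-bound hashrate.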
Lastly, yes, the 128 KB is per core, but so is the 16 MB using the default parameters. If 128 KB per core increases memory, bandwidth, and/or die size per core, then a 16 MB requirement would make it even harder. So yes, the parameters chosen by LTC make it 128x less memory hard than the default. You use circular logic to say the max scratchpad size is irrelevant because one can optimize the size of the scratchpad to the available resources. That doesn't change the fact that, due to the space-time tradeoff, you aren't gaining relative performance: using a higher max scratchpad requires either more memory and bandwidth OR more computation, and throughput on the FPGA, GPU, and CPU is going to be reduced. Now, if they were all reduced equally it wouldn't matter, since all that matters is relative, not nominal, performance. However, the LTC parameters chosen are horrible for CPU usage. CPUs have a limited ability for parallel execution, usually 4 or 8 independent cores, and 128 KB per core * 8 cores = 1 MB. That's right: today, when systems can install multiple GB of RAM for very little cost, the scrypt parameters chosen bottleneck performance on a CPU at about 1 MB. GPUs, on the other hand, are highly parallel execution engines, but they have limited memory, and that memory comes at a higher cost than the memory CPUs have access to.
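The per-core arithmetic above, extended to both parameter sets (the 4- and 8-core counts are the ones from this post; the totals follow from the scratchpad sizes):

```python
ltc_kib = 128            # LTC scratchpad per core, in KiB
default_kib = 16 * 1024  # default-parameter scratchpad per core, in KiB
for cores in (4, 8):
    print(f"{cores} cores: LTC {cores * ltc_kib} KiB total, "
          f"default {cores * default_kib // 1024} MiB total")
# → 4 cores: LTC 512 KiB total, default 64 MiB total
# → 8 cores: LTC 1024 KiB total, default 128 MiB total
```

An 8-core CPU attacking LTC-parameter scrypt touches only 1 MiB in total, nowhere near enough to exploit the cheap gigabytes of RAM a CPU system has over a GPU.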
TL;DR
Whatever the relative performance of this FPGA is to a CPU miner, it would be WORSE if the N value were higher. LTC's decision to use a low N value turns what would otherwise be a nearly impossible task into one that is merely challenging.