Let's do some basic math:
For existing FPGA designs, the best that can be had is 23MHps/J. There is no reason to anticipate an improvement in FPGA power efficiency. Yes, there can be a marginal reduction of overhead and the FPGA can be scaled up, but its efficiency will not increase all that much. Based on existing designs we can anticipate 25MH/J for FPGA. There is nothing special about ASICs; most ASIC vendors just use a custom-programmed FPGA, which is called FPGA-to-ASIC conversion. So at best an ASIC will be 50MHps/J;
You are wrong. A real-world SHA-256 130nm chip, not optimized for Bitcoin, has already demonstrated 73 Mhash/J:
https://bitcointalk.org/index.php?topic=95762.0
Merely scaling down this non-optimal design from 130nm to 65nm would multiply the Mhash/J efficiency by 4 (because energy per hash is roughly proportional to the transistor junction area, and halving the feature size quarters that area), making it 292 Mhash/J.
Then it is not hard to imagine that optimizing the chip for Bitcoin (i.e. two SHA-256 passes with no high-speed I/O, since the same data block is hashed over and over locally, merely incrementing the nonce) would improve the efficiency by a factor of 2 or 3, therefore making it 584 Mhash/J or 876 Mhash/J.
These numbers are not far from BFL's claims (1000 Mhash/J), which makes those claims plausible.
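
For anyone who wants to check the arithmetic, here is a rough sketch of the estimate in Python. The function and variable names are made up for illustration, and the model is only the back-of-the-envelope one used above: efficiency scales with the inverse square of the feature size, and a Bitcoin-specific design gains an assumed extra factor of 2 to 3.

```python
# Back-of-the-envelope ASIC efficiency estimate (all names are illustrative).

BASE_MHASH_PER_J = 73.0      # demonstrated by the 130nm SHA-256 chip
BASE_NODE_NM = 130.0
TARGET_NODE_NM = 65.0
BFL_CLAIM_MHASH_PER_J = 1000.0


def scale_efficiency(base_eff, from_nm, to_nm):
    """Assume efficiency is inversely proportional to transistor area,
    i.e. it grows with the square of the linear shrink factor."""
    shrink = from_nm / to_nm
    return base_eff * shrink ** 2


# Process shrink alone: 130nm -> 65nm.
scaled = scale_efficiency(BASE_MHASH_PER_J, BASE_NODE_NM, TARGET_NODE_NM)

# Assumed gain from a Bitcoin-specific design (the factor of 2 to 3 is a guess).
low_estimate = scaled * 2
high_estimate = scaled * 3

print(f"65nm, unoptimized:       {scaled:.0f} Mhash/J")
print(f"65nm, Bitcoin-optimized: {low_estimate:.0f} to {high_estimate:.0f} Mhash/J")
print(f"Share of BFL's claim:    {low_estimate / BFL_CLAIM_MHASH_PER_J:.0%} "
      f"to {high_estimate / BFL_CLAIM_MHASH_PER_J:.0%}")
```

Changing TARGET_NODE_NM or the optimization factors shows how sensitive the result is to each assumption.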
And with 2-3 years of design and prototyping, plus about $1-$2 million in start-up costs, you can do it.