Ytterblum -
I remember reading that the HashFast chip has 4 dies of 9x9 mm each, so a total of 324 mm² to get a nominal 400 GH/s and a max of 540 GH/s. That puts their perf per mm² between 1.23 GH/s/mm² (nominal) and 1.67 GH/s/mm² (max overclock, which their latest specs say is not a recommended speed to run at).
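For anyone who wants to check the arithmetic, here is a quick sketch in Python. The die size, die count and hashrates are the figures quoted above (from memory, so treat them as assumptions); the function and variable names are just made up for illustration.

```python
# Hashrate density from the figures quoted in the post (recalled, not verified).
DIE_SIDE_MM = 9.0      # each die is 9 x 9 mm
DIES_PER_CHIP = 4      # four dies per package

total_area_mm2 = DIES_PER_CHIP * DIE_SIDE_MM ** 2  # 4 * 81 = 324 mm^2


def ghs_per_mm2(hashrate_ghs: float, area_mm2: float) -> float:
    """Return hashrate per unit of silicon area, in GH/s per mm^2."""
    return hashrate_ghs / area_mm2


nominal_density = ghs_per_mm2(400.0, total_area_mm2)  # nominal rating
max_density = ghs_per_mm2(540.0, total_area_mm2)      # max overclock

print(f"total area: {total_area_mm2:.0f} mm^2")
print(f"nominal:    {nominal_density:.2f} GH/s per mm^2")
print(f"max:        {max_density:.2f} GH/s per mm^2")
```

The same helper can be pointed at any other chip once you know its die area and hashrate, which makes side-by-side comparisons less error-prone.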
Comparing 40nm to 28nm isn't a linear comparison like you did. It's two-dimensional: density goes with the square of the feature size, and (40/28)² is about 2.04, so ideally you could fit roughly twice as many transistors in a similar die area, rather than the 1.43 times a linear ratio suggests.
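The scaling point can be put in numbers with a small Python sketch. It assumes ideal geometric scaling, where transistor density goes with the inverse square of the node size; real processes don't track the node name that neatly, so this is a rough guide only. The function name is hypothetical.

```python
# Idealized density scaling between process nodes (assumption: density
# scales with the inverse square of the feature size; real nodes deviate).


def ideal_density_gain(old_nm: float, new_nm: float) -> float:
    """Return how many times more transistors fit in the same area."""
    return (old_nm / new_nm) ** 2


linear_ratio = 40.0 / 28.0                  # one-dimensional shrink
area_ratio = ideal_density_gain(40.0, 28.0)  # two-dimensional shrink

print(f"linear ratio: {linear_ratio:.2f}x")
print(f"area ratio:   {area_ratio:.2f}x")
```

A full halving of the feature size (say 40nm to 20nm) is what would give the four-times figure under the same assumption.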
Also, don't forget that the Bitfury chip is 'full custom', which means he laid out his circuit by hand. That's painstaking work that's not likely to be repeated very often. All the other bitcoin chips are standard-cell designs using someone else's cell library (apart from the VMC/AMC chip, which is an eASIC and thus can probably be written off in perf and power terms).
-- Jez