But it is not an audit of the silicon; it is just a test of the entropy of the data that it outputs.
Just one minor nitpick: we're not testing the hardware RNG directly. We're testing the final output of /dev/random while it's being continuously fed entropy from the hardware RNG. /dev/random has its own whitening algorithms and doesn't JUST use the entropy provided by the hwrng; it pulls from a number of other sources as well.
I agree with you that using JUST the output from the hwrng is unsafe. That's why we feed the entropy into /dev/random instead of using it directly. Our test covers the system as a whole.
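To make "testing the output" concrete, here is a rough sketch of the kind of check I mean. It is not our actual test suite, just an illustration: the function names, the 4096-byte sample size and the choice of a plain Shannon estimate are all my own assumptions. And as you said, a good score here says nothing about the silicon. It only describes the byte distribution of whatever comes out of the device.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Estimate Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value 0..255
    # H = sum over observed byte values of -p * log2(p)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def sample_device(path: str = "/dev/random", n: int = 4096) -> bytes:
    """Read n bytes from a device node (hypothetical helper, path is an assumption)."""
    with open(path, "rb") as f:
        return f.read(n)
```

Calling `shannon_entropy_bits_per_byte(sample_device())` on a healthy system should give a value close to 8 bits per byte, but so would plenty of predictable generators, so treat it as a sanity check on the whole pipeline and nothing more.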
Your points are certainly valid though.