To me, what it really comes down to is probabilities. The popular assumption is that Bitcoin Core has an extremely low probability that it will fork against itself. So, let's go with that premise and say that the probability of Bitcoin Core forking against itself is something like 1 in a 100 trillion (1e14).
Probabilities are not an effective model to reason about bugs in an adversarial environment. (Even in conventional reliability engineering-- without invoking an adversary-- you must first separate systematic failure from random failure in order to reason about failure distributions.)
"If you have a website with a billion pixels and due to a bug clicking one gives the reader one of your bitcoins, and you have one visitor an hour and a normal visitor clicks on 100 pixels, how long until you've lost 100 bitcoins?" .... This doesn't make sense, and thinking about consensus bugs in this manner doesn't make sense. Existing consensus bugs we've seen triggered against implementation were 10^-100 scale events if your model is random monkies pounding on keyboards to produce transactions.
Virtually all the errors we care about are not random failures, and to the extent that there is an element of randomness an attacker can behave adaptively to bring out the otherwise 'rare' behavior that is harmful to your interests.
the consensus tests need to be improved
Thats always true, the tests are woefully inadequate and would continue to be even if they were terrabytes in size. That said, improving them has high value. I'm looking forward to contributions on that front. Probably one of the really important things an alternative implementation can do for the ecosystem is produce new tests as a side effect of their own independent efforts to obtain agreement. (Though, sadly, there seems to have been at least some level of people keeping at least implementation errors private in order to use them for marketing purposes... with all the different ways to monetize a security flaw other than reporting it, we cannot really count on reports.)
But thinking that passing a prefab test suite that someone else wrote alone is sufficient-- which is certainly the impression given there-- sounds like magical thinking. It's not sufficient, merely necessary, and one shouldn't rely on it alone. (If this wasn't the message you were intending to convey, it wasn't clear to me.) Anyone can grind at some tests until they pass but doing, sadly, reduces the predictive power of the tests for errors not present in the tests. (This is partially why I prefer the block tests in Bitcoin runs in CI rather than being used locally by developers; we get a concrete measurement of detectable consensus bugs which are introduced by a change.) Yes, by some definition the tests are inadequate if there is a failure in spite of passing; but they're always going to be inadequate because they cannot be sound, at least not by themselves. (A test coupled with a formal analysis that showed it met some definition of completeness would be interesting.)
There should be a formal specification,
Philosophically there should be, but in terms of benefit to the world the value is quite possibly negative. A non-executable or not widely used specification is a dead letter. The behavior of the participants defines the system. You can point blame all day this way that when some reading of the specification disagrees with some other reading; but its little comfort when all the funds are gone. As a pedagogical artifact a good specification may be highly useful; but it can also cause harm by feeding deep misunderstandings like having software which you believe-does (instead of actually-does) what you believe the specification says (rather than what everyone else actually-does, regardless of what they believe the specification says) is an acceptable situation to be in. "But I followed the spec!" "Too bad, all your coins have been double spent away because you ended up on a different chain from the other participants." Compared to other efforts to improve system integrity I think specification work beyond what exists now has fairly low returns. (Doubly so because existing alternative implementers have frequently made mistakes in parts of the system that are clearly specified and have tests in the normal tests suites.). In Bitcoin the behavior of the system is all there is. You have no other recourse, no higher authority. Being surprised or unhappy with at the Bitcoin network for disagreeing with a ream of paper is, in some ways, like being mad at the universe for not being completely defined by Einsteins' description of general relativity. Like the universe, the Bitcoin network doesn't give a darn about political gamesmanship. It does what it does, not what you think it should do. To the extent that people want to work on specification work, especially if it was specification work that clearly enough defined to permit formal reasoning, it's work I would support-- in spite of my reservations about potential net-negative value from it.