Ok, it looks like I finally have this fixed. It was just the orphan blocks that were throwing me off, and now those are all taken into account. The parser now successfully reads every single block, transaction, input, and output in the blockchain. There are some output scripts for which it fails to derive a valid public key signature. However, in all of those cases I found that both blockexplorer and blockchain.info have the same issue, so I'm not treating that as a big concern for now.
I plan to write up a specific blog post detailing all of these 'gotchas' you have to know about to stream/parse the block-chain sequentially (rather than trying to scan all of the blocks into some kind of graph/tree up front). There are a bunch of little annoying things that are not obvious.
[Edit: Spoke too soon, I go off the rails a little further down the blockchain, still have to debug this some more...]
John
I found it easiest to naively commit blockdats into the database as I parse them.
The simplest way to derive the longest branch for me was to represent it as a function applied to the tree which returns a sequence of block entities on the branch. Then I can just cache it, intersect it with blocks to see if they're on it, or apply it to any sub-tree (like the tree at 50 confirmations) when I append new blocks.
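To make that concrete, here is a minimal sketch in Python of a longest-branch function over a block tree. The `Block` tuple, the `longest_branch` name, and the choice to measure a branch by block count are all assumptions made for illustration; the real network picks the branch with the most cumulative proof of work, so treat plain depth as a stand-in.

```python
from collections import namedtuple

# Hypothetical minimal block record: only the two hashes needed to link the tree.
Block = namedtuple("Block", ["hash", "prev_hash"])


def longest_branch(blocks, root_hash):
    """Return the blocks from root_hash down to the deepest tip, in order.

    Depth (block count) is used here as a simple stand-in for cumulative work.
    """
    by_hash = {b.hash: b for b in blocks}

    # Map each parent hash to the hashes of its children.
    children = {}
    for b in blocks:
        children.setdefault(b.prev_hash, []).append(b.hash)

    # Iterative depth-first walk to find the deepest tip under the root.
    best_tip, best_depth = root_hash, 0
    stack = [(root_hash, 0)]
    while stack:
        h, depth = stack.pop()
        if depth > best_depth:
            best_tip, best_depth = h, depth
        for child in children.get(h, []):
            stack.append((child, depth + 1))

    # Walk back from the tip to the root, then reverse into chain order.
    branch = []
    h = best_tip
    while True:
        branch.append(by_hash[h])
        if h == root_hash:
            break
        h = by_hash[h].prev_hash
    branch.reverse()
    return branch


# Tiny tree: g -> a -> b -> c, with x as an orphaned sibling of b.
blocks = [
    Block("g", None),
    Block("a", "g"),
    Block("b", "a"),
    Block("x", "a"),
    Block("c", "b"),
]
main_chain = longest_branch(blocks, "g")
on_main_chain = {b.hash for b in main_chain}  # cacheable set for membership tests
```

Because the root is just a parameter, the same function can be pointed at any sub-tree, for example `longest_branch(blocks, "b")`, which is the property that makes it cheap to re-run on only the recent part of the tree when new blocks arrive.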
Output scripts only usually happen to conform to a few common footprints that let you extract addresses and pubkeys, so just store the binary blob in the database, which I'm sure you're doing. If I recall correctly, one of the few assumptions you can make about blockdat data is that the reference client does ensure it parses into op-codes, so at least there's that!
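As a rough illustration of the "common footprints" idea, the sketch below splits a raw output script into op-codes and matches the two most familiar patterns: pay-to-pubkey-hash (OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG) and pay-to-pubkey (<pubkey> OP_CHECKSIG). The function names `parse_script` and `classify` are made up for this example, and the code deliberately does not rely on every script parsing cleanly: anything that fails to parse or match is just reported as such, so the raw blob remains the source of truth.

```python
def parse_script(script: bytes):
    """Split a raw script into (opcode, data) pairs; data is None for non-push ops.

    Raises ValueError if a push runs past the end of the script.
    """
    ops = []
    i, n = 0, len(script)
    while i < n:
        op = script[i]
        i += 1
        if 0x01 <= op <= 0x4B:
            size = op  # direct push of 1..75 bytes
        elif op in (0x4C, 0x4D, 0x4E):
            # OP_PUSHDATA1/2/4: a 1, 2 or 4 byte little-endian length follows
            width = {0x4C: 1, 0x4D: 2, 0x4E: 4}[op]
            if i + width > n:
                raise ValueError("truncated push length")
            size = int.from_bytes(script[i:i + width], "little")
            i += width
        else:
            ops.append((op, None))
            continue
        if i + size > n:
            raise ValueError("truncated push data")
        ops.append((op, script[i:i + size]))
        i += size
    return ops


def classify(script: bytes):
    """Return (kind, payload) for a few well-known output script shapes."""
    try:
        ops = parse_script(script)
    except ValueError:
        return ("unparseable", None)
    codes = [op for op, _ in ops]
    # OP_DUP OP_HASH160 <20-byte hash> OP_EQUALVERIFY OP_CHECKSIG
    if codes == [0x76, 0xA9, 0x14, 0x88, 0xAC]:
        return ("pubkeyhash", ops[2][1])
    # <33- or 65-byte pubkey> OP_CHECKSIG
    if len(ops) == 2 and codes[1] == 0xAC and ops[0][1] is not None \
            and len(ops[0][1]) in (33, 65):
        return ("pubkey", ops[0][1])
    return ("nonstandard", None)


# Example inputs built from dummy bytes, not real chain data.
dummy_hash = bytes(range(20))
p2pkh_script = bytes([0x76, 0xA9, 0x14]) + dummy_hash + bytes([0x88, 0xAC])
dummy_pubkey = b"\x04" + bytes(64)
p2pk_script = bytes([65]) + dummy_pubkey + b"\xac"
```

Turning the 20-byte hash into a printable address still needs a version byte and Base58Check encoding, which is left out here.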
The road to robustness is definitely a humbling journey. Hilarious, even.
Keep us updated. I have a lot of gotchas etched into comments around my codebase, but "surely I'll blog about them someday!" turns out only to be a compulsive lie I tell myself to feel benevolent.