Re: Pieter and Greg, Bech32, please
by gmaxwell on 07/04/2019, 08:55:44 UTC
Sorry, I missed this question at the time, but it seemed a shame not to have followed up on it. -- For posterity:

Quote
IIUC, the “CPU decades” must have crunched alphabets with the NIST visual similarity data referenced in BIP 173 and the error-correcting code to find the alphabet which, per available data, would have the lowest statistical likelihood of undetected or unrecoverable bit errors.  (gmaxwell, am I correct in this inference?)  Those data were originally gathered from humans; and the resulting address format was tested with humans.
Yes; however, we didn't have to make all of the decisions jointly at once.

We searched all the possible BCH codes of the relevant size for the ones that detect the most character errors (5); among those we selected the ones with the lowest false-acceptance rate just beyond 5 errors, then searched the remaining ties (codes that were equally good for character errors) to find the one that was best at bit errors (it can guarantee detecting up to 6 arbitrarily placed bit errors).
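
For concreteness, the code that search settled on is the one specified in BIP 173; the entire checksum reduces to a short polymod over the 5-bit data values, essentially verbatim from the BIP's Python reference implementation:

Code:
def bech32_polymod(values):
    """The BCH checksum chosen by the search, as given in BIP 173."""
    GEN = [0x3b6a57b2, 0x26508e6b, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]
    chk = 1
    for v in values:  # each v is a 5-bit value (one data character)
        b = chk >> 25
        chk = (chk & 0x1ffffff) << 5 ^ v
        for i in range(5):
            chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk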

While these searches were ongoing, we used the NIST data to drive a machine search over all 58905 possible selections of 4 excluded characters out of 36, looking for the set with the least internal character confusion, assuming that the input was all uppercase or all lowercase.  It made a decision that I wouldn't have guessed by hand-- e.g. excluding both 1 and i-- but actually looking at the result confirmed it was a good choice.  B can be confused a lot of ways: B looks like 8 and 3, b looks like p or lo, etc.  Other guesswork versions looked a lot worse when compared objectively.  I don't doubt you could make a somewhat different decision with a different metric or using better input data... but you could also do a lot worse, as I found just from informally picking the characters to exclude after looking at the NIST data.
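
A rough sketch of what that exclusion search amounts to, assuming a hypothetical confusion[a][b] similarity table built from the NIST data (the names and scoring here are illustrative, not Pieter's actual code):

Code:
from itertools import combinations

ALNUM = "0123456789abcdefghijklmnopqrstuvwxyz"  # the 36 candidate characters

def charset_score(kept, confusion):
    # Total pairwise confusability among the 32 characters we keep; lower is
    # better.  confusion[a][b] is assumed to come from the NIST data.
    return sum(confusion[a][b] for a, b in combinations(kept, 2))

def best_exclusion(confusion):
    best = None
    # C(36, 4) = 58905 candidate exclusion sets -- cheap to enumerate fully.
    for excluded in combinations(ALNUM, 4):
        kept = [c for c in ALNUM if c not in excluded]
        score = charset_score(kept, confusion)
        if best is None or score < best[0]:
            best = (score, excluded)
    return best  # the real search settled on excluding 1, b, i, o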

[There was an earlier version of the search, before we implemented all the NIST data, that wanted to exclude 'B', 'L', 'O', 'Q', which I found pretty amusing-- though I was ultimately relieved that we wouldn't have to convince anyone we hadn't intentionally targeted the name of a former developer's ICO-mill company Tongue...]

For most applications the HRP should be fixed by context (e.g. you don't have a reason to intentionally input an arbitrary altcoin address into your bitcoin wallet), which means that any charset error (like entering a 1) can be immediately highlighted in the UI with zero false positives.  (I am highly dubious about auto-fixing anything, since a detectable error may signify carelessness which could result in other, non-detectable errors.)
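
Since the charset is fixed, a wallet UI can flag such errors before the checksum is even run. A minimal sketch (invalid_positions is a made-up helper; the CHARSET string is the real BIP 173 one):

Code:
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # BIP 173 data charset

def invalid_positions(data_part):
    # Indices of characters outside the charset (e.g. a stray '1' or 'i'),
    # which a UI can highlight with zero false positives.
    return [i for i, c in enumerate(data_part.lower()) if c not in CHARSET]

assert invalid_positions("qpzri1") == [4, 5]  # 'i' and '1' get flagged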

Finally, we non-exhaustively searched the space of 32! permutations to find a permutation that mapped the remaining most likely character confusions to single bit errors.  E.g. q and g are in the charset and differ by only a single bit; likewise c/e, r/t, e/6, 7/l, q/p, S/5, n/m, n/h (the obviousness of these examples depends on your font...), and many other confusable pairs. This mapping means that likely user errors get mapped to fewer bit errors, and since the error detecting code was chosen to have improved detection for single bit errors, this means improved detection for the errors users are likely to make.   This is an optimization with only a small effect, but it is one with fairly low marginal cost, since virtually any encode/decode implementation would already use a table due to the missing characters-- just as existing base58 (and base64) (de)coders do. Given that a table is being used, the choice of the order is essentially arbitrary, so we were free to choose it to optimize the performance of the error detecting code. I doubt an implementation that includes signing (taking tens of kB of code) would care much about saving a few bytes by using a slow branchy alternative rather than just a simple table in read-only memory. Smiley
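
You can check that property directly against the final ordering; each of these confusable pairs maps to 5-bit values exactly one bit apart:

Code:
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # the permutation the search chose

def bit_distance(a, b):
    # Hamming distance between the 5-bit values two characters encode.
    return bin(CHARSET.index(a) ^ CHARSET.index(b)).count("1")

for pair in ["qg", "ce", "rt", "7l", "qp", "s5", "nm", "nh"]:
    print(pair, bit_distance(*pair))  # each prints 1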

The code searches and permutation searches were computationally expensive, but we didn't have to pay the product of their costs, only the sum, because they could be run independently and the results combined (and the charset search wasn't terribly expensive).  Pieter has code for all these searches in his ezbase32 repository...