While processing Bounty Content applications, I stumbled upon something I haven't seen before. I can't really figure out how it's done, but it turns out there's a difference between "wallet" (11,000 hits on Google) and "wallet" (127,000,000 hits on Google)!
You stumbled across a well-known security issue, which usually affects identifiers such as domain names.
E.g., paypal.com vs. pаypаl.com. See, the same difference! Or the notorious whole-script confusable, аррӏе.com (not the same as apple.com):
$ echo "apple.com" | hd
00000000 61 70 70 6c 65 2e 63 6f 6d 0a |apple.com.|
0000000a
$ echo "appӏe.com" | hd
00000000 d0 b0 d1 80 d1 80 d3 8f d0 b5 2e 63 6f 6d 0a |...........com.|
0000000f
First: how is this done? Is there some software that replaces ascii characters by something that looks like it, but can't be found through copy/paste?
Lookalike letters from other scripts, such as Cyrillic and Greek, are used in lieu of Latin letters. In this case, U+0430 CYRILLIC SMALL LETTER A, which UTF-8 encodes as { 0xd0, 0xb0 }, and U+0435 CYRILLIC SMALL LETTER IE, which UTF-8 encodes as { 0xd0, 0xb5 }:
$ echo "wallet" | hd
00000000 77 d0 b0 6c 6c d0 b5 74 0a |w..ll..t.|
00000009
$ echo "wallet" | hd
00000000 77 61 6c 6c 65 74 0a |wallet.|
00000007
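If hd is not at hand, the same inspection can be sketched in Python with the standard unicodedata module. The helper name describe is made up for this illustration, and the spoofed string is written with escapes so that the lookalikes survive copy/paste; bytes.hex() with a separator needs Python 3.8 or later.

```python
import unicodedata

def describe(text):
    """Return (char, code point, Unicode name, UTF-8 bytes as hex) per character."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            ch.encode("utf-8").hex(" "),
        ))
    return rows

latin = "wallet"
# Cyrillic U+0430 and U+0435 stand in for Latin "a" and "e".
spoof = "w\u0430ll\u0435t"

if __name__ == "__main__":
    for row in describe(spoof):
        print(*row, sep="\t")
    # The two strings render alike but do not compare equal.
    print(latin == spoof)
```

Any character whose name does not start with LATIN is worth a second look in a word that is supposed to be plain English.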
Second: I think "wallet" is just the tip of the iceberg, but it's the only word I've checked so far. When I Google "wallet site:bitcointalk.org -imode", it gives me 66 hits.
Tip of the iceberg, indeed.
This exact issue has spawned a plethora of discussion in Unicode TR 39, Internet RFCs (see especially the RFCs related to IDN, among others), and vendor specifications, not to mention mountains of blog arguments.
I will try to gather up some links for further information. A quote from UTR #39 below should give a brief overview of the types of confusables. I will try to answer questions insofar as I reasonably may.
If this is really a thing, phishing sites would only become more complicated to identify in the future.
A quick check for the first domain registrar shows that it won't work.
Registries (not registrars) typically have policies about this. For example, if memory serves, in .de you can register domains containing äöü but not any other non-ASCII characters. The purpose of such policies is to prevent this type of attack.
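To make the idea of such a policy concrete, here is a minimal sketch of an allowlist check. The ALLOWED set simply encodes the rule as recalled above (ASCII letters, digits, hyphen, plus äöü); it is an assumption for illustration, not the actual published policy of any registry, and label_allowed is a made-up name.

```python
import string

# Hypothetical allowlist: ASCII letters, digits, hyphen, plus a-umlaut,
# o-umlaut and u-umlaut. A real registry's list may differ.
ALLOWED = set(string.ascii_lowercase + string.digits + "-" + "\u00e4\u00f6\u00fc")

def label_allowed(label):
    """True if every character of a domain label is on the allowlist."""
    return all(ch in ALLOWED for ch in label.lower())

if __name__ == "__main__":
    print(label_allowed("wallet"))        # plain ASCII
    print(label_allowed("m\u00fcller"))   # contains u-umlaut
    print(label_allowed("w\u0430llet"))   # contains Cyrillic U+0430
```

Because the Cyrillic lookalikes are simply not in the set, a label containing them is rejected before any similarity analysis is needed.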
I think this should suffice for an overview:
https://www.unicode.org/reports/tr39/tr39-1.html#Confusable_Detection
...there are three main classes of confusable strings:
X and Y are single-script confusables if they are confusable according to the Single-Script table, and each of them is a single script string according to Section 5. Mixed Script Detection. Examples: "so̷s" and "søs" in Latin.
X and Y are mixed-script confusables if they are confusable according to the Mixed-Script table, and they are not single-script confusables. Example: "paypal" in Latin and "paypal" with the 'a' being in Cyrillic.
X and Y are whole-script confusables if they are mixed-script confusables, and each of them is a single script string. Example: "scope" in Latin and "scope" in Cyrillic.
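A rough way to see the difference between the mixed-script and whole-script cases is sketched below. This is not the detection algorithm from the report: Python's standard library does not expose the Unicode Script property, so the sketch approximates a character's script by the first word of its Unicode name, which is only a heuristic. The function names are made up for this illustration.

```python
import unicodedata

def scripts_used(label):
    """Approximate the set of scripts in a string from Unicode character names."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # ignore digits, dots, hyphens
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC", "GREEK"
    return scripts

def is_mixed_script(label):
    """True if letters from more than one (approximate) script appear."""
    return len(scripts_used(label)) > 1

if __name__ == "__main__":
    mixed = "p\u0430yp\u0430l"                 # Latin with Cyrillic U+0430
    whole = "\u0455\u0441\u043e\u0440\u0435"   # "scope" lookalike, all Cyrillic
    print(scripts_used("paypal"), is_mixed_script("paypal"))
    print(scripts_used(mixed), is_mixed_script(mixed))
    print(scripts_used(whole), is_mixed_script(whole))
```

The mixed-script spoof is flagged, but the whole-script one is not, since every letter in it comes from a single script. Catching that case needs confusable data such as the tables the report describes, or a registry-side restriction on which characters may be registered.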