Merits 21 from 7 users
Re: Plagiarism: the difference between "wаllеt" and "wallet"
by
nullius
on 09/03/2018, 20:47:18 UTC
⭐ Merited by Anduck (10), malevolent (4), dbshck (3), LoyceV (1), marlboroza (1), Timelord2067 (1), th3nolo (1)
While processing Bounty Content applications, I stumbled upon something I haven't seen before. I can't really figure out how it's done, but it turns out there's a difference between "wallet" (11,000 hits on Google) and "wallet" (127,000,000 hits on Google)!

You stumbled across a well-known security issue, which usually affects identifiers such as domain names.  E.g., paypal.com vs. paypal.com—see, the same difference!  Or the notorious whole-script confusable, appӏe.com (not the same as apple.com):

Code:
$ echo "apple.com" | hd
00000000  61 70 70 6c 65 2e 63 6f  6d 0a                    |apple.com.|
0000000a
$ echo "appӏe.com" | hd
00000000  d0 b0 d1 80 d1 80 d3 8f  d0 b5 2e 63 6f 6d 0a     |...........com.|
0000000f

First: how is this done? Is there some software that replaces ascii characters by something that looks like it, but can't be found through copy/paste?

Lookalike letters from other scripts, such as Cyrillic and Greek, are used in lieu of Latin letters.  In this case, U+0430 CYRILLIC SMALL LETTER A, which UTF-8 encodes to { 0xd0, 0xb0 }, and U+0435 CYRILLIC SMALL LETTER IE, which UTF-8 encodes to { 0xd0, 0xb5 }:

Code:
$ echo "wallet" | hd
00000000  77 d0 b0 6c 6c d0 b5 74  0a                       |w..ll..t.|
00000009
$ echo "wallet" | hd
00000000  77 61 6c 6c 65 74 0a                              |wallet.|
00000007
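
If you don't have hd at hand, the same inspection can be sketched in Python.  This is only an illustration, not a tool anyone in this thread used: the function name describe is made up, and the lookalike string is written with escapes so that the Cyrillic letters can't get lost in copy/paste.

```python
import unicodedata

def describe(text):
    """Return (char, code point, Unicode name, UTF-8 bytes in hex) per character."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), utf8_hex))
    return rows

# "wallet" with U+0430 and U+0435 in place of Latin 'a' and 'e' (hypothetical sample)
spoofed = "w\u0430ll\u0435t"
genuine = "wallet"

for row in describe(spoofed):
    print(*row)

# The two strings render alike but are not equal, and their byte lengths differ.
print(spoofed == genuine, len(spoofed.encode("utf-8")), len(genuine.encode("utf-8")))
```

The per-character names make the substitution obvious at a glance, which the raw hex dump does not.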

Second: I think "wallet" is just the tip of the iceberg, but it's the only word I've checked so far. When I Google "wallet site:bitcointalk.org -imode", it gives me 66 hits.

Tip of the iceberg, indeed.

This exact issue has spawned a plethora of discussion in Unicode TR 39, Internet RFCs (see especially the RFCs related to IDN, among others), and vendor specifications, not to mention mountains of blog arguments.  I will try to gather up some links for further information.  A quote from UTR #39 below should give a brief overview of the types of confusables.  I will try to answer questions insofar as I reasonably may.


If this is really a thing, phishing sites would only become more complicated to identify in the future.
A quick check for the first domain registrar shows that it won't work.

Registries (not registrars) typically have policies about this.  For example, if memory serves, in .de you can register domains containing ä, ö, and ü but no other non-ASCII characters.  The purpose of such policies is to prevent this type of attack.
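
To illustrate the idea of such a policy, here is a minimal sketch of an allow-list check on a single domain label.  The allowed set below is an assumption taken from my recollection above, not the actual published policy of any registry, and the names ALLOWED and label_allowed are made up for the example.

```python
import string

# Assumed allow-list: ASCII letters, digits, hyphen, plus a-umlaut, o-umlaut, u-umlaut.
# This mirrors the recollection in the post, NOT a real registry's rule set.
ALLOWED = set(string.ascii_lowercase + string.digits + "-" + "\u00e4\u00f6\u00fc")

def label_allowed(label):
    """Return True if every character of the (lowercased) label is on the allow-list."""
    label = label.lower()
    return bool(label) and all(ch in ALLOWED for ch in label)

print(label_allowed("paypal"))            # plain ASCII
print(label_allowed("m\u00fcller"))       # contains u-umlaut, on the allow-list
print(label_allowed("p\u0430yp\u0430l"))  # contains U+0430 CYRILLIC SMALL LETTER A
```

Because Cyrillic and Greek letters are simply absent from the allowed set, a lookalike label is rejected without the registry needing any table of confusables.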



I think this should suffice for an overview:

https://www.unicode.org/reports/tr39/tr39-1.html#Confusable_Detection

Quote from: Unicode Consortium
...there are three main classes of confusable strings:

    X and Y are single-script confusables if they are confusable according to the Single-Script table, and each of them is a single script string according to Section 5. Mixed Script Detection. Examples: "so̷s" and "søs" in Latin.

    X and Y are mixed-script confusables if they are confusable according to the Mixed-Script table, and they are not single-script confusables. Example: "paypal" in Latin and "paypal" with the 'a' being in Cyrillic.

    X and Y are whole-script confusables if they are mixed-script confusables, and each of them is a single script string. Example: "scope" in Latin and "scope" in Cyrillic.
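
For a rough feel of the mixed-script case, the following sketch guesses each letter's script from the first word of its Unicode character name.  That is an approximation I am assuming for illustration only: it is not the Script property, and it does not use the confusables tables that UTR #39 defines.  The function names are made up.

```python
import unicodedata

def scripts_in(text):
    """Approximate the set of scripts in text from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(text):
    """True if letters from more than one (approximate) script appear."""
    return len(scripts_in(text)) > 1

latin = "paypal"
mixed = "p\u0430yp\u0430l"                  # 'a' replaced by U+0430
whole = "\u0455\u0441\u043e\u0440\u0435"    # "scope" spelled entirely in Cyrillic

for sample in (latin, mixed, whole):
    print(sorted(scripts_in(sample)), is_mixed_script(sample))
```

Note the limit: the all-Cyrillic "scope" is not flagged as mixed, because it is a single-script string.  Catching whole-script confusables needs the confusables data from the report itself, which this sketch does not attempt.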