Post
Topic
Board Services
Re: Quick programming bounty: anti-phishing regex - 0.2 BTC
by
dudeami
on 10/11/2013, 04:08:06 UTC
Hey, I have a pure RegExp solution that should work in most circumstances.

Code:
$regex = <<<'REGEXP'
@\[url=                         # Start of URL BBCode
(                               # Group 1
((?:https?:\/\/)?)            # Group 2, capture protocol if it exists
([\da-z\.-]+)                 # Group 3, capture the hostname, without the TLD
\.                            # Need a period between the hostname and TLD
([a-z\.]{2,6})                # Group 4, TLD
((?:[\/\w \.-]*)*\/?)         # Group 5, path
)
\]
\s*?                            # Allow whitespace, including newlines, before the link text
(                               # Group 6
(?!.*\3.+\4|.*\[img\].*)    # Lookahead to check for the non-phishing domain and images
.*?
((?:https?:\/\/)?)            # Group 7, phishing URL protocol
(                             # Group 8, phishing URL host
(?:[\da-z\.-]               # Match any characters normally found in a URL
|                           # or
\[[^\]]+\]                  # Match any BBCode
|                           # or
[^\x00-\x7F])+              # Match any non-ASCII (unicode) bytes
)                            
(?:\.|[^\x00-\x7F]+)          # Need a period, or non-ASCII characters standing in for one
([a-z\.]{2,6})                # TLD
((?:[\/\w \.-]*)*\/?)         # Path
.*?
|                             # or
[^ ]+\[img\].*\[\/img\][^ ]+  # An image with non-space characters on both sides
)
\s*?                            # Allow any whitespace before the closing tag
\[/url\]                        # End of URL BBCode
@xmi
REGEXP;
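
The post doesn't show how the pattern gets applied, but the outputs in the test cases further down are consistent with replacing each match with `[url=$1]$1[/url]`, i.e. showing the real target as the link text. Here is a minimal sketch of that step, assuming that replacement string. `rewrite_flagged_links` and the toy `$demoPattern` are made up for illustration; the toy pattern is far less selective than the one above and only exists so the snippet runs on its own.

```php
<?php
// Hypothetical helper, not part of the original solution.
// Assumes group 1 of $pattern captures the [url=...] target.
function rewrite_flagged_links(string $pattern, string $text): string
{
    $result = preg_replace($pattern, '[url=$1]$1[/url]', $text);
    // preg_replace() returns null on a PCRE error (e.g. backtrack limit hit);
    // in that case hand back the input untouched.
    return $result ?? $text;
}

// Toy stand-in: flags any link whose text is not exactly its target.
$demoPattern = '@\[url=([^\]]+)\](?!\1\[/url\]).*?\[/url\]@si';

echo rewrite_flagged_links(
    $demoPattern,
    '[url=http://phishing.com]http://safe-site.com/login.php[/url]'
), "\n";
// prints [url=http://phishing.com]http://phishing.com[/url]
```

In practice you would pass `$regex` from above as the first argument instead of the toy pattern.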

This will fail if any unicode characters are used inside of words (the link text gets flagged even though it is harmless), but other than that it should be selective enough. There is an example of this failure at the bottom of the post.
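
To see why unicode inside a word trips it up: the pattern has no `u` modifier, so it works on bytes, and `[^\x00-\x7F]+` is accepted wherever a period would be. Text like `Hello☺World` then reads as host + separator + TLD. A small sketch using only a simplified host/separator/TLD fragment (not the full pattern; `HOST_SEP_TLD` and `resembles_host` are names invented here):

```php
<?php
// Simplified fragment for illustration only: host, separator, TLD.
// No "u" modifier, so non-ASCII characters are seen as raw bytes.
const HOST_SEP_TLD = '@([\da-z\.-]+)(?:\.|[^\x00-\x7F]+)([a-z\.]{2,6})@i';

function resembles_host(string $label): bool
{
    return preg_match(HOST_SEP_TLD, $label) === 1;
}

$smiley = "\xE2\x98\xBA"; // UTF-8 bytes of U+263A (white smiling face)

var_dump(resembles_host("Hello{$smiley}World"));   // bool(true): the smiley acts as the "dot"
var_dump(resembles_host("Hello {$smiley} World")); // bool(false): spaces separate the pieces
var_dump(resembles_host('safe-site.com'));         // bool(true): an ordinary host
```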

Edit: It will also fail when the legit URL's "." is replaced with " " (or any other non-alphanumeric, non-unicode character) and the link points to a phishing site. Expanding the TLD sections to match only real-world TLDs could fix this issue to an extent.
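
A rough sketch of what the TLD idea could look like: build an alternation from a list of known TLDs and use it in place of `[a-z\.]{2,6}`, which in turn lets the separator be any short run of non-alphanumeric characters. The four-entry list and the names `KNOWN_TLDS`, `tld_alternation` and `mentions_hostname` are placeholders, not part of the solution above.

```php
<?php
// Placeholder list; a real deployment would need the full set of TLDs.
const KNOWN_TLDS = ['com', 'net', 'org', 'io'];

// Turn the list into a non-capturing alternation such as (?:com|net|org|io).
function tld_alternation(array $tlds): string
{
    $quoted = array_map(fn($tld) => preg_quote($tld, '@'), $tlds);
    return '(?:' . implode('|', $quoted) . ')';
}

// True if the text contains host-like characters, then 1 to 3
// non-alphanumeric characters, then a known TLD ending at a word boundary.
function mentions_hostname(string $label): bool
{
    $pattern = '@[\da-z-]+[^\da-z]{1,3}' . tld_alternation(KNOWN_TLDS) . '\b@i';
    return preg_match($pattern, $label) === 1;
}

var_dump(mentions_hostname('http://safe-site com')); // bool(true): space used as the "dot"
var_dump(mentions_hostname('Hello world'));          // bool(false): "world" is not a known TLD
```

The trade-off is that ordinary sentences ending in a word that happens to be a TLD (for example "join the org") would also match, so this only helps "to an extent".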


success
Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=phishing.com]
safe-site.com[/url]
Code:
[url=phishing.com]phishing.com[/url]
success
Code:
[url=http://phishing.com]http://safe-site.com/login.php[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
Code:
[url=http://phishing.com]http://phishing.com[/url][nobbc]http://safe-site.com/login.php[/url][/nobbc]
success
Code:
[url=http://phishing.com]Welcome to safe-site.com![/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com][b]safe[/b]-site.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site.io[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site⠠com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]safe-site .com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://phishing.com]s a f e-s i t e.com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
success
Code:
[url=http://safe-site.io]safe-site.com[/url]
Code:
[url=http://safe-site.io]http://safe-site.io[/url]
success
Code:
[url=http://phishing.com]safe-site[img]http://asdf.com/period.png[/img]com[/url]
Code:
[url=http://phishing.com]http://phishing.com[/url]
failure
Code:
[url=http://phishing.com]http://safe-site com[/url]
Code:
[url=http://phishing.com]http://safe-site com[/url]
success
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
success
Code:
[url=safe-site.com]safe-site.com[/url]
Code:
[url=safe-site.com]safe-site.com[/url]
success
Code:
[url=http://safe-site.com]safe-site.com[/url]
Code:
[url=http://safe-site.com]safe-site.com[/url]
success
Code:
[url=safe-site.com]http://safe-site.com[/url]
Code:
[url=safe-site.com]http://safe-site.com[/url]
success
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
Code:
[url=http://safe-site.com][img]http://asdf.com/image.png[/img][/url]
success
Code:
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img][/url]
Code:
[url=http://safe-site.com]  [img]http://asdf.com/image.png[/img][/url]
success
Code:
[url=http://safe-site.com]safe-site.com is a good site[/url]
Code:
[url=http://safe-site.com]safe-site.com is a good site[/url]
success
Code:
[url=http://safe-site.com]Welcome to safe-site.com![/url]
Code:
[url=http://safe-site.com]Welcome to safe-site.com![/url]
success
Code:
[url=http://safe-site.com]こんにちは。[/url]
Code:
[url=http://safe-site.com]こんにちは。[/url]
success
Code:
Some normal text
Code:
Some normal text
success
Code:
[url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]
Code:
[url=http://safe-site.com]☺☺☺ Hello ☺☺☺[/url]
failure
Code:
[url=http://safe-site.com]Hello☺World[/url]
Code:
[url=http://safe-site.com]http://safe-site.com[/url]
success
Code:
[url=http://safe-site.com]Hello ☺ World[/url]
Code:
[url=http://safe-site.com]Hello ☺ World[/url]