Post
Topic
Board Service Discussion
Re: GAW / Josh Garza discussion. Paycoin XPY CoinStand Mineral. ALWAYS MAKE MONEY :)
by
Crestington
on 19/04/2015, 02:26:46 UTC
Hey guys, just in case you wanted to build your own local, searchable archive of humorous scammery: this script will scrape the entire hashtalk site and download every thread in RSS form.

This will create a shitload of RSS files, so be sure to run it in its own subdirectory, e.g. ~/hashtalkscambullshit/

Code:
nano get_ht.py

then paste in the following

Code:
#!/usr/bin/env python

# if your name is homero, better hire the nation's smartest developers and most feared lawyers to help you with this... we all know you have trouble with computers.

import os, time

def getstart():
        # int() so the range arithmetic below works on both Python 2 and Python 3
        startfrom = int(input('What post to start from? enter 0 to start from beginning: '))
        endon = int(input('What post to finish on? enter something like 40000 to get the whole site: '))
        runloop( startfrom, endon )


def runloop( i, x ):
        currentlyat = 0
        pauseat = 10
        total = x - i + 1  # the loop below includes both endpoints
        while i <= x:
                if currentlyat == pauseat:
                        # every 10 threads: show status and pause for a second
                        os.system('clear')
                        print("Always Profitable! Hang on, grabbing threads %s - %s.... Hopefully PayCoin doesn't reach $0 before we're finished!" % ( i, i + 10))
                        pauseat = pauseat + 10
                        time.sleep(1)
                else:
                        # wget runs in the background (-b) and appends its log to htwgetlog (-a)
                        os.system('wget -b -a htwgetlog --no-check-certificate -q --show-progress https://hashtalk.org/topic/%s.rss > /dev/null' % i)
                        i = i + 1
                        currentlyat = currentlyat + 1
        os.system('clear')
        print("%s threads downloaded." % total)
        print("We're done! Enjoy searching the scam database.")



getstart()

then execute it

Code:
python get_ht.py

It'll ask you which post to start from (0 if you're beginning, or another number if you're resuming a previous scrape).

Then it'll ask which post to stop at. I don't know what the highest post number currently is; I haven't bothered to check.
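
If you're resuming and can't remember where you left off, something like this might help. It's only a rough sketch: it assumes wget saved each thread as <number>.rss in the folder you point it at (that's what wget normally does with these URLs, but check your folder first), and the name last_downloaded is just something I made up for the example.

```python
import os
import re

def last_downloaded(directory="."):
    """Return the highest N for which a file named N.rss exists, or None."""
    pattern = re.compile(r"^(\d+)\.rss$")
    numbers = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1)))
    # None means nothing has been downloaded here yet
    return max(numbers) if numbers else None
```

Call last_downloaded() from inside your scrape folder and use the result as the "start from" number. Since wget runs in the background, the last few files may be incomplete, so starting a little earlier is probably safer.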


Oh, what's that you say? You say you're only interested in Homero's sales pitches? No problem!

Code:
#!/bin/bash
wget https://hashtalk.org/user/mrceo/topics.rss
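
Once you have a pile of .rss files, the "searchable" part could look roughly like this. It's a sketch, not a finished tool: it assumes the files are ordinary RSS 2.0 with <item> elements that carry <title> and <description>, and search_rss is a made-up name. Files that aren't valid XML (failed or half-finished downloads) are skipped.

```python
import glob
import os
import xml.etree.ElementTree as ET

def search_rss(keyword, directory="."):
    """Yield (filename, item title) for every RSS item mentioning keyword."""
    needle = keyword.lower()
    for path in sorted(glob.glob(os.path.join(directory, "*.rss"))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # empty or non-XML file, e.g. a failed download
        for item in root.iter("item"):
            title = item.findtext("title") or ""
            body = item.findtext("description") or ""
            # case-insensitive match against title and body text
            if needle in title.lower() or needle in body.lower():
                yield os.path.basename(path), title
```

For example, list(search_rss("hashlet")) run inside the scrape folder would give you the file name and item title for each match.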


bump...

updated to reflect that there are only about 37000 posts total, so set your max to ~38000 to catch everything.

TypeError: not all arguments converted during string formatting


Sorry! I think I fixed the issue.
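
For anyone who runs into that error elsewhere: Python raises it when a % format string is handed more values than it has %s placeholders. Here's a minimal reproduction, purely as an illustration (I'm not claiming this is exactly what was wrong in the script); fmt is a throwaway helper name.

```python
def fmt(template, values):
    """Apply %-formatting and return the error text instead of crashing."""
    try:
        return template % values
    except TypeError as err:
        return "TypeError: %s" % err

# two placeholders, two values: works
print(fmt("threads %s - %s", (10, 20)))
# one placeholder, two values: raises the TypeError quoted above
print(fmt("threads %s", (10, 20)))
```

The fix is to make the number of placeholders match the number of values in the tuple.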

Fun fact: This thread alone has 33,000 posts