Re: GAW Zen Hashlet PayCoin unofficial uncensored discussion. ALWAYS MAKE MONEY :-)
Board: Service Discussion
by ikeboy on 24/12/2014, 21:57:37 UTC
Hashtalk started banning the IP addresses my scripts run from, so I'm posting them here. Please run them on a server you don't need to browse hashtalk from.

>cat archivehashtalk.sh
# Pull every URL out of the crawl output, then submit each one.
sed -ne 's/.*\(http[^"]*\).*/\1/p' < archive.json > archive.list
while read -r url; do ./archive.sh "$url"; done < archive.list
>cat archive.sh
# Submit one URL to archive.today. --data-urlencode keeps the URL intact
# even when it contains & or other special characters.
curl --data-urlencode "url=$1" https://archive.today/submit/
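
To sanity-check the submitter on its own, you can feed it one URL by hand (this topic URL is just a made-up example):
>./archive.sh https://hashtalk.org/topic/123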

>cat run.sh
# Loop forever: each pass re-reads archive.json, which scrapy is still appending to.
while true
do
  ./archivehashtalk.sh
done

>cat scrapesite.py
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    # One crawled URL per item; the json exporter writes these to archive.json.
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["hashtalk.org"]
    start_urls = ["https://hashtalk.org"]

    # Follow every link on the site and record each page we land on.
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
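
For reference, the json feed scrapy writes is a list with one object per item, roughly one per line; that shape is what the sed line in archivehashtalk.sh pulls the URLs out of. A made-up sample of archive.json (these topic URLs are invented):

[{"link": "https://hashtalk.org/topic/1"},
{"link": "https://hashtalk.org/topic/2"}]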



Instructions:
1. Install scrapy.
2. Create all of these files, with the names shown, in one directory, and chmod +x the three shell scripts.
3. Install tmux so you can run both halves at once.
4. Run "scrapy runspider scrapesite.py -t json -o archive.json".
5. Open a new tmux window and run ./run.sh
6. When you get banned (you'll see 403 errors coming from scrapy), switch your public IP address (one way to do that on EC2 is sketched below). I'm doing this on the EC2 free tier, but maybe people have better servers lying around.
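
If you're on EC2, one way to switch the public IP is to grab a fresh Elastic IP and swap it onto the instance. This is just a sketch under the assumption that you have the aws CLI installed and configured; the instance ID is a placeholder, and Amazon limits how many Elastic IPs you can hold at once, so release old ones as you go.

>cat newip.sh
# Hypothetical helper, not part of the setup above: allocate a new Elastic IP
# and associate it with this (VPC) instance, replacing its public IP.
INSTANCE_ID=i-xxxxxxxx   # placeholder: your instance ID
ALLOC=$(aws ec2 allocate-address --domain vpc --query AllocationId --output text)
aws ec2 associate-address --instance-id "$INSTANCE_ID" --allocation-id "$ALLOC"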

I know the code is a little hacked up, but it works, and nobody else was doing it. I don't want to provide support for this; enough people seeing this will know what to do, and we only need a few of them to catch all the pages. If you know code, understand what I'm doing, and think it should be done differently, I'll take advice.