
We need more data - Scrapy


This post is about the Starcraft bot I am developing using machine learning. The project is being developed as part of the "Daj Się Poznać 2017" competition.

Every aspiring Starcraft 2 player knows that playing alone is not enough to get good. Among other things, you need to analyze your own games, as well as the games of players better than you. It would be nice if my bot could do something similar, at least to a limited extent, so I will need replays. Where to get them? The most common way is to take them from replay-hosting sites, where players publish them themselves. Organizers of large tournaments also provide game packs from professional players, but scouring the Internet for those packs does not interest me.

So I decided to write a simple scraper that walks through the listing subpages, pulls the links out of the tables, and downloads the replays.

The code

First, of course, I installed Scrapy (it later turned out that I would also need the requests library):

pip install scrapy

To start using Scrapy you can either write some simple script right away and fire it using the scrapy runspider <script> command, or create a project. I chose the latter option and ran the following command in the root directory of the project:

scrapy startproject replays
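For reference, this generates the standard Scrapy project skeleton (file names per Scrapy's defaults):

```text
replays/
    scrapy.cfg          # deploy configuration
    replays/
        __init__.py
        items.py        # item models
        pipelines.py
        settings.py     # project settings
        spiders/
            __init__.py
```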

Next, in settings.py, I enabled the pipeline that downloads files and set the path where the files should be stored:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = "./files"
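With this pipeline enabled, every downloaded file lands under FILES_STORE in a full/ subdirectory, named after the SHA1 hash of its URL. A minimal sketch of that naming scheme (my own reimplementation for illustration, not the Scrapy API):

```python
import hashlib
import os
from urllib.parse import urlparse


def files_pipeline_path(url):
    # FilesPipeline stores a file as full/<sha1 of the URL><extension>,
    # relative to FILES_STORE
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1]
    return 'full/%s%s' % (digest, ext)


print(files_pipeline_path('https://example.com/replays/322d3a.SC2Replay'))
```

This is also why stripping the query string from the URL matters: a different query string would produce a completely different hash and filename.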

Then, in items.py, I created a simple model with only the fields that Scrapy requires to download files:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ReplayItem(scrapy.Item):
    # FilesPipeline reads file_urls and writes its results into files
    file_urls = scrapy.Field()
    files = scrapy.Field()
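These two fields are the whole contract with FilesPipeline: the spider fills in file_urls, and after downloading, the pipeline writes its results into files. Conceptually (plain dicts for illustration — the real objects are scrapy.Item instances, and the hash and checksum values here are made up):

```python
# what the spider yields
item_before = {'file_urls': ['https://example.com/322d3a.SC2Replay']}

# what comes out after FilesPipeline has run
item_after = {
    'file_urls': ['https://example.com/322d3a.SC2Replay'],
    'files': [{
        'url': 'https://example.com/322d3a.SC2Replay',
        'path': 'full/0a1b2c.SC2Replay',  # relative to FILES_STORE
        'checksum': 'd41d8cd9',           # hash of the file body
    }],
}
```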

In the spiders directory, I created a file with the following content:

import scrapy
import requests
from replays.items import ReplayItem

class SpawningToolSpider(scrapy.Spider):
    name = 'spawning-tool-spider'
    # the first page with replays; Scrapy has to start somewhere
    start_urls = ['https://lotv.spawningtool.com/replays/']
    # scrapy, run from the console, fires this method and starts scraping
    def parse(self, response):
        for row in response.css('table tr'):
            href = row.css('td:last-child ::attr(href)').extract_first()
            # skip rows without a download link (e.g. the table header)
            if href is None:
                continue
            url = response.urljoin(href)

            # a typical download path looks like /<number>/download/,
            # but the download answers with a 302 redirect to Amazon,
            # so something had to be done about it, because Scrapy refused to work.
            # The solution turned out to be pulling the Location field out of
            # the headers, which holds the correct address
            # (requests.head does not follow redirects by default)
            request_response = requests.head(url)

            if request_response.status_code == 302:
                url = request_response.headers["Location"]

            # split on '?', because Scrapy tried to save the file with the
            # query string in its name (something like 322d3a25z2z.SC2Replay?key=0JIDaAJ)
            url = url.split('?')

            # ok, we have the path to the file
            yield ReplayItem(file_urls=[url[0]])

        # go to the next page
        next_page = response.css('a.pull-right ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
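The split('?') trick works, but an alternative sketch using only the standard library's urllib.parse makes the intent — drop the query string, keep everything else — explicit:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    # keep scheme, host and path; drop the ?key=... part so the
    # saved filename does not contain the query string
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


print(strip_query('https://s3.amazonaws.com/r/322d3a25z2z.SC2Replay?key=0JIDaAJ'))
# https://s3.amazonaws.com/r/322d3a25z2z.SC2Replay
```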

Then all that was left was to run the command in the console:

scrapy crawl spawning-tool-spider

What's next?

For now, though, I'm giving myself a break from trying to predict what the environment from Blizzard and DeepMind will contain. I haven't even downloaded the replays yet, because I don't know whether there is any point. This week I think I'll try to recreate simple Starcraft micro scenarios for web browsers, and on top of that write examples using neural networks and reinforcement learning.