Stefano Fiorucci

Twin Peaks crawler

This crawler downloads text and metadata from the Twin Peaks Fandom Wiki and writes the output as JSON. It combines Scrapy and fandom-py.

Several wiki pages are discarded because they are unrelated to the Twin Peaks plot and would add noise to the Question Answering index.
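As an illustration of how such filtering might look, here is a minimal sketch of a title-based filter. The exclusion rules (`EXCLUDED_PREFIXES`, `EXCLUDED_KEYWORDS`) and the function name `is_plot_related` are hypothetical and are not taken from the crawler's code; the real criteria may differ.

```python
# Hypothetical exclusion rules, for illustration only.
# The crawler's actual criteria are not documented in this README.
EXCLUDED_PREFIXES = ("Category:", "File:", "Template:", "User:")
EXCLUDED_KEYWORDS = ("soundtrack", "dvd")


def is_plot_related(title: str) -> bool:
    """Return True if a page title passes the (assumed) exclusion rules."""
    # Skip wiki namespaces that do not hold article content.
    if title.startswith(EXCLUDED_PREFIXES):
        return False
    # Skip titles containing keywords assumed to be off-topic.
    lowered = title.lower()
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


titles = ["Dale Cooper", "Category:Characters", "Twin Peaks (soundtrack)"]
kept = [t for t in titles if is_plot_related(t)]
print(kept)
```

Filtering on titles before downloading keeps unwanted pages out of the output entirely, so the index never has to be cleaned afterwards.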

Installation

  • pip install -r requirements.txt
  • copy this folder (if needed, see Stack Overflow)

Usage

  • (if needed, activate the virtual environment)
  • cd tpcrawler
  • scrapy crawl tpcrawler
  • the downloaded pages are saved in the data subfolder
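Once the crawl has finished, the output can be inspected with a few lines of Python. This is a rough sketch that assumes the data subfolder contains one or more `.json` files; the file layout and the structure of each JSON document are assumptions, and `load_pages` is a hypothetical helper, not part of the crawler.

```python
import json
from pathlib import Path


def load_pages(data_dir):
    """Load every *.json file found directly inside data_dir.

    Assumption: the crawler writes plain JSON files into this folder.
    Returns a list with one parsed object per file, in filename order.
    """
    pages = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages


if __name__ == "__main__":
    # "data" is relative to the folder where the crawl was run.
    pages = load_pages("data")
    print(f"Loaded {len(pages)} JSON file(s)")
```

If the folder is missing or empty, the helper returns an empty list, which makes it easy to check whether the crawl produced any output.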