WIScker

WIScker (Wikipedia Idea Scraper) is a tool for scraping old versions of Wikipedia articles. The corpus resulting from the scrape is then available to you to analyze as you see fit. WIScker supports all languages of Wikipedia.
WIScker was created at the University of Alberta under the INKE project.
The entire tool is written in JavaScript, so you can download it for your own use by saving this webpage to your computer. The WIScker source code is also available on GitHub.

To scrape an article, paste the URL of the Wikipedia article into this form and enter the duration you would like the scrape to cover. Then select either the Date or the Frequency scraping type. The scraping type determines how often revisions are collected.
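The article URL carries both the language edition and the page title, which is enough to address that article's revision history through the public MediaWiki API. The following is a minimal sketch of that step, not the WIScker source: the function names `parseArticleUrl` and `buildRevisionQuery` are hypothetical, and the choice of 50 revisions per request is an assumption. The query parameters themselves (`action=query`, `prop=revisions`, `rvstart`, `rvend`, `rvdir`, `rvlimit`, `origin`) are standard MediaWiki API parameters.

```javascript
// Hypothetical helper (not from the WIScker source): split a Wikipedia
// article URL into its language edition and page title.
// Only hosts of the form <lang>.wikipedia.org are accepted, so mobile
// URLs such as en.m.wikipedia.org are rejected in this sketch.
function parseArticleUrl(articleUrl) {
  const url = new URL(articleUrl);
  const hostMatch = url.hostname.match(/^([a-z0-9-]+)\.wikipedia\.org$/i);
  if (!hostMatch || !url.pathname.startsWith("/wiki/")) {
    throw new Error("Not a Wikipedia article URL: " + articleUrl);
  }
  const title = decodeURIComponent(url.pathname.slice("/wiki/".length));
  return { lang: hostMatch[1].toLowerCase(), title: title };
}

// Hypothetical helper: build a MediaWiki API URL that lists the revisions
// of one article between two dates, oldest first.
function buildRevisionQuery(articleUrl, startDate, endDate) {
  const { lang, title } = parseArticleUrl(articleUrl);
  // Drop milliseconds so the timestamp reads like 2010-01-01T00:00:00Z.
  const stamp = (d) => d.toISOString().replace(/\.\d{3}Z$/, "Z");
  const params = new URLSearchParams({
    action: "query",
    prop: "revisions",
    titles: title,
    rvprop: "ids|timestamp",
    rvdir: "newer",          // oldest first, so rvstart is the earlier date
    rvstart: stamp(startDate),
    rvend: stamp(endDate),
    rvlimit: "50",           // assumed batch size
    format: "json",
    origin: "*",             // allows anonymous cross-origin requests from a browser
  });
  return "https://" + lang + ".wikipedia.org/w/api.php?" + params.toString();
}
```

For example, calling `buildRevisionQuery("https://fr.wikipedia.org/wiki/Alan_Turing", new Date("2010-01-01T00:00:00Z"), new Date("2011-01-01T00:00:00Z"))` yields a URL on `fr.wikipedia.org` that can be passed to `fetch`. Articles with more than one batch of revisions would need follow-up requests, which this sketch leaves out.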

Once the form is filled in, press the "Scrape" button and WIScker will print the corpus below the Wikipedia window. To copy the resulting text, press the "Select Text" button, then copy it and paste it into whatever software you wish to use to analyze it. Alternatively, press the "Send to Voyant" button, which will automatically upload the corpus to Voyant Tools and open it in a new tab. For best results, use XML when sending to Voyant. The "Reset" button clears the form and the output section.

Why is it taking so long?
An article that has many revisions will take longer to scrape. You can check how many revisions an article has by clicking "View History" near the top of its Wikipedia page.

What is the difference between plain text formatting and Wikipedia style formatting?
Plain text format is more readable, but it removes data such as links and the infobox that Wikipedia style formatting retains.

When scraping by date, when, exactly, does WIScker collect a revision of an article?
WIScker scrapes the text of the article as it was on the selected day. However, if the start date is earlier than the article's creation, WIScker skips ahead and starts at the date of the article's creation.
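The behaviour described here can be sketched as picking, for each sampled date, the newest revision made at or before it, after moving the start date forward to the article's creation if needed. This is an illustration under stated assumptions rather than WIScker's actual code: the helper names `revisionAsOf` and `sampleByDate`, the `revid` and `timestamp` fields (borrowed from the MediaWiki API's revision objects), and the fixed step in days are all hypothetical, and "as it was on the day" is interpreted as the state at the sampled instant.

```javascript
// Hypothetical helper: return the revision that was live at `target`,
// i.e. the newest revision made at or before that moment, or null if
// the article did not exist yet.
// `revisions` must be sorted oldest first; each has an ISO `timestamp`.
function revisionAsOf(revisions, target) {
  let current = null;
  for (const rev of revisions) {
    if (new Date(rev.timestamp) <= target) {
      current = rev;
    } else {
      break; // later revisions are all newer than the target
    }
  }
  return current;
}

// Hypothetical helper: take one sample every `stepDays` days between
// `start` and `end`. If `start` precedes the first revision, sampling
// begins at the article's creation instead.
function sampleByDate(revisions, start, end, stepDays) {
  if (revisions.length === 0) return [];
  if (!(stepDays > 0)) throw new Error("stepDays must be positive");
  const created = new Date(revisions[0].timestamp);
  const first = start < created ? created : start;
  const stepMs = stepDays * 24 * 60 * 60 * 1000;
  const samples = [];
  for (let t = first.getTime(); t <= end.getTime(); t += stepMs) {
    const rev = revisionAsOf(revisions, new Date(t));
    samples.push({ date: new Date(t).toISOString(), revid: rev.revid });
  }
  return samples;
}
```

With an article created on 10 March 2005 and a requested start of 1 January 2005, the first sample is taken on 10 March 2005 rather than on a date when no text existed.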

What formats can the corpus be printed in?
You can request that WIScker print the resulting text as XML or in a human-readable format. Characters such as "&" and "<" are removed from the resulting text to prevent XML errors.
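A minimal sketch of that clean-up step follows. It is not taken from the WIScker source: the helper names, the `corpus` and `revision` element names, and the `date` attribute are hypothetical, and removing ">" alongside "&" and "<" is an assumption made here.

```javascript
// Hypothetical helper: remove characters that would otherwise be read
// as XML markup. The FAQ names "&" and "<"; ">" is an added assumption.
function stripXmlUnsafe(text) {
  return text.replace(/[&<>]/g, "");
}

// Hypothetical helper: wrap each scraped revision in an element.
// Element and attribute names are illustrative only.
function toXmlCorpus(revisions) {
  const items = revisions.map((rev) => {
    // Quotes are also dropped from the attribute value so it stays well-formed.
    const date = stripXmlUnsafe(rev.date).replace(/"/g, "");
    return '  <revision date="' + date + '">' + stripXmlUnsafe(rev.text) + "</revision>";
  });
  return "<corpus>\n" + items.join("\n") + "\n</corpus>";
}
```

Removing the characters, rather than escaping them as `&amp;` and `&lt;`, loses a little information but keeps the output free of entities, which may suit word-frequency analysis.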