Metadata-Version: 2.1
Name: scrapy-wayback-middleware
Version: 0.2.0
Summary: Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine
Home-page: https://github.com/City-Bureau/scrapy-wayback-middleware
Author: Pat Sier
Author-email: pat@citybureau.org
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Framework :: Scrapy
Requires-Python: >=3.5,<4.0
Description-Content-Type: text/markdown
Requires-Dist: scrapy

# Scrapy Wayback Middleware

[![Build Status](https://travis-ci.org/City-Bureau/scrapy-wayback-middleware.svg?branch=master)](https://travis-ci.org/City-Bureau/scrapy-wayback-middleware)

Middleware for submitting all scraped response URLs to the [Internet Archive Wayback Machine](https://archive.org/web/) for archival.

## Installation

```bash
pip install scrapy-wayback-middleware
```

## Setup

Add `scrapy_wayback_middleware.WaybackMiddleware` to your project's `SPIDER_MIDDLEWARES` setting. By default, the middleware makes `GET` requests to `web.archive.org/save/{URL}`. If the `WAYBACK_MIDDLEWARE_POST` setting is `True`, it makes `POST` requests to [`pragma.archivelab.org`](https://archive.readme.io/docs/creating-a-snapshot) instead.
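
As a minimal sketch, the setup described above might look like this in a project's `settings.py`. The order value `500` is an assumed placeholder, not a value prescribed by this package; choose whatever fits your project's other spider middlewares.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # 500 is an assumed placeholder order value, not a documented requirement.
    "scrapy_wayback_middleware.WaybackMiddleware": 500,
}

# Optional: send POST requests to pragma.archivelab.org instead of
# GET requests to web.archive.org/save/{URL}.
WAYBACK_MIDDLEWARE_POST = True
```

Leave `WAYBACK_MIDDLEWARE_POST` unset to keep the default `GET` behavior.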

## Configuration

To customize the middleware, subclass `WaybackMiddleware` and override `get_item_urls` to pull additional links to archive from individual items, or `handle_wayback` to change how responses from the Wayback Machine are handled. The `WAYBACK_MIDDLEWARE_POST` setting can be set to `True` to switch from `GET` to `POST` requests.
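
The following is a rough sketch of overriding `get_item_urls`, not the package's reference usage. It assumes the method receives a single scraped item and returns a list of URL strings; check the signature against the installed version. The `links` field, the `document_links` helper, and the `DocumentWaybackMiddleware` name are all hypothetical.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in so this sketch can be run without the package installed.
    class WaybackMiddleware:
        pass


def document_links(item):
    """Return the URLs stored under the hypothetical "links" field of an item."""
    return [link["href"] for link in item.get("links", []) if link.get("href")]


class DocumentWaybackMiddleware(WaybackMiddleware):
    def get_item_urls(self, item):
        # Assumed signature: takes one scraped item, returns URLs to archive.
        return document_links(item)
```

To use a subclass like this, you would list its import path in `SPIDER_MIDDLEWARES` in place of `scrapy_wayback_middleware.WaybackMiddleware`.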

### Duplicate Filtering

To avoid sending duplicate requests when `WAYBACK_MIDDLEWARE_POST` is `False`, either include `web.archive.org` in your spider's `allowed_domains` property (if specified) or disable `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` in your settings.
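
As a sketch of the second option, Scrapy disables a built-in middleware when its entry in `SPIDER_MIDDLEWARES` is set to `None`. The order value `500` is again an assumed placeholder.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # Setting a middleware's value to None disables it in Scrapy.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # 500 is an assumed placeholder order value.
    "scrapy_wayback_middleware.WaybackMiddleware": 500,
}
```

If you would rather keep offsite filtering, add `"web.archive.org"` to the spider's `allowed_domains` list instead.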

### Rate Limits

Neither endpoint returns headers indicating specific rate limits, but the `GET` endpoint at `web.archive.org/save` is limited to 25 requests per minute, resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 status code.
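
To illustrate the wait-on-429 idea in isolation, here is a small standalone sketch. It is not the middleware's actual implementation: `fetch_with_wait`, the `fetch` callable, and `max_attempts` are hypothetical names introduced for the example.

```python
import time

RATE_LIMIT_STATUS = 429
WAIT_SECONDS = 60  # matches the 60-second wait described above


def fetch_with_wait(fetch, url, max_attempts=3, sleep=time.sleep):
    """Call fetch(url), pausing WAIT_SECONDS after each 429 before retrying.

    `fetch` is any callable that returns an HTTP status code (hypothetical).
    Returns the last status code seen.
    """
    status = None
    for attempt in range(max_attempts):
        status = fetch(url)
        if status != RATE_LIMIT_STATUS:
            return status
        # Only wait if another attempt is still to come.
        if attempt < max_attempts - 1:
            sleep(WAIT_SECONDS)
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually pausing for a minute.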


