Wayback Machine Enricher (and Extractor)

Module type

enricher, extractor

Submits the current URL to the Wayback Machine for archiving and returns either a job ID or the completed archive URL.

Features

  • Archives URLs using the Internet Archive’s Wayback Machine API.

  • Supports conditional archiving based on the existence of prior archives within a specified time range.

  • Supports optional HTTP and HTTPS proxies for requests sent to the Wayback Machine.

  • Fetches and confirms the archive URL or provides a job ID for later status checks.
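
As a rough illustration of the conditional archiving feature, the sketch below builds the parameters for a submission request without sending it. The endpoint URL, the `LOW key:secret` authorization header, and the `if_not_archived_within` form field follow the Internet Archive's Save Page Now API as commonly documented; they are assumptions here, not something this page confirms about the module's internals. The function name `build_save_request` is hypothetical.

```python
from typing import Optional

# Assumed Save Page Now endpoint; not confirmed by this page.
SAVE_ENDPOINT = "https://web.archive.org/save"


def build_save_request(
    url: str,
    key: str,
    secret: str,
    if_not_archived_within: Optional[int] = None,
) -> dict:
    """Return the pieces of a submission request (hypothetical helper).

    Nothing is sent over the network; the caller decides how to post it.
    """
    headers = {
        "Accept": "application/json",
        # Assumed header format for Internet Archive S3-style credentials.
        "Authorization": f"LOW {key}:{secret}",
    }
    data = {"url": url}
    # Only add the condition when a time range was configured.
    if if_not_archived_within is not None:
        data["if_not_archived_within"] = str(if_not_archived_within)
    return {"url": SAVE_ENDPOINT, "headers": headers, "data": data}
```

Leaving `if_not_archived_within` unset omits the field entirely, which matches the documented default of None (no condition applied).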

Notes

  • Requires a valid Wayback Machine API key and secret.

  • Handles rate limiting by the Wayback Machine and retries status checks with exponential backoff.
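
The retry behaviour described above can be pictured with the following minimal sketch. It is not the module's actual implementation: the function name `poll_with_backoff`, the initial delay of 1 second, and the doubling factor are all assumptions chosen for illustration. The `check_status` callable stands in for whatever performs the status lookup and is expected to return the archive URL once available, or `None` while the job is still pending.

```python
import time
from typing import Callable, Optional


def poll_with_backoff(
    check_status: Callable[[], Optional[str]],
    timeout: float = 15.0,
    initial_delay: float = 1.0,  # assumed starting delay
    factor: float = 2.0,         # assumed backoff multiplier
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call check_status until it returns a value or the timeout passes.

    Returns the value from check_status, or None on timeout, in which
    case the caller would fall back to reporting the job ID.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        result = check_status()
        if result is not None:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            return None
        # Never sleep past the deadline; grow the delay for the next round.
        sleep(min(delay, remaining))
        delay *= factor
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults apply.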

Steps to Get a Wayback API Key:

Configuration Options

YAML

# steps configuration
steps:
...
  enrichers:
  - wayback_extractor_enricher
  extractors:
  - wayback_extractor_enricher
...

# module configuration
...

wayback_extractor_enricher:
  timeout: 15
  if_not_archived_within:
  key: ''
  secret: ''
  proxy_http:
  proxy_https:

Command Line:

| Option | Description | Default | Type |
|---|---|---|---|
| wayback_extractor_enricher.timeout | Optional. Seconds to wait for the Wayback Machine to confirm a successful archive. If this time is exceeded, the result contains the job_id so the status can be checked manually later. | 15 | int |
| wayback_extractor_enricher.if_not_archived_within | Optional. Only ask the Wayback Machine to archive the URL if no existing archive was captured within the specified number of seconds. Use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA | None | string |
| wayback_extractor_enricher.key | Required. Wayback API key. To get credentials, visit https://archive.org/account/s3.php | | string |
| wayback_extractor_enricher.secret | Required. Wayback API secret. To get credentials, visit https://archive.org/account/s3.php | | string |
| wayback_extractor_enricher.proxy_http | Optional. HTTP proxy to use for Wayback requests, e.g. http://proxy-user:password@proxy-ip:port | None | string |
| wayback_extractor_enricher.proxy_https | Optional. HTTPS proxy to use for Wayback requests, e.g. https://proxy-user:password@proxy-ip:port | None | string |
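
To show how the two proxy options might be combined, here is a small sketch that turns them into the scheme-keyed mapping accepted by the `proxies` argument of the Python `requests` library. Whether the module builds its proxy settings this way is an assumption, and `build_proxies` is a hypothetical helper name.

```python
from typing import Dict, Optional


def build_proxies(
    proxy_http: Optional[str] = None,
    proxy_https: Optional[str] = None,
) -> Dict[str, str]:
    """Map the two proxy options to a scheme-keyed dict (hypothetical helper).

    Unset or empty options are skipped, so an empty dict means
    "no proxy configured".
    """
    proxies: Dict[str, str] = {}
    if proxy_http:
        proxies["http"] = proxy_http
    if proxy_https:
        proxies["https"] = proxy_https
    return proxies
```

The result could then be passed along with a request, for example `requests.post(url, proxies=build_proxies(proxy_http, proxy_https))`.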

API Reference