Wayback Machine Enricher (and Extractor)#
Submits the current URL to the Wayback Machine for archiving and returns either a job ID or the completed archive URL.
Features#
Archives URLs using the Internet Archive’s Wayback Machine API.
Supports conditional archiving based on the existence of prior archives within a specified time range.
Supports routing Wayback Machine requests through HTTP and HTTPS proxies.
Fetches and confirms the archive URL or provides a job ID for later status checks.
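As a rough illustration of the features above, the sketch below assembles the parts of an archive request: the authorization header, the form fields (including the conditional-archiving field), and the proxy mapping. The function name `build_save_request` and its return shape are hypothetical and not taken from this module. The endpoint, the `LOW key:secret` header format, and the `if_not_archived_within` field name are assumptions based on the Internet Archive's Save Page Now API, so verify them against the current API documentation.

```python
from typing import Optional

# Assumed Save Page Now endpoint; verify against the Internet Archive docs.
SAVE_ENDPOINT = "https://web.archive.org/save"


def build_save_request(
    url: str,
    key: str,
    secret: str,
    if_not_archived_within: Optional[str] = None,
    proxy_http: Optional[str] = None,
    proxy_https: Optional[str] = None,
) -> dict:
    """Assemble (but do not send) the parts of an archive request."""
    if not key or not secret:
        raise ValueError("both key and secret are required")

    headers = {
        "Accept": "application/json",
        # Assumed header format for Internet Archive S3-style credentials.
        "Authorization": f"LOW {key}:{secret}",
    }

    data = {"url": url}
    # Add the conditional-archiving field only when the option is set.
    if if_not_archived_within:
        data["if_not_archived_within"] = if_not_archived_within

    # Include only the proxies that were actually configured.
    proxies = {}
    if proxy_http:
        proxies["http"] = proxy_http
    if proxy_https:
        proxies["https"] = proxy_https

    return {
        "endpoint": SAVE_ENDPOINT,
        "headers": headers,
        "data": data,
        "proxies": proxies,
    }
```

The returned dictionary maps onto a typical HTTP client call. With the `requests` library, for instance, it could be passed as `requests.post(req["endpoint"], headers=req["headers"], data=req["data"], proxies=req["proxies"])`.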
Notes#
Requires a valid Wayback Machine API key and secret.
Handles rate limiting by the Wayback Machine and retries status checks with exponential backoff.
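The retry behaviour mentioned above can be sketched as a polling loop whose delay doubles after each unsuccessful check. This is a minimal sketch, not the module's actual code: the function name `wait_for_archive`, the `check_status` callback, and the 1-second starting delay are all assumptions made for illustration. When the timeout passes without confirmation, the function returns `None`, which is the point where a caller would fall back to reporting the job ID.

```python
import time
from typing import Callable, Optional


def wait_for_archive(
    check_status: Callable[[], Optional[str]],
    timeout: float = 15,
    base_delay: float = 1.0,  # assumed starting delay, in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll check_status with doubling delays until it returns an
    archive URL or the timeout passes."""
    deadline = clock() + timeout
    delay = base_delay
    while True:
        archive_url = check_status()
        if archive_url is not None:
            return archive_url
        remaining = deadline - clock()
        if remaining <= 0:
            # Not confirmed in time: the caller reports the job ID instead.
            return None
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay *= 2  # exponential backoff
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise in tests without real waiting.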
Steps to Get a Wayback API Key:#
Sign up for an account at Internet Archive.
Log in to your account.
Navigate to your account settings.
Alternatively, follow the guide at https://archive.org/developers/tutorial-get-ia-credentials.html
Under Wayback Machine API Keys, generate a new key.
Note down your API key and secret, as they will be required for authentication.
Configuration Options#
YAML#
# steps configuration
steps:
  ...
  enrichers:
  - wayback_extractor_enricher
  extractors:
  - wayback_extractor_enricher
  ...
# module configuration
...
wayback_extractor_enricher:
  timeout: 15
  if_not_archived_within:
  key: ''
  secret: ''
  proxy_http:
  proxy_https:
Command Line:#
| Option | Description | Default | Type |
|---|---|---|---|
| `timeout` | Optional. Seconds to wait for Wayback to confirm a successful archive. If this time is exceeded, the result contains the job_id so the status can be checked manually later. | 15 | int |
| `if_not_archived_within` | Optional. Only ask Wayback to archive the URL if no archive exists from within the specified number of seconds; use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA | None | string |
| `key` | Required. Wayback API key. To get credentials, visit https://archive.org/account/s3.php | | string |
| `secret` | Required. Wayback API secret. To get credentials, visit https://archive.org/account/s3.php | | string |
| `proxy_http` | Optional. HTTP proxy to use for Wayback requests, e.g. http://proxy-user:password@proxy-ip:port | None | string |
| `proxy_https` | Optional. HTTPS proxy to use for Wayback requests, e.g. https://proxy-user:password@proxy-ip:port | None | string |
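To make the defaults and required options above concrete, the sketch below merges user-supplied settings over the documented defaults and rejects a configuration that lacks the key or secret. The names `DEFAULTS`, `REQUIRED`, and `resolve_options` are hypothetical; the module's real configuration loading may work quite differently.

```python
# Documented defaults for each option (names match the YAML keys).
DEFAULTS = {
    "timeout": 15,
    "if_not_archived_within": None,
    "key": "",
    "secret": "",
    "proxy_http": None,
    "proxy_https": None,
}
REQUIRED = ("key", "secret")


def resolve_options(user_config: dict) -> dict:
    """Merge user settings over the defaults and validate them."""
    unknown = set(user_config) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")

    options = {**DEFAULTS, **user_config}

    # Required options must be present and non-empty.
    missing = [name for name in REQUIRED if not options[name]]
    if missing:
        raise ValueError(f"missing required option(s): {missing}")

    # timeout is documented as an int.
    options["timeout"] = int(options["timeout"])
    return options
```

For example, `resolve_options({"key": "k", "secret": "s"})` keeps the 15-second timeout and leaves both proxies unset, while omitting `secret` raises a `ValueError`.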