WACZ Enricher (and Extractor)#

Creates .WACZ archives of web pages using the browsertrix-crawler tool, with options for media extraction and screenshot saving. Browsertrix-crawler is a headless browser-based crawler that archives web pages in WACZ format.

Features#

Archives web pages into .WACZ format using Docker or direct invocation of browsertrix-crawler.
Supports custom profiles for archiving private or dynamic content.
Extracts media (images, videos, audio) and screenshots from the archive, optionally adding them to the enrichment pipeline.
Generates metadata from the archived page’s content and structure (e.g., titles, text).
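
To make the extraction features above more concrete, here is a minimal sketch of how the contents of a .wacz file can be inspected. It relies only on the fact that a WACZ file is a ZIP package which typically holds WARC data under `archive/` and a page index in `pages/pages.jsonl`. The function names and the in-memory demo archive are hypothetical and purely illustrative; this is not the module's own implementation.

```python
import io
import json
import zipfile


def build_demo_wacz() -> bytes:
    """Build a tiny in-memory stand-in for a .wacz file (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
        zf.writestr("archive/data.warc.gz", b"")  # placeholder, not a real WARC
        header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
        page = {"id": "1", "url": "https://example.com/", "title": "Example Domain"}
        zf.writestr(
            "pages/pages.jsonl",
            json.dumps(header) + "\n" + json.dumps(page) + "\n",
        )
    return buf.getvalue()


def list_wacz_entries(wacz_bytes: bytes) -> dict:
    """Group the entries of a WACZ (a ZIP file) by their top-level folder."""
    groups = {}
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        for name in zf.namelist():
            # Files at the root of the package are grouped under ".".
            top = name.split("/", 1)[0] if "/" in name else "."
            groups.setdefault(top, []).append(name)
    return groups


def read_page_titles(wacz_bytes: bytes) -> list:
    """Read page titles from pages/pages.jsonl, skipping the header line."""
    titles = []
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        with zf.open("pages/pages.jsonl") as fh:
            for raw in fh:
                line = raw.strip()
                if not line:
                    continue
                record = json.loads(line)
                # The first line is a format header and carries no "url" key.
                if "url" in record:
                    titles.append(record.get("title", ""))
    return titles


if __name__ == "__main__":
    demo = build_demo_wacz()
    print(list_wacz_entries(demo))
    print(read_page_titles(demo))
```

To try this on a real archive, read the .wacz file from disk as bytes and pass it to either function.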

Setup#

Using Docker#

If you run Auto Archiver from the Auto Archiver Docker image (recommended), everything is already set up and you can use WACZ out of the box. If you are using a local install of Auto Archiver instead (e.g. a pip or dev install), you will need to install Docker and have the Docker daemon running so that the browsertrix-crawler tool can be launched.

Browsertrix Profiles#

A browsertrix profile is a custom browser profile (login information, browser extensions, etc.) that can be used to archive private or dynamic content. You can run the WACZ Enricher without a profile, but for more resilient archiving, it is recommended to create a profile. See the Browsertrix documentation for more information on how to use the create-login-profile tool.

Docker in Docker#

If you are running Auto Archiver within a Docker container, you will need to enable Docker in Docker to run the browsertrix-crawler tool. This can be done by setting the WACZ_ENABLE_DOCKER environment variable to 1.
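
Before starting a long archiving run inside a container, it can help to check that the variable is actually set and that a `docker` executable is reachable. The sketch below is a hypothetical preflight helper, not part of Auto Archiver; it assumes only the `WACZ_ENABLE_DOCKER` variable named above, and the check for `docker` on the PATH is an added assumption about what a working setup needs.

```python
import os
import shutil


def docker_in_docker_ready(env=None, which=shutil.which):
    """Return (ok, reason) for a basic Docker-in-Docker sanity check.

    `env` defaults to the process environment; `which` is injectable so the
    check can be exercised without Docker being installed.
    """
    env = os.environ if env is None else env
    if env.get("WACZ_ENABLE_DOCKER") != "1":
        return False, "WACZ_ENABLE_DOCKER is not set to 1"
    if which("docker") is None:
        return False, "docker executable not found on PATH"
    return True, "ok"


if __name__ == "__main__":
    ok, reason = docker_in_docker_ready()
    print("ready" if ok else f"not ready: {reason}")
```

The helper only inspects the environment; it does not start or contact the Docker daemon.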

Configuration Options#

YAML#

# steps configuration
steps:
...
  enrichers:
  - wacz_extractor_enricher
  extractors:
  - wacz_extractor_enricher
...

# module configuration
...

wacz_extractor_enricher:
  profile:
  docker_commands:
  timeout: 120
  extract_media: false
  extract_screenshot: true
  socks_proxy_host:
  socks_proxy_port:
  proxy_server:

Command Line:#

Option	Description	Default	Type
`wacz_extractor_enricher.profile`	Optional. Browsertrix profile to use (for profile generation see https://crawler.docs.browsertrix.com/user-guide/browser-profiles/).	None	string
`wacz_extractor_enricher.docker_commands`	Optional. Custom Docker invocation, if one is needed.	None	string
`wacz_extractor_enricher.timeout`	Optional. Timeout for WACZ generation, in seconds.	120	int
`wacz_extractor_enricher.extract_media`	Optional. If enabled, all images, videos, and audio present in the WACZ archive are extracted as separate Media and appear in the HTML report. The .wacz file itself is left untouched.	False	bool
`wacz_extractor_enricher.extract_screenshot`	Optional. If enabled, the screenshot captured by browsertrix is extracted as separate Media and appears in the HTML report. The .wacz file itself is left untouched.	True	bool
`wacz_extractor_enricher.socks_proxy_host`	Optional. SOCKS proxy host for browsertrix-crawler; use in combination with socks_proxy_port. E.g. user:password@host	None	string
`wacz_extractor_enricher.socks_proxy_port`	Optional. SOCKS proxy port for browsertrix-crawler; use in combination with socks_proxy_host. E.g. 1234	None	int
`wacz_extractor_enricher.proxy_server`	Optional. SOCKS server proxy URL (in development).	None	string
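
Some of the options above only make sense in combination (for example the SOCKS host and port). The following sketch shows one way such constraints could be checked before a run. `validate_wacz_options` is a hypothetical helper, not an Auto Archiver function, and the port range check is an added assumption rather than something the table states.

```python
def validate_wacz_options(options: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []

    # timeout: positive integer number of seconds (default taken from the table).
    timeout = options.get("timeout", 120)
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        problems.append("timeout must be a positive integer number of seconds")

    # The SOCKS host and port are documented as being used together.
    host = options.get("socks_proxy_host")
    port = options.get("socks_proxy_port")
    if bool(host) != (port is not None):
        problems.append("socks_proxy_host and socks_proxy_port must be set together")
    if port is not None and (
        isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535
    ):
        problems.append("socks_proxy_port must be an integer between 1 and 65535")

    # The two extraction switches are booleans.
    for key in ("extract_media", "extract_screenshot"):
        if key in options and not isinstance(options[key], bool):
            problems.append(f"{key} must be true or false")

    return problems


if __name__ == "__main__":
    # Illustrative values only.
    print(validate_wacz_options({"timeout": 120, "socks_proxy_host": "user:password@host"}))
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfiguration at once.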
