WACZ Enricher (and Extractor)

Module type: enricher, extractor

Creates .WACZ archives of web pages using the browsertrix-crawler tool, with options for media extraction and screenshot saving. Browsertrix-crawler is a headless browser-based crawler that archives web pages in WACZ format.

Features

  • Archives web pages into .WACZ format using Docker or direct invocation of browsertrix-crawler.

  • Supports custom profiles for archiving private or dynamic content.

  • Extracts media (images, videos, audio) and screenshots from the archive, optionally adding them to the enrichment pipeline.

  • Generates metadata from the archived page’s content and structure (e.g., titles, text).

Notes

  • Requires Docker to run browsertrix-crawler.

  • Configurable via parameters for timeout, media extraction, screenshots, and proxy settings.

Configuration Options

YAML

# steps configuration
steps:
...
  enrichers:
  - wacz_extractor_enricher
  extractors:
  - wacz_extractor_enricher
...

# module configuration
...

wacz_extractor_enricher:
  profile:
  docker_commands:
  timeout: 120
  extract_media: false
  extract_screenshot: true
  socks_proxy_host:
  socks_proxy_port:
  proxy_server:

Command Line:

| Option | Description | Default | Type |
| --- | --- | --- | --- |
| wacz_extractor_enricher.profile | Optional. Browsertrix profile to use (for profile generation, see webrecorder/browsertrix-crawler). | None | string |
| wacz_extractor_enricher.docker_commands | Optional. Custom Docker invocation, if one is needed. | None | string |
| wacz_extractor_enricher.timeout | Optional. Timeout for WACZ generation, in seconds. | 120 | int |
| wacz_extractor_enricher.extract_media | Optional. If enabled, all images/videos/audio present in the WACZ archive are extracted as separate Media and appear in the HTML report. The .wacz file is left untouched. | False | bool |
| wacz_extractor_enricher.extract_screenshot | Optional. If enabled, the screenshot captured by browsertrix is extracted as separate Media and appears in the HTML report. The .wacz file is left untouched. | True | bool |
| wacz_extractor_enricher.socks_proxy_host | Optional. SOCKS proxy host for browsertrix-crawler; use in combination with socks_proxy_port. E.g. user:password@host | None | string |
| wacz_extractor_enricher.socks_proxy_port | Optional. SOCKS proxy port for browsertrix-crawler; use in combination with socks_proxy_host. E.g. 1234 | None | int |
| wacz_extractor_enricher.proxy_server | Optional. SOCKS server proxy URL (in development). | None | string |
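
As a rough illustration of how these options combine, the module configuration below raises the timeout, enables media extraction, and routes the crawler through a SOCKS proxy. It is a sketch only: the option names come from the table above, but every value (the 300-second timeout, the proxy host, credentials, and port) is a made-up placeholder, not a recommended setting.

```yaml
# Illustrative values only; the host, credentials, and port are placeholders.
wacz_extractor_enricher:
  timeout: 300              # allow up to 5 minutes for WACZ generation
  extract_media: true       # also extract images/videos/audio as separate Media
  extract_screenshot: true  # the default; screenshot appears in the HTML report
  socks_proxy_host: "user:password@proxy.example.com"  # hypothetical host
  socks_proxy_port: 1080    # hypothetical port; set together with socks_proxy_host
```

Options left out of the block (profile, docker_commands, proxy_server) keep their defaults of None.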

API Reference