WACZ Enricher
Creates .WACZ archives of web pages using the browsertrix-crawler tool, with options for media extraction and screenshot saving.
Browsertrix-crawler is a headless browser-based crawler that archives web pages in WACZ format.
Features
Archives web pages into .WACZ format using Docker or direct invocation of browsertrix-crawler.
Supports custom profiles for archiving private or dynamic content.
Extracts media (images, videos, audio) and screenshots from the archive, optionally adding them to the enrichment pipeline.
Generates metadata from the archived page’s content and structure (e.g., titles, text).
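As a rough illustration of the metadata step: a WACZ file is a ZIP package that, per the WACZ specification, carries a `pages/pages.jsonl` index with one JSON object per archived page. The sketch below reads that index using only the Python standard library. The function name `read_wacz_pages` is hypothetical, and this is not necessarily how the enricher itself derives titles and text.

```python
import json
import zipfile


def read_wacz_pages(wacz_path):
    """Return the page entries listed in pages/pages.jsonl of a WACZ file.

    `wacz_path` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    pages = []
    with zipfile.ZipFile(wacz_path) as wacz:
        with wacz.open("pages/pages.jsonl") as fh:
            for raw_line in fh:
                line = raw_line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                # The first line is a header record without a "url" key; skip it.
                if "url" in entry:
                    pages.append(entry)
    return pages
```

Each returned entry is a plain dictionary, so fields such as `title` or `text` can be read with `entry.get("title")` when the crawler recorded them.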
Notes
Requires Docker for running browsertrix-crawler.
Configurable via parameters for timeout, media extraction, screenshots, and proxy settings.
Configuration Options
YAML
wacz_enricher:
  profile:
  docker_commands:
  timeout: 120
  extract_media: false
  extract_screenshot: true
  socks_proxy_host:
  socks_proxy_port:
  proxy_server:
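To make the defaults above concrete, here is a minimal Python sketch that mirrors the options in a dataclass and checks that the two SOCKS settings are supplied together. The class `WaczEnricherConfig`, the helper `socks_proxy_url`, and the `socks5://` scheme are all assumptions made for illustration; they are not taken from the enricher's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaczEnricherConfig:
    """Hypothetical mirror of the wacz_enricher options and their defaults."""
    profile: Optional[str] = None
    docker_commands: Optional[str] = None
    timeout: int = 120
    extract_media: bool = False
    extract_screenshot: bool = True
    socks_proxy_host: Optional[str] = None
    socks_proxy_port: Optional[int] = None
    proxy_server: Optional[str] = None


def socks_proxy_url(cfg):
    """Combine host and port into one URL, or return None if no SOCKS proxy is set."""
    if cfg.socks_proxy_host is None and cfg.socks_proxy_port is None:
        return None
    if cfg.socks_proxy_host is None or cfg.socks_proxy_port is None:
        raise ValueError("socks_proxy_host and socks_proxy_port must be set together")
    # The socks5:// scheme is an assumption made for this sketch.
    return f"socks5://{cfg.socks_proxy_host}:{cfg.socks_proxy_port}"
```

For example, `socks_proxy_url(WaczEnricherConfig(socks_proxy_host="user:password@host", socks_proxy_port=1234))` yields a single proxy URL, while setting only one of the two values raises a `ValueError`.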
Command Line:
| Option | Description | Default | Type |
|---|---|---|---|
| `profile` | Optional. browsertrix-profile (for profile generation see webrecorder/browsertrix-crawler). | None | string |
| `docker_commands` | Optional. Custom Docker invocation, if one is needed. | None | string |
| `timeout` | Optional. Timeout for WACZ generation, in seconds. | 120 | string |
| `extract_media` | Optional. If enabled, all images, videos, and audio present in the WACZ archive are extracted into separate Media and appear in the HTML report. The .wacz file is left untouched. | False | string |
| `extract_screenshot` | Optional. If enabled, the screenshot captured by browsertrix is extracted into separate Media and appears in the HTML report. The .wacz file is left untouched. | True | string |
| `socks_proxy_host` | Optional. SOCKS proxy host for browsertrix-crawler; use in combination with socks_proxy_port. E.g. user:password@host | None | string |
| `socks_proxy_port` | Optional. SOCKS proxy port for browsertrix-crawler; use in combination with socks_proxy_host. E.g. 1234 | None | string |
| `proxy_server` | Optional. SOCKS server proxy URL (in development). | None | string |
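The sketch below shows how options such as the timeout and profile could translate into a browsertrix-crawler invocation, either through Docker or directly. It uses crawler flags documented by browsertrix-crawler (`--url`, `--scopeType`, `--generateWACZ`, `--collection`, `--timeout`, `--profile`) and the `webrecorder/browsertrix-crawler` image. The function name, the `crawls_dir` default, and the exact set of flags are assumptions, so the command the enricher really runs may differ.

```python
def build_crawl_command(url, collection, timeout=120, profile=None,
                        use_docker=True, crawls_dir="/tmp/crawls"):
    """Build an argument list for a single-page crawl that produces a WACZ.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    crawl_args = [
        "crawl",
        "--url", url,
        "--scopeType", "page",      # archive only the given page
        "--generateWACZ",           # package the result as a .wacz file
        "--collection", collection,
        "--timeout", str(timeout),
    ]
    if profile:
        # Path to a browser profile archive, as seen by the crawler.
        crawl_args += ["--profile", profile]
    if use_docker:
        docker_prefix = [
            "docker", "run", "--rm",
            "-v", f"{crawls_dir}:/crawls/",   # host directory that receives the output
            "webrecorder/browsertrix-crawler",
        ]
        return docker_prefix + crawl_args
    return crawl_args
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with URLs.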