
# WACZ Enricher (and Extractor)
```{admonition} Module type

<span style='color: #0000FF'>[enricher](/core_modules.md#enricher-modules)</span>, <span style='color: #00FF00'>[extractor](/core_modules.md#extractor-modules)</span>
```

Creates .WACZ archives of web pages using the `browsertrix-crawler` tool, with options for media extraction and screenshot saving.
[Browsertrix-crawler](https://crawler.docs.browsertrix.com/user-guide/) is a headless browser-based crawler that archives web pages in WACZ format.

## Setup

**Docker**
If you run Auto Archiver with Docker (recommended), everything is already set up and you can use WACZ out of the box!
Otherwise, if you are using a local install of Auto Archiver (e.g. pip or dev install), you will need to install Docker and have
the Docker daemon running so that the `browsertrix-crawler` tool can run.

**Browsertrix Profiles**
A Browsertrix profile is a custom browser profile (login information, browser extensions, etc.) that can be used to archive private or dynamic content.
You can run the WACZ Enricher without a profile, but for more resilient archiving, it is recommended to create a profile. See the [Browsertrix documentation](https://crawler.docs.browsertrix.com/user-guide/browser-profiles/)
for more information.
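
As a rough illustration, a profile might be referenced from the module configuration like this. The path `secrets/profile.tar.gz` is a hypothetical placeholder, and the sketch assumes the `profile` option takes the location of a profile file created as described in the Browsertrix documentation.

```{code} yaml
wacz_extractor_enricher:
  # hypothetical path: replace with the location of your own Browsertrix profile
  profile: secrets/profile.tar.gz
```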

**Docker in Docker**
If you are running Auto Archiver within a Docker container, you will need to enable Docker in Docker to run the `browsertrix-crawler` tool.
This can be done by setting the `WACZ_ENABLE_DOCKER` environment variable to `1`.
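
If the container is started through Docker Compose, the variable could be set roughly as follows. This is only a sketch: the service name and image name are assumptions, and any other settings your deployment needs are left out.

```{code} yaml
services:
  auto-archiver:                      # hypothetical service name
    image: bellingcat/auto-archiver   # assumed image name; use the one you actually run
    environment:
      - WACZ_ENABLE_DOCKER=1          # enables Docker in Docker for browsertrix-crawler
```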

## Features
- Archives web pages into .WACZ format using Docker or direct invocation of `browsertrix-crawler`.
- Supports custom profiles for archiving private or dynamic content.
- Extracts media (images, videos, audio) and screenshots from the archive, optionally adding them to the enrichment pipeline.
- Generates metadata from the archived page's content and structure (e.g., titles, text).



## Configuration Options

### YAML
```{code} yaml

# steps configuration
steps:
...
  enrichers:
  - wacz_extractor_enricher
  extractors:
  - wacz_extractor_enricher
...

# module configuration
...

wacz_extractor_enricher:
  profile:
  docker_commands:
  timeout: 120
  extract_media: false
  extract_screenshot: true
  socks_proxy_host:
  socks_proxy_port:
  proxy_server:
```

### Command Line
| Option | Description | Default | Type |
| --- | --- | --- | --- |
| `wacz_extractor_enricher.profile` | Optional. Browsertrix profile to use (for profile generation see https://crawler.docs.browsertrix.com/user-guide/browser-profiles/). | None | string |
| `wacz_extractor_enricher.docker_commands` | Optional. Custom Docker invocation, if one is needed. | None | string |
| `wacz_extractor_enricher.timeout` | Optional. Timeout for WACZ generation, in seconds. | 120 | int |
| `wacz_extractor_enricher.extract_media` | Optional. If enabled, all images, videos, and audio present in the WACZ archive are extracted into separate Media and appear in the HTML report. The .wacz file is left untouched. | False | bool |
| `wacz_extractor_enricher.extract_screenshot` | Optional. If enabled, the screenshot captured by Browsertrix is extracted into a separate Media and appears in the HTML report. The .wacz file is left untouched. | True | bool |
| `wacz_extractor_enricher.socks_proxy_host` | Optional. SOCKS proxy host for browsertrix-crawler; use in combination with `socks_proxy_port`. E.g. `user:password@host`. | None | string |
| `wacz_extractor_enricher.socks_proxy_port` | Optional. SOCKS proxy port for browsertrix-crawler; use in combination with `socks_proxy_host`. E.g. `1234`. | None | int |
| `wacz_extractor_enricher.proxy_server` | Optional. SOCKS server proxy URL (in development). | None | string |
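
For example, the two SOCKS options are meant to be set together. A minimal sketch using the placeholder values from the table above (`user:password@host` and `1234` are illustrative only, not working credentials):

```{code} yaml
wacz_extractor_enricher:
  timeout: 120
  # placeholder values: substitute your own proxy host and port
  socks_proxy_host: "user:password@host"
  socks_proxy_port: 1234
```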

[API Reference](../../../autoapi/wacz_extractor_enricher/index)
