Screenshot Enricher

Module type: enricher

Captures screenshots and optionally saves web pages as PDFs using a WebDriver.

Features

  • Takes screenshots of web pages, with configurable width, height, and timeout settings.

  • Optionally saves pages as PDFs, with additional configuration for PDF printing options.

  • Skips URLs detected as authentication walls.

  • Adds the resulting screenshots and PDFs as media in the metadata enrichment pipeline.

Notes

  • Requires a WebDriver (e.g., ChromeDriver) installed and accessible via the system’s PATH.
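
Before enabling the enricher, it can help to confirm that a driver binary is actually reachable on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_webdriver` and the `geckodriver` candidate are assumptions for illustration and are not part of this module.

```python
import shutil


def find_webdriver(candidates=("chromedriver", "geckodriver")):
    """Return (name, path) for the first driver binary found on PATH, else None.

    The candidate names are illustrative assumptions; adjust them to the
    driver you actually installed.
    """
    for name in candidates:
        path = shutil.which(name)  # None when the binary is not on PATH
        if path:
            return name, path
    return None


found = find_webdriver()
if found is None:
    print("No WebDriver found on PATH; install one before using the enricher.")
else:
    print(f"Found {found[0]} at {found[1]}")
```

If the check prints that no driver was found, installing the driver or adding its directory to PATH is the likely fix.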

Configuration Options

YAML

# steps configuration
steps:
...
  enrichers:
  - screenshot_enricher
...

# module configuration
...

screenshot_enricher:
  width: 1280
  height: 1024
  timeout: 60
  sleep_before_screenshot: 4
  http_proxy: ''
  save_to_pdf: false
  print_options: {}
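
To illustrate how the defaults above combine with user-supplied values, here is a small sketch that models the options as a dataclass. The class `ScreenshotSettings` and the function `merge_settings` are hypothetical names invented for this example; they do not describe how the module loads its configuration internally.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScreenshotSettings:
    # Defaults copied from the YAML block above.
    width: int = 1280
    height: int = 1024
    timeout: int = 60
    sleep_before_screenshot: int = 4
    http_proxy: str = ""
    save_to_pdf: bool = False
    print_options: dict = field(default_factory=dict)


def merge_settings(overrides: dict) -> ScreenshotSettings:
    """Apply user overrides on top of the defaults, rejecting unknown keys."""
    known = {f.name for f in fields(ScreenshotSettings)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown screenshot_enricher option(s): {sorted(unknown)}")
    return ScreenshotSettings(**overrides)


settings = merge_settings({"width": 1920, "save_to_pdf": True})
print(settings)
```

Only `width` and `save_to_pdf` are overridden here; every other option keeps its documented default.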

Command Line:

| Option | Description | Default | Type |
|---|---|---|---|
| screenshot_enricher.width | Optional. Width of the screenshots. | 1280 | int |
| screenshot_enricher.height | Optional. Height of the screenshots. | 1024 | int |
| screenshot_enricher.timeout | Optional. Timeout for taking the screenshot. | 60 | int |
| screenshot_enricher.sleep_before_screenshot | Optional. Seconds to wait for the page to load before taking the screenshot. | 4 | int |
| screenshot_enricher.http_proxy | Optional. HTTP proxy to use for the WebDriver, e.g. http://proxy-user:password@proxy-ip:port | '' | string |
| screenshot_enricher.save_to_pdf | Optional. Save the page as a PDF along with the screenshot. PDF saving options can be adjusted with the ‘print_options’ parameter. | False | bool |
| screenshot_enricher.print_options | Optional. Options to pass to the PDF printer, in JSON format. See https://www.selenium.dev/documentation/webdriver/interactions/print_page/ for more information. | {} | json_loader |
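
Because `print_options` is given in JSON format, the value has to be a valid JSON object. The sketch below shows one way such a string could be parsed and checked; `parse_print_options` is a hypothetical helper, and the keys `orientation` and `scale` are examples taken from the print settings described on the linked Selenium page. Whether the enricher accepts exactly these key names is an assumption.

```python
import json


def parse_print_options(raw: str) -> dict:
    """Parse a JSON string into a dict, treating an empty string as {}."""
    value = json.loads(raw) if raw.strip() else {}
    if not isinstance(value, dict):
        raise ValueError("print_options must be a JSON object")
    return value


# Example value; the key names are assumptions for illustration.
options = parse_print_options('{"orientation": "landscape", "scale": 0.8}')
print(options)
```

Validating the string up front surfaces quoting mistakes early, which are easy to make when passing JSON through a shell.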
