Antibot Extractor/Enricher#

Module type

extractor, enricher

Uses a browser controlled by SeleniumBase to capture the HTML, media, and screenshots/PDFs of a web page while bypassing anti-bot measures such as Cloudflare’s Turnstile or Google reCAPTCHA.

⚠️ Still in trial development. Please report any issues or suggestions via GitHub Issues.

Features#

  • Extracts the HTML source code of the page.

  • Takes full-page screenshots of web pages.

  • Takes full-page PDF snapshots of web pages.

  • Downloads images and videos from the page, excluding specified file extensions.
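The media download step can be pictured as a filter over the URLs found on the page. The sketch below is illustrative only: `parse_limit` and `select_media` are hypothetical helpers, not part of the module, and the excluded extensions are made-up values. It assumes the limit follows the `max_download_images` / `max_download_videos` convention (0 = no download, inf = no limit).

```python
import math
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_limit(value):
    """Turn a max_download_* style setting into a number; "inf" means no limit."""
    text = str(value).strip().lower()
    return math.inf if text == "inf" else int(text)


def select_media(urls, excluded_extensions, limit):
    """Keep URLs whose file extension is not excluded, up to `limit` items."""
    max_items = parse_limit(limit)
    excluded = {ext.lower() for ext in excluded_extensions}
    selected = []
    for url in urls:
        if len(selected) >= max_items:
            break
        # Inspect only the URL path so query strings do not hide the extension.
        extension = PurePosixPath(urlparse(url).path).suffix.lower()
        if extension in excluded:
            continue
        selected.append(url)
    return selected
```

For example, `select_media(urls, {".svg", ".gif"}, "inf")` keeps every URL that does not end in `.svg` or `.gif`, while a limit of `0` returns an empty list.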

Notes#

  • Using a proxy affects Cloudflare Turnstile captcha handling, so it is recommended to use a proxy only if necessary.
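If a proxy is necessary, the `proxy` option takes the form 'SERVER:PORT' or 'USER:PASS@SERVER:PORT'. The following is a minimal sketch of how such a string could be validated before use. `ProxySettings` and `parse_proxy` are hypothetical names introduced here for illustration; the module's own parsing may differ.

```python
from typing import NamedTuple, Optional


class ProxySettings(NamedTuple):
    server: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None


def parse_proxy(value: str) -> ProxySettings:
    """Split 'SERVER:PORT' or 'USER:PASS@SERVER:PORT' into its parts."""
    # Credentials, if present, come before the last "@".
    credentials, _, address = value.strip().rpartition("@")
    server, _, port = address.rpartition(":")
    if not server or not port.isdigit():
        raise ValueError(f"expected SERVER:PORT, got {address!r}")
    username = password = None
    if credentials:
        username, _, password = credentials.partition(":")
    return ProxySettings(server, int(port), username, password)
```

Checking the string up front makes a malformed value fail early rather than surfacing later as a browser connection error.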

Dropins#

This module uses site-specific sub-modules called Dropins to handle anti-bot measures and custom login flows. You don’t need to include the Dropins in your configuration, but you do need to add authentication credentials if you want to get past login walls on those sites. See the detailed instructions for each Dropin below.

Available Dropins#

TikTok Dropin#

Handles TikTok posts and works without authentication. NOTE: This dropin is highly susceptible to TikTok’s bot detection mechanisms and may not work reliably if you reuse the same IP. The GenericExtractor is recommended for TikTok posts, as it handles video/image download more reliably. In the future we plan to implement better anti-captcha measures for this dropin.

Site: tiktok.com

VKontakte Dropin#

Handles VKontakte posts and works without authentication for some content.

Site: vk.com

YAML configuration:

authentication:
...
  vk.com:
    username: "phone number with country code"
    password: "password"
...

LinkedIn Dropin#

Handles LinkedIn pages/posts. Authentication is required to access most content, but the dropin is still useful without it. The first time you log in from a new IP, LinkedIn may require an email verification code. If you do a manual login first, it won’t ask for the code again.

Site: linkedin.com

YAML configuration:

authentication:
...
  linkedin.com:
    username: "email address or phone number"
    password: "password"
...

Reddit Dropin#

Handles Reddit posts and works without authentication until Reddit flags your IP, so authentication is advised.

Site: reddit.com

YAML configuration:

authentication:
...
  reddit.com:
    username: "email address or username"
    password: "password"
...
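The per-site `authentication` entries are keyed by domain. As a rough sketch of how such a mapping could be consulted for a given URL, consider the lookup below. `credentials_for` and its matching rule (exact host or any subdomain) are assumptions made for illustration and do not describe the module's actual implementation. All credential values are placeholders.

```python
from urllib.parse import urlparse

# Mirrors the shape of the `authentication` YAML section; values are placeholders.
AUTHENTICATION = {
    "vk.com": {"username": "+10000000000", "password": "password"},
    "linkedin.com": {"username": "user@example.com", "password": "password"},
    "reddit.com": {"username": "example_user", "password": "password"},
}


def credentials_for(url, authentication):
    """Return the credentials whose site key matches the URL's host, else None."""
    host = (urlparse(url).hostname or "").lower()
    for site, credentials in authentication.items():
        # Match the bare domain and its subdomains (www.reddit.com, m.vk.com).
        if host == site or host.endswith("." + site):
            return credentials
    return None
```

A URL on a site with no entry, such as a TikTok post, yields `None`, which corresponds to running that Dropin without authentication.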

Configuration Options#

YAML#

# steps configuration
steps:
...
  extractors:
  - antibot_extractor_enricher
  enrichers:
  - antibot_extractor_enricher
...

# module configuration
...

antibot_extractor_enricher:
  save_to_pdf: false
  max_download_images: 50
  max_download_videos: 50
  user_data_dir: secrets/antibot_user_data
  detect_auth_wall: true
  proxy:

Command Line:#

| Option | Description | Default | Type |
| --- | --- | --- | --- |
| antibot_extractor_enricher.save_to_pdf | Optional. Save a PDF snapshot of the page. | False | bool |
| antibot_extractor_enricher.max_download_images | Optional. Maximum number of images to download from the page (0 = no download, inf = no limit). | 50 | string |
| antibot_extractor_enricher.max_download_videos | Optional. Maximum number of videos to download from the page (0 = no download, inf = no limit). | 50 | string |
| antibot_extractor_enricher.user_data_dir | Optional. Path to the user data directory for the webdriver, used to persist browser state such as cookies and local storage. In the Docker deployment, _docker is appended to this path because the folder cannot be shared between the host and the container due to user permissions. | secrets/antibot_user_data | string |
| antibot_extractor_enricher.detect_auth_wall | Optional. Detect whether the page is behind an authentication wall (e.g. login required) and skip it. Disable this if you want to archive pages where logins are required. | True | bool |
| antibot_extractor_enricher.proxy | Optional. Proxy to use for the webdriver. Format: 'SERVER:PORT' or 'USER:PASS@SERVER:PORT'. | None | string |
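The `user_data_dir` behaviour in the Docker deployment can be summarised in a few lines. This is only a sketch of the documented effect: `effective_user_data_dir` is a hypothetical helper, and how the module actually detects that it runs inside a container is not covered here, so the caller passes that in as a flag.

```python
def effective_user_data_dir(configured: str, in_docker: bool) -> str:
    """Return the browser profile path, with `_docker` appended inside a container."""
    # Drop trailing separators so the suffix attaches to the folder name itself.
    normalized = configured.rstrip("/\\")
    return f"{normalized}_docker" if in_docker else normalized
```

With the default setting, the host run uses `secrets/antibot_user_data` and the container run uses `secrets/antibot_user_data_docker`, so each environment keeps its own cookies and local storage.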
