Antibot Extractor/Enricher#
Uses a browser controlled by SeleniumBase to capture the HTML, media, and screenshots or PDFs of a web page while bypassing anti-bot measures such as Cloudflare’s Turnstile or Google reCAPTCHA.
⚠️ Still in trial development. Please report any issues or suggestions via GitHub Issues.
Features#
Extracts the HTML source code of the page.
Takes full-page screenshots of web pages.
Takes full-page PDF snapshots of web pages.
Downloads images and videos from the page, excluding specified file extensions.
Notes#
Using a proxy affects Cloudflare Turnstile captcha handling, so it is recommended to use a proxy only if necessary.
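If you do need a proxy, the module's proxy option is documented as taking either ‘SERVER:PORT’ or ‘USER:PASS@SERVER:PORT’. The following is a minimal sketch of how such a string could be checked before it goes into your configuration. The names `ProxySettings` and `parse_proxy` are hypothetical and are not part of the module; the module's own parsing may differ.

```python
from typing import NamedTuple, Optional


class ProxySettings(NamedTuple):
    """Hypothetical container for the parts of a proxy string."""
    server: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None


def parse_proxy(value: str) -> ProxySettings:
    """Split 'SERVER:PORT' or 'USER:PASS@SERVER:PORT' into its parts."""
    # Split on the last '@' so a password containing '@' stays intact.
    credentials, at, address = value.rpartition("@")
    server, colon, port = address.rpartition(":")
    if not colon or not server or not port.isdigit():
        raise ValueError(f"expected SERVER:PORT, got {address!r}")
    if not at:
        return ProxySettings(server, int(port))
    username, colon, password = credentials.partition(":")
    if not colon or not username:
        raise ValueError("expected USER:PASS before '@'")
    return ProxySettings(server, int(port), username, password)
```

For example, `parse_proxy("user:secret@10.0.0.1:3128")` yields the server, port, username, and password as separate fields, while a string without a port raises `ValueError`.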
Dropins#
This module uses sub-modules called Dropins to handle anti-bot measures and custom login flows on specific sites. You don’t need to list the Dropins in your configuration, but you do need to add authentication credentials if you want to get past login walls on those sites. See the instructions for each Dropin below.
Available Dropins#
TikTok Dropin#
Handles TikTok posts and works without authentication. NOTE: This Dropin is highly susceptible to TikTok’s bot detection and may not work reliably if you reuse the same IP address. The GenericExtractor is recommended for TikTok posts, as it downloads videos and images more reliably. Better anti-captcha measures for this Dropin are planned.
Site: tiktok.com
VKontakte Dropin#
Handles VKontakte posts and works without authentication for some content.
Site: vk.com
YAML configuration:
authentication:
  ...
  vk.com:
    username: "phone number with country code"
    password: "password"
  ...
LinkedIn Dropin#
Handles LinkedIn pages and posts. Most content requires authentication, but the Dropin is still useful without it. The first time you log in from a new IP address, LinkedIn may ask for an email verification code; if you log in manually once first, it will not ask again.
Site: linkedin.com
YAML configuration:
authentication:
  ...
  linkedin.com:
    username: "email address or phone number"
    password: "password"
  ...
Reddit Dropin#
Handles Reddit posts and works without authentication until Reddit flags your IP, so authentication is advised.
Site: reddit.com
YAML configuration:
authentication:
  ...
  reddit.com:
    username: "email address or username"
    password: "password"
  ...
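The authentication entries for the Dropins above are all keyed by site domain. As an illustration only, the sketch below shows one way a URL could be matched to such an entry, including subdomains such as `www.reddit.com`. The function `credentials_for` and the sample values are hypothetical; how the module itself matches sites to credentials is not described here and may differ.

```python
from typing import Optional
from urllib.parse import urlsplit

# Same shape as the `authentication` YAML section; all values are placeholders.
AUTHENTICATION = {
    "vk.com": {"username": "+10000000000", "password": "password"},
    "linkedin.com": {"username": "user@example.com", "password": "password"},
    "reddit.com": {"username": "example_user", "password": "password"},
}


def credentials_for(url: str, authentication: dict) -> Optional[dict]:
    """Return the entry whose site key is the URL's host or a parent domain of it."""
    host = (urlsplit(url).hostname or "").lower()
    for site, credentials in authentication.items():
        # Require an exact match or a dot boundary, so "notreddit.com"
        # does not match "reddit.com".
        if host == site or host.endswith("." + site):
            return credentials
    return None
```

With this mapping, a TikTok URL returns `None`, which matches the note that the TikTok Dropin runs without authentication.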
Configuration Options#
YAML#
# steps configuration
steps:
  ...
  extractors:
  - antibot_extractor_enricher
  enrichers:
  - antibot_extractor_enricher
  ...
# module configuration
...
antibot_extractor_enricher:
  save_to_pdf: false
  max_download_images: 50
  max_download_videos: 50
  user_data_dir: secrets/antibot_user_data
  detect_auth_wall: true
  proxy:
Command Line:#
| Option | Description | Default | Type |
|---|---|---|---|
| --antibot_extractor_enricher.save_to_pdf | Optional. Save a PDF snapshot of the page. | False | bool |
| --antibot_extractor_enricher.max_download_images | Optional. Maximum number of images to download from the page (0 = no download, inf = no limit). | 50 | string |
| --antibot_extractor_enricher.max_download_videos | Optional. Maximum number of videos to download from the page (0 = no download, inf = no limit). | 50 | string |
| --antibot_extractor_enricher.user_data_dir | Optional. Path to the user data directory for the webdriver. This is used to persist browser state, such as cookies and local storage. If you use the Docker deployment, this path will be appended with | secrets/antibot_user_data | string |
| --antibot_extractor_enricher.detect_auth_wall | Optional. Detect whether the page is behind an authentication wall (e.g. login required) and skip it. Disable this if you want to archive pages that require a login. | True | bool |
| --antibot_extractor_enricher.proxy | Optional. Proxy to use for the webdriver. Format: ‘SERVER:PORT’ or ‘USER:PASS@SERVER:PORT’. | None | string |
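The image and video limits are typed as strings because they accept `inf` as well as a number. The sketch below shows one plausible reading of the documented semantics (0 = no download, inf = no limit). The helpers `parse_download_limit` and `select_media` are hypothetical names used for illustration, not functions from the module.

```python
import math


def parse_download_limit(value: str) -> float:
    """Turn a limit such as '50', '0' or 'inf' into a number; math.inf means no limit."""
    text = str(value).strip().lower()
    if text == "inf":
        return math.inf
    limit = int(text)  # raises ValueError for anything that is not an integer
    if limit < 0:
        raise ValueError("limit must be 0 or greater")
    return limit


def select_media(urls: list, limit: float) -> list:
    """Keep at most `limit` URLs, preserving the order found on the page."""
    if limit == math.inf:
        return list(urls)
    return list(urls[: int(limit)])
```

Under this reading, a limit of `"0"` selects nothing, `"inf"` selects every URL, and any other non-negative integer caps the list at that many items.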