
# Antibot Extractor/Enricher
```{admonition} Module type

<span style='color: #00FF00'>[extractor](/core_modules.md#extractor-modules)</span>, <span style='color: #0000FF'>[enricher](/core_modules.md#enricher-modules)</span>
```

Uses a SeleniumBase-controlled browser to capture the HTML, media, and screenshots/PDFs of a web page while bypassing anti-bot measures such as Cloudflare's Turnstile or Google reCAPTCHA.

> ⚠️ Still in trial development. Please report any issues or suggestions via [GitHub Issues](https://github.com/bellingcat/auto-archiver/issues).

### Features
- Extracts the HTML source code of the page.
- Takes full-page screenshots of web pages.
- Takes full-page PDF snapshots of web pages.
- Downloads images and videos from the page, excluding specified file extensions.

### Notes
- Using a proxy affects Cloudflare Turnstile captcha handling, so use a proxy only if necessary.

### Dropins
This module uses site-specific sub-modules called Dropins to handle anti-bot measures and custom login flows. You don't need to include the Dropins in your configuration, but you do need to add authentication credentials if you want to get past login walls on those sites. See the detailed instructions for each Dropin below.



##### Available Dropins

###### LinkedIn Dropin

Handles LinkedIn pages/posts. Most content requires authentication, but the Dropin is still useful without it. The first time you log in from a new IP, LinkedIn may require an email verification code. If you do a manual login first, it won't ask for the code again.

**Site**: linkedin.com

**YAML configuration**:
```{code} yaml
authentication:
...
  linkedin.com:
    username: "email address or phone number"
    password: "password"
...
```

###### Reddit Dropin

Handles Reddit posts and works without authentication until Reddit flags your IP, so authentication is advised.

**Site**: reddit.com

**YAML configuration**:
```{code} yaml
authentication:
...
  reddit.com:
    username: "email address or username"
    password: "password"
...
```

###### VKontakte Dropin

Handles VKontakte posts and works without authentication for some content.

**Site**: vk.com

**YAML configuration**:
```{code} yaml
authentication:
...
  vk.com:
    username: "phone number with country code"
    password: "password"
...
```

###### TikTok Dropin

Handles TikTok posts and works without authentication.
NOTE: This Dropin is highly susceptible to TikTok's bot detection mechanisms and may not work reliably if you reuse the same IP. The GenericExtractor is recommended for TikTok posts, as it handles video/image downloads more reliably. In the future we plan to implement better anti-captcha measures for this Dropin.

**Site**: tiktok.com
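
A steps configuration that follows the recommendation above might look like the sketch below. It assumes the GenericExtractor is registered under the module name `generic_extractor` and that extractors are tried in the order listed. Neither detail is stated on this page, so check both against your installation.

```{code} yaml
steps:
...
  extractors:
  # assumed module name for the GenericExtractor; listed first (assumed ordering)
  - generic_extractor
  # kept as a second option for pages the first extractor does not handle
  - antibot_extractor_enricher
...
```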


## Configuration Options

### YAML
```{code} yaml

# steps configuration
steps:
...
  extractors:
  - antibot_extractor_enricher
  enrichers:
  - antibot_extractor_enricher
...

# module configuration
...

antibot_extractor_enricher:
  save_to_pdf: false
  max_download_images: 50
  max_download_videos: 50
  user_data_dir: secrets/antibot_user_data
  detect_auth_wall: true
  proxy:
```
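
As an illustration of the options above, the following sketch enables PDF snapshots, removes the image limit, and skips video downloads. The values are examples only, chosen to show the `0` and `inf` settings documented in the option table below. `inf` is quoted here so that YAML keeps it as a string, matching the documented type; options not listed keep their defaults.

```{code} yaml
antibot_extractor_enricher:
  save_to_pdf: true            # also save a PDF snapshot of the page
  max_download_images: "inf"   # inf = no limit on downloaded images
  max_download_videos: 0       # 0 = do not download videos
```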

### Command Line:
| Option | Description | Default | Type |
| --- | --- | --- | --- |
| `antibot_extractor_enricher.save_to_pdf` | Optional. Save a PDF snapshot of the page. | False | bool |
| `antibot_extractor_enricher.max_download_images` | Optional. Maximum number of images to download from the page (0 = no download, inf = no limit). | 50 | string |
| `antibot_extractor_enricher.max_download_videos` | Optional. Maximum number of videos to download from the page (0 = no download, inf = no limit). | 50 | string |
| `antibot_extractor_enricher.user_data_dir` | Optional. Path to the user data directory for the webdriver. This is used to persist browser state, such as cookies and local storage. If you use the Docker deployment, `_docker` is appended to this path, because the folder cannot be shared between the host and the container due to user permissions. | secrets/antibot_user_data | string |
| `antibot_extractor_enricher.detect_auth_wall` | Optional. Detect whether the page is behind an authentication wall (e.g. login required) and skip it. Disable this if you want to archive pages that require a login. | True | bool |
| `antibot_extractor_enricher.proxy` | Optional. Proxy to use for the webdriver. Format: `SERVER:PORT` or `USER:PASS@SERVER:PORT`. | None | string |
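
A proxy entry in the YAML configuration could look like the sketch below. The host, port, and credentials are placeholders, not real values. The string follows the `USER:PASS@SERVER:PORT` format from the table; the `SERVER:PORT` form simply omits the credentials.

```{code} yaml
antibot_extractor_enricher:
  # placeholder values; replace with your own proxy details
  proxy: "myuser:mypass@proxy.example.com:8080"
```

Since a proxy affects Cloudflare Turnstile captcha handling, leave `proxy` unset unless you need it.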

[API Reference](../../../autoapi/antibot_extractor_enricher/index)
