
# Generic Extractor
```{admonition} Module type

<span style='color: #00FF00'>[extractor](/core_modules.md#extractor-modules)</span>
```

This is the generic extractor used by auto-archiver, which uses `yt-dlp` under the hood.

This module is responsible for downloading and processing media content from platforms
supported by `yt-dlp`, such as YouTube, Facebook, and others. It provides functionality
for retrieving videos, subtitles, comments, and other metadata, and it integrates with
the broader archiving framework.

For a full list of video platforms supported by `yt-dlp`, see the
[official documentation](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md).

### Features
- Supports downloading videos and playlists.
- Retrieves metadata like titles, descriptions, upload dates, and durations.
- Downloads subtitles and comments when enabled.
- Configurable options for handling live streams, proxies, and more.
- Supports authenticating with websites using the `authentication` settings from your orchestration configuration.

### Dropins
- For websites supported by `yt-dlp` that contain posts as well as videos
  (e.g. Facebook, Twitter, Bluesky), dropins can be created to extract post data and create
  metadata objects. Some dropins are included in this module by default, but
  custom dropins can be created to handle additional websites and passed to the archiver
  via the command line using the `--dropins` option (TODO!).

You can see all currently implemented dropins in [the source code](https://github.com/bellingcat/auto-archiver/tree/main/src/auto_archiver/modules/generic_extractor).

### Auto-Updates

The Generic Extractor also checks for updates to `yt-dlp` automatically (every 5 days by default).
This can be configured using the `ytdlp_update_interval` setting (or disabled by setting it to `-1`).
If you are having issues with the extractor, you can check which version of `yt-dlp` is in use with `yt-dlp --version`.

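The interval rule described above can be sketched as a small decision function. This is an illustration only, not the module's actual code: `should_check_for_update` and its parameters are hypothetical names, and treating a missing last-check timestamp as "check now" is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_check_for_update(
    last_check: Optional[datetime],
    interval_days: int,
    now: datetime,
) -> bool:
    """Decide whether a yt-dlp update check is due (hypothetical helper)."""
    if interval_days < 0:
        # Negative values (the documented value is -1) disable update checks.
        return False
    if interval_days == 0 or last_check is None:
        # 0 means "check on every run"; with no recorded check, one is assumed due.
        return True
    # Otherwise a check is due once the configured number of days has elapsed.
    return now - last_check >= timedelta(days=interval_days)


now = datetime(2025, 1, 10)
print(should_check_for_update(datetime(2025, 1, 4), 5, now))   # 6 days since last check
print(should_check_for_update(datetime(2025, 1, 8), 5, now))   # 2 days since last check
print(should_check_for_update(datetime(2025, 1, 8), -1, now))  # checks disabled
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the rule easy to test.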


## Configuration Options

### YAML
```{code} yaml

# steps configuration
steps:
...
  extractors:
  - generic_extractor
...

# module configuration
...

generic_extractor:
  subtitles: true
  comments: false
  livestreams: false
  live_from_start: false
  proxy: ''
  end_means_success: true
  allow_playlist: false
  max_downloads: inf
  bguils_po_token_method: auto
  extractor_args: {}
  ytdlp_update_interval: 5
  ytdlp_args: ''
```
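To make the settings above more concrete, here is a rough sketch of how they might translate into options for `yt-dlp`'s Python API. The option keys (`writesubtitles`, `getcomments`, `live_from_start`, `noplaylist`, `proxy`, `max_downloads`, `extractor_args`) are existing `yt-dlp` options, but the mapping itself and the `build_ytdlp_options` helper are assumptions made for illustration, not the module's actual implementation. Settings such as `livestreams` and `end_means_success` are left out because they concern the extractor's own logic.

```python
from typing import Any, Dict


def build_ytdlp_options(config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate generic_extractor settings into yt-dlp option keys (illustrative)."""
    max_downloads = config.get("max_downloads", "inf")
    options: Dict[str, Any] = {
        "writesubtitles": bool(config.get("subtitles", True)),
        "getcomments": bool(config.get("comments", False)),
        "live_from_start": bool(config.get("live_from_start", False)),
        # yt-dlp phrases this negatively: noplaylist=True means "single video only".
        "noplaylist": not config.get("allow_playlist", False),
        "extractor_args": config.get("extractor_args") or {},
    }
    if config.get("proxy"):
        # An empty proxy string is treated as "no proxy configured".
        options["proxy"] = config["proxy"]
    if str(max_downloads) != "inf":
        # 'inf' means no limit, so the key is only set for finite values.
        options["max_downloads"] = int(max_downloads)
    return options


defaults = {
    "subtitles": True,
    "comments": False,
    "live_from_start": False,
    "proxy": "",
    "allow_playlist": False,
    "max_downloads": "inf",
    "extractor_args": {},
}
print(build_ytdlp_options(defaults))
```

With the default values, the resulting dictionary has no `proxy` or `max_downloads` key, and `noplaylist` is set because `allow_playlist` is `false`.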

### Command Line
| Option | Description | Default | Type |
| --- | --- | --- | --- |
| `generic_extractor.subtitles` | Optional. Download subtitles if available. | True | bool |
| `generic_extractor.comments` | Optional. Download all comments if available; may lead to large metadata. | False | bool |
| `generic_extractor.livestreams` | Optional. If set, will download live streams; otherwise will skip them. See `--max-filesize` for more control. | False | bool |
| `generic_extractor.live_from_start` | Optional. If set, will download live streams from their earliest available moment; otherwise starts now. | False | bool |
| `generic_extractor.proxy` | Optional. HTTP/HTTPS/SOCKS proxy to use for the webdriver, e.g. https://proxy-user:password@proxy-ip:port |  | string |
| `generic_extractor.end_means_success` | Optional. If True, any archived content counts as a 'success'; if False, this extractor will not return a 'success' stage. This is useful when yt-dlp archives a video but ignores other types of content, such as images or text-only pages, that subsequent extractors can retrieve. | True | bool |
| `generic_extractor.allow_playlist` | Optional. If True will also download playlists, set to False if the expectation is to download a single video. | False | bool |
| `generic_extractor.max_downloads` | Optional. Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit. | inf | string |
| `generic_extractor.bguils_po_token_method` | Optional. Set up a Proof of origin token provider. This process has additional requirements. See [authentication](https://auto-archiver.readthedocs.io/en/latest/how_to/authentication_how_to.html) for more information. | auto | string |
| `generic_extractor.extractor_args` | Optional. Additional arguments to pass to the yt-dlp extractor. See https://github.com/yt-dlp/yt-dlp/blob/master/README.md#extractor-arguments. | {} | json_loader |
| `generic_extractor.ytdlp_update_interval` | Optional. How often to check for yt-dlp updates (days). If positive, will check and update yt-dlp every [num] days. Set it to -1 to disable, or 0 to always update on every run. | 5 | int |
| `generic_extractor.ytdlp_args` | Optional. Additional arguments to pass to yt-dlp, e.g. `--no-check-certificate` or `--plugin-dirs`. See the yt-dlp documentation for more information: https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#general-options. Note: this is not to be confused with `extractor_args`, which are specific to the extractor itself. |  | string |

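The difference between `extractor_args` (type `json_loader`) and `ytdlp_args` (type `string`) can be illustrated with a short sketch. It assumes the former arrives as a JSON string and the latter as a single shell-style string; the use of `shlex.split` and the plugin directory path are illustrative assumptions, not a description of how the module parses these values.

```python
import json
import shlex

# extractor_args is typed json_loader in the table, so a JSON string is assumed.
# The nested shape (extractor name -> argument -> list of values) follows the
# format of extractor arguments in yt-dlp's Python API.
extractor_args = json.loads('{"youtube": {"skip": ["dash", "hls"]}}')

# ytdlp_args is one plain string; shlex.split tokenises it the way a shell would.
# The plugin directory below is a hypothetical path.
ytdlp_args = shlex.split("--no-check-certificate --plugin-dirs /opt/ytdlp-plugins")

print(extractor_args)
print(ytdlp_args)
```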
[API Reference](../../../autoapi/generic_extractor/index)
