Generic Extractor

Module type

This is the generic extractor used by auto-archiver. It uses yt-dlp under the hood.

This module is responsible for downloading and processing media content from platforms supported by yt-dlp, such as YouTube, Facebook, and others. It provides functionality for retrieving videos, subtitles, comments, and other metadata, and it integrates with the broader archiving framework.

Features

Supports downloading videos and playlists.
Retrieves metadata like titles, descriptions, upload dates, and durations.
Downloads subtitles and comments when enabled.
Configurable options for handling live streams, proxies, and more.
Supports authenticating to websites using the ‘authentication’ settings from your orchestration configuration.
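
As a rough illustration of the metadata and live-stream handling listed above, the sketch below picks a few fields out of a yt-dlp style info dictionary. The keys `title`, `description`, `upload_date`, `duration` and `is_live` are fields yt-dlp commonly reports; the helper names `summarize_info` and `should_skip` are hypothetical and are not taken from the auto-archiver code base.

```python
from datetime import datetime, timezone


def summarize_info(info: dict) -> dict:
    """Pick a few common fields out of a yt-dlp style info dict (hypothetical helper)."""
    upload_date = info.get("upload_date")  # yt-dlp reports this as 'YYYYMMDD'
    uploaded = (
        datetime.strptime(upload_date, "%Y%m%d").replace(tzinfo=timezone.utc)
        if upload_date
        else None
    )
    return {
        "title": info.get("title"),
        "description": info.get("description"),
        "uploaded": uploaded,
        "duration_seconds": info.get("duration"),
        "is_live": bool(info.get("is_live")),
    }


def should_skip(info: dict, livestreams: bool = False) -> bool:
    """Skip live streams unless the `livestreams` option is enabled (assumed behaviour)."""
    return bool(info.get("is_live")) and not livestreams


# Hand-written sample data standing in for a real extraction result.
sample = {"title": "Demo", "upload_date": "20240131", "duration": 12.5, "is_live": False}
summary = summarize_info(sample)
```

Parsing `upload_date` into a timezone-aware datetime is a design choice of this sketch; the module itself may store the date differently.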

Dropins

For websites supported by yt-dlp that contain posts in addition to videos (e.g. Facebook, Twitter, Bluesky), dropins can be created to extract post data and create metadata objects. Some dropins are included with the generic extractor by default. Custom dropins can be created to handle additional websites and passed to the archiver on the command line using the --dropins option (TODO!).
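
To make the idea of a dropin more concrete, here is a minimal sketch of what one could look like. Everything in it is an assumption made for illustration: the class name `ExampleDropin`, the methods `suitable` and `extract_post`, the `raw` payload keys, and the `pick_dropin` helper are hypothetical and do not describe the project's actual dropin interface.

```python
from urllib.parse import urlparse


class ExampleDropin:
    """Hypothetical dropin for posts on example.com; not the project's real base class."""

    def suitable(self, url: str) -> bool:
        # Claim only URLs on the site this dropin understands.
        host = urlparse(url).hostname or ""
        return host == "example.com" or host.endswith(".example.com")

    def extract_post(self, url: str, raw: dict) -> dict:
        # `raw` stands in for whatever post payload the site returns.
        return {
            "url": url,
            "title": raw.get("headline", ""),
            "content": raw.get("body", ""),
            "author": raw.get("author"),
        }


def pick_dropin(url: str, dropins: list):
    """Return the first dropin that accepts the URL, or None if none does."""
    return next((d for d in dropins if d.suitable(url)), None)


dropin = pick_dropin("https://www.example.com/post/1", [ExampleDropin()])
post = dropin.extract_post(
    "https://www.example.com/post/1",
    {"headline": "Hello", "body": "Text-only post", "author": "someone"},
)
```

Matching on the hostname keeps each dropin responsible for a single site, so URLs that no dropin claims can fall through to plain video extraction.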

Configuration Options

YAML

generic_extractor:
  subtitles: true
  comments: false
  livestreams: false
  live_from_start: false
  proxy: ''
  end_means_success: true
  allow_playlist: false
  max_downloads: inf
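
The sketch below shows one way settings like these could be translated into options for yt-dlp's Python API. The option keys `writesubtitles`, `getcomments`, `live_from_start`, `noplaylist` and `proxy` are existing `yt_dlp.YoutubeDL` parameters, but the mapping itself and the function name `build_ydl_options` are assumptions, not the module's actual code.

```python
def build_ydl_options(config: dict) -> dict:
    """Translate generic_extractor settings into yt-dlp style options (illustrative)."""
    opts = {
        "writesubtitles": config.get("subtitles", True),
        "getcomments": config.get("comments", False),
        "live_from_start": config.get("live_from_start", False),
        # yt-dlp expresses the playlist switch in the negative.
        "noplaylist": not config.get("allow_playlist", False),
    }
    # Only pass a proxy when one is configured; an empty string means "none".
    if config.get("proxy"):
        opts["proxy"] = config["proxy"]
    return opts


defaults = {
    "subtitles": True,
    "comments": False,
    "live_from_start": False,
    "proxy": "",
    "allow_playlist": False,
}
options = build_ydl_options(defaults)
```

With yt-dlp installed, a dictionary like `options` could be passed to `yt_dlp.YoutubeDL(options)`.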

Command Line:

Option	Description	Default	Type
`generic_extractor.subtitles`	Optional. Download subtitles if available.	True	bool
`generic_extractor.comments`	Optional. Download all comments if available; this may produce large metadata.	False	bool
`generic_extractor.livestreams`	Optional. If set, live streams are downloaded; otherwise they are skipped. See --max-filesize for more control.	False	bool
`generic_extractor.live_from_start`	Optional. If set, live streams are downloaded from their earliest available moment; otherwise the download starts from the current moment.	False	bool
`generic_extractor.proxy`	Optional. HTTP or SOCKS proxy to use for the webdriver (HTTPS does not seem to work at the moment), e.g. http://proxy-user:password@proxy-ip:port		string
`generic_extractor.end_means_success`	Optional. If True, any archived content counts as a ‘success’; if False, this archiver does not return a ‘success’ stage. This is useful when yt-dlp archives a video but ignores other types of content, such as images or text-only pages, that subsequent archivers can retrieve.	True	bool
`generic_extractor.allow_playlist`	Optional. If True, playlists are also downloaded; set to False when a single video is expected.	False	bool
`generic_extractor.max_downloads`	Optional. Limits the number of videos downloaded when a channel or long page is being extracted. ‘inf’ means no limit.	inf	string
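
The semantics of `max_downloads` and `end_means_success` can be sketched as follows. This is a reading of the descriptions in the table, not the module's implementation, and the helper names `parse_max_downloads`, `limit_entries` and `is_final_success` are hypothetical.

```python
import math


def parse_max_downloads(value) -> float:
    """Turn the `max_downloads` setting into a number; 'inf' means no limit."""
    if value == math.inf or (isinstance(value, str) and value.strip().lower() == "inf"):
        return math.inf
    limit = int(value)
    if limit < 1:
        raise ValueError("max_downloads must be a positive integer or 'inf'")
    return limit


def limit_entries(entries: list, max_downloads) -> list:
    """Keep at most `max_downloads` entries from a channel or long page."""
    limit = parse_max_downloads(max_downloads)
    return [entry for index, entry in enumerate(entries) if index < limit]


def is_final_success(archived_anything: bool, end_means_success: bool) -> bool:
    """Report a final 'success' only when content was archived and the option allows it."""
    # With end_means_success=False, later archivers still get a chance to pick up
    # images or text-only pages that yt-dlp ignored.
    return archived_anything and end_means_success


videos = ["a", "b", "c", "d", "e"]
first_two = limit_entries(videos, "2")
everything = limit_entries(videos, "inf")
```

Treating the setting as a string first mirrors the `string` type listed in the table, where ‘inf’ is the default value.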
