Configuration Cheat Sheet

Below is a list of all configuration options for the core modules in Auto Archiver.

Configuration Options

Option

Description

Default

Type

csv_db.csv_file

Optional. CSV file name

db.csv

string

generic_extractor.subtitles

Optional. download subtitles if available

True

bool

generic_extractor.comments

Optional. download all comments if available, may lead to large metadata

False

bool

generic_extractor.livestreams

Optional. if set, will download live streams, otherwise will skip them; see --max-filesize for more control

False

bool

generic_extractor.live_from_start

Optional. if set, will download live streams from their earliest available moment, otherwise starts now.

False

bool

generic_extractor.proxy

Optional. http/socks proxy to use for the webdriver (https does not seem to work at the moment), e.g. https://proxy-user:password@proxy-ip:port

string

generic_extractor.end_means_success

Optional. if True, any archived content will mean a ‘success’; if False, this archiver will not return a ‘success’ stage. This is useful when yt-dlp archives a video but ignores other types of content, such as images or text-only pages, that subsequent archivers can retrieve.

True

bool

generic_extractor.allow_playlist

Optional. If True will also download playlists, set to False if the expectation is to download a single video.

False

bool

generic_extractor.max_downloads

Optional. Use to limit the number of videos to download when a channel or long page is being extracted. ‘inf’ means no limit.

inf

string

hash_enricher.algorithm

Optional. hash algorithm to use

SHA-256

string

hash_enricher.chunksize

Optional. number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB

16000000

int

html_formatter.detect_thumbnails

Optional. if true will group by thumbnails generated by thumbnail enricher by id ‘thumbnail_00’

True

string

local_storage.path_generator

Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory.

flat

string

local_storage.filename_generator

Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash.

static

string

local_storage.save_to

Optional. folder where to save archived content

./local_archive

string

local_storage.save_absolute

Optional. whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)

False

string

ssl_enricher.skip_when_nothing_archived

Optional. if true, will skip enriching when no media is archived

True

string

thumbnail_enricher.thumbnails_per_minute

Optional. how many thumbnails to generate per minute of video, can be limited by max_thumbnails

60

string

thumbnail_enricher.max_thumbnails

Optional. limit the number of thumbnails to generate per video, 0 means no limit

16

string

api_db.api_endpoint

Required. API endpoint where calls are made to

string

api_db.api_token

Optional. API Bearer token.

None

string

api_db.public

Optional. whether the URL should be publicly available via the API

False

bool

api_db.author_id

Optional. which email to assign as author

None

string

api_db.group_id

Optional. which group of users have access to the archive in case public=false as author

None

string

api_db.use_api_cache

Optional. if False then the API database will be queried prior to any archiving operations and stop if the link has already been archived

True

bool

api_db.store_results

Optional. when set, will send the results to the API database.

True

bool

api_db.tags

Optional. what tags to add to the archived URL

[]

string

atlos_db.api_token

Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/

None

string

atlos_db.atlos_url

Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

https://platform.atlos.org

string

atlos_feeder.api_token

Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/

string

atlos_feeder.atlos_url

Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

https://platform.atlos.org

string

atlos_storage.api_token

Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/

None

string

atlos_storage.atlos_url

Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

https://platform.atlos.org

string

csv_feeder.files

Required. Path to the input file(s) to read the URLs from, comma separated. Input files should be formatted with one URL per line

None

valid_file

csv_feeder.column

Optional. Column number or name to read the URLs from, 0-indexed

None

string

gdrive_storage.path_generator

Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory.

url

string

gdrive_storage.filename_generator

Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash.

static

string

gdrive_storage.root_folder_id

Required. root google drive folder ID to use as storage, found in URL: ‘https://drive.google.com/drive/folders/FOLDER_ID’

string

gdrive_storage.oauth_token

Optional. JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token_filename over service_account.

None

string

gdrive_storage.service_account

Optional. service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account.

secrets/service_account.json

string

gsheet_db.allow_worksheets

Optional. (CSV) only worksheets whose name is included in allow_worksheets are processed (overrides block_worksheets); leave empty to allow all

set()

string

gsheet_db.block_worksheets

Optional. (CSV) explicitly block some worksheets from being processed

set()

string

gsheet_db.use_sheet_names_in_stored_paths

Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’

True

bool

gsheet_feeder.sheet

Optional. name of the sheet to archive

None

string

gsheet_feeder.sheet_id

Optional. (alternative to sheet name) the id of the sheet to archive

None

string

gsheet_feeder.header

Optional. index of the header row (starts at 1)

1

int

gsheet_feeder.service_account

Optional. service account JSON file path

secrets/service_account.json

string

gsheet_feeder.columns

Optional. names of columns in the google sheet (stringified JSON object)

{'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}

auto_archiver.utils.json_loader

gsheet_feeder.allow_worksheets

Optional. (CSV) only worksheets whose name is included in allow_worksheets are processed (overrides block_worksheets); leave empty to allow all

set()

string

gsheet_feeder.block_worksheets

Optional. (CSV) explicitly block some worksheets from being processed

set()

string

gsheet_feeder.use_sheet_names_in_stored_paths

Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’

True

bool

instagram_api_extractor.access_token

Optional. a valid instagrapi-api token

None

string

instagram_api_extractor.api_endpoint

Required. API endpoint to use

string

instagram_api_extractor.full_profile

Optional. if true, will download all posts, tagged posts, stories, and highlights for a profile, if false, will only download the profile pic and information.

False

bool

instagram_api_extractor.full_profile_max_posts

Optional. Use to limit the number of posts to download when full_profile is true. 0 means no limit. The limit is soft because posts are fetched in batches; it is applied separately to posts, tagged posts, and highlights

0

int

instagram_api_extractor.minimize_json_output

Optional. if true, will remove empty values from the json output

True

bool

instagram_extractor.username

Required. a valid Instagram username

string

instagram_extractor.password

Required. the corresponding Instagram account password

string

instagram_extractor.download_folder

Optional. name of a folder to temporarily download content to

instaloader

string

instagram_extractor.session_file

Optional. path to the instagram session which saves session credentials

secrets/instaloader.session

string

instagram_tbot_extractor.api_id

Optional. telegram API_ID value, go to https://my.telegram.org/apps

None

string

instagram_tbot_extractor.api_hash

Optional. telegram API_HASH value, go to https://my.telegram.org/apps

None

string

instagram_tbot_extractor.session_file

Optional. records the telegram login session for future usage; ‘.session’ will be appended to the provided value.

secrets/anon-insta

string

instagram_tbot_extractor.timeout

Optional. timeout to fetch the instagram content in seconds.

45

int

s3_storage.path_generator

Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory.

flat

string

s3_storage.filename_generator

Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash.

static

string

s3_storage.bucket

Optional. S3 bucket name

None

string

s3_storage.region

Optional. S3 region name

None

string

s3_storage.key

Optional. S3 API key

None

string

s3_storage.secret

Optional. S3 API secret

None

string

s3_storage.random_no_duplicate

Optional. if set, it will override path_generator, filename_generator and folder. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path no-dups/

False

bool

s3_storage.endpoint_url

Optional. S3 bucket endpoint, {region} is inserted at runtime

https://{region}.digitaloceanspaces.com

string

s3_storage.cdn_url

Optional. S3 CDN url, {bucket}, {region} and {key} are inserted at runtime

https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}

string

s3_storage.private

Optional. if true S3 files will not be readable online

False

bool

screenshot_enricher.width

Optional. width of the screenshots

1280

string

screenshot_enricher.height

Optional. height of the screenshots

720

string

screenshot_enricher.timeout

Optional. timeout for taking the screenshot

60

string

screenshot_enricher.sleep_before_screenshot

Optional. seconds to wait for the pages to load before taking screenshot

4

string

screenshot_enricher.http_proxy

Optional. http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port

string

screenshot_enricher.save_to_pdf

Optional. save the page as pdf along with the screenshot. PDF saving options can be adjusted with the ‘print_options’ parameter

False

string

screenshot_enricher.print_options

Optional. options to pass to the pdf printer

{}

string

telethon_extractor.api_id

Optional. telegram API_ID value, go to https://my.telegram.org/apps

None

string

telethon_extractor.api_hash

Optional. telegram API_HASH value, go to https://my.telegram.org/apps

None

string

telethon_extractor.bot_token

Optional. allows access to more content such as large videos, talk to @botfather

None

string

telethon_extractor.session_file

Optional. records the telegram login session for future usage; ‘.session’ will be appended to the provided value.

secrets/anon

string

telethon_extractor.join_channels

Optional. if True, runs the initial setup that joins the channels listed in channel_invites; set to False to disable it, useful if you have many invites and startup gets stuck

True

string

telethon_extractor.channel_invites

Optional. (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup

{}

auto_archiver.utils.json_loader

timestamping_enricher.tsa_urls

Optional. List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.

['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa']

string

twitter_api_extractor.bearer_token

Optional. [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret

None

string

twitter_api_extractor.bearer_tokens

Optional. a list of twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret, if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line

[]

string

twitter_api_extractor.consumer_key

Optional. twitter API consumer_key

None

string

twitter_api_extractor.consumer_secret

Optional. twitter API consumer_secret

None

string

twitter_api_extractor.access_token

Optional. twitter API access_token

None

string

twitter_api_extractor.access_secret

Optional. twitter API access_secret

None

string

vk_extractor.username

Required. valid VKontakte username

string

vk_extractor.password

Required. valid VKontakte password

string

vk_extractor.session_file

Optional. path to the VKontakte session file

secrets/vk_config.v2.json

string

wacz_enricher.profile

Optional. browsertrix-profile (for profile generation see webrecorder/browsertrix-crawler).

None

string

wacz_enricher.docker_commands

Optional. if a custom docker invocation is needed

None

string

wacz_enricher.timeout

Optional. timeout for WACZ generation in seconds

120

string

wacz_enricher.extract_media

Optional. If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.

False

string

wacz_enricher.extract_screenshot

Optional. If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.

True

string

wacz_enricher.socks_proxy_host

Optional. SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host

None

string

wacz_enricher.socks_proxy_port

Optional. SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234

None

string

wacz_enricher.proxy_server

Optional. SOCKS server proxy URL, in development

None

string

wayback_extractor_enricher.timeout

Optional. seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually.

15

string

wayback_extractor_enricher.if_not_archived_within

Optional. only ask wayback to archive if no existing archive was made within the specified number of seconds; use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA

None

string

wayback_extractor_enricher.key

Required. wayback API key. to get credentials visit https://archive.org/account/s3.php

string

wayback_extractor_enricher.secret

Required. wayback API secret. to get credentials visit https://archive.org/account/s3.php

string

wayback_extractor_enricher.proxy_http

Optional. http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port

None

string

wayback_extractor_enricher.proxy_https

Optional. https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port

None

string

whisper_enricher.api_endpoint

Required. WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of bellingcat/whisperbox-transcribe.

string

whisper_enricher.api_key

Required. WhisperApi api key for authentication

string

whisper_enricher.include_srt

Optional. Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players).

False

string

whisper_enricher.timeout

Optional. How many seconds to wait at most for a successful job completion.

90

string

whisper_enricher.action

Optional. which Whisper operation to execute

translate

string
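
The hash_enricher.chunksize option in the table above bounds how much of a file is held in memory at once. The following is a minimal sketch of chunked hashing with Python's standard hashlib, assuming SHA-256 and the default chunk size of 16000000 bytes. The function name hash_file is hypothetical and this is not Auto Archiver's own code.

```python
import hashlib


def hash_file(path: str, chunksize: int = 16_000_000) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read at most `chunksize` bytes per iteration so memory use stays bounded.
        while chunk := f.read(chunksize):
            digest.update(chunk)
    return digest.hexdigest()
```

A smaller chunk size lowers peak memory at the cost of more read calls; the resulting digest is the same either way.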
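
The interaction between thumbnail_enricher.thumbnails_per_minute and thumbnail_enricher.max_thumbnails can be illustrated as follows. This is a sketch under an assumed formula (count scales linearly with duration, then is capped); the function thumbnail_count is hypothetical and the real enricher may round or sample differently.

```python
import math


def thumbnail_count(duration_seconds: float,
                    thumbnails_per_minute: int = 60,
                    max_thumbnails: int = 16) -> int:
    """Estimate how many thumbnails would be generated for one video."""
    # Assumption: the uncapped count grows linearly with the video duration.
    wanted = max(1, math.ceil(duration_seconds * thumbnails_per_minute / 60))
    # Per the table above, a max_thumbnails of 0 means no limit.
    if max_thumbnails > 0:
        wanted = min(wanted, max_thumbnails)
    return wanted
```

With the defaults, any video longer than 16 seconds would hit the cap of 16 under this assumed formula.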
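
The s3_storage.endpoint_url and s3_storage.cdn_url defaults are templates whose placeholders are filled in at runtime. A minimal sketch of that substitution using str.format is shown here. It assumes {key} refers to the stored object's key (its path in the bucket) rather than the s3_storage.key API key; the function render_s3_urls and the sample values are hypothetical.

```python
ENDPOINT_URL = "https://{region}.digitaloceanspaces.com"
CDN_URL = "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"


def render_s3_urls(bucket: str, region: str, key: str) -> tuple:
    """Fill the URL templates with concrete bucket, region, and object key values."""
    endpoint = ENDPOINT_URL.format(region=region)
    cdn = CDN_URL.format(bucket=bucket, region=region, key=key)
    return endpoint, cdn


# Hypothetical values, for illustration only.
endpoint, cdn = render_s3_urls("my-bucket", "fra1", "folder/file.mp4")
```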
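
gsheet_feeder.columns is passed as a stringified JSON object. The sketch below shows how such a string could be parsed with the standard json module. Merging the parsed value over a set of defaults is an assumption made for illustration, DEFAULT_COLUMNS lists only a subset of the defaults, and parse_columns is a hypothetical helper rather than auto_archiver.utils.json_loader.

```python
import json

# Subset of the default mapping from the table above, for illustration.
DEFAULT_COLUMNS = {
    "url": "link",
    "status": "archive status",
    "archive": "archive location",
}


def parse_columns(raw: str) -> dict:
    """Parse a stringified JSON object and overlay it on the defaults."""
    overrides = json.loads(raw)
    if not isinstance(overrides, dict):
        raise ValueError("columns must be a JSON object")
    # Assumption: user-supplied names replace defaults key by key.
    return {**DEFAULT_COLUMNS, **overrides}
```

JSON requires double-quoted keys and values, so a value such as '{"url": "source url"}' parses, while single-quoted text does not.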
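
csv_feeder.column accepts either a 0-indexed column number or a column name. One way such a lookup could work is sketched below. It assumes that a name refers to a header in the first row and that a numeric column treats every row as data; read_urls is a hypothetical helper and the actual feeder may handle headers differently.

```python
import csv
import io


def read_urls(csv_text: str, column=None) -> list:
    """Extract URLs from CSV text by column index (0-indexed) or header name."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return []
    if column is None:
        # No column given: treat the input as one URL per line.
        return [row[0] for row in rows]
    if str(column).isdigit():
        index, data = int(column), rows
    else:
        # Assumption: the first row is a header containing the column name.
        index, data = rows[0].index(column), rows[1:]
    return [row[index] for row in data if len(row) > index]
```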