Configuration Cheat Sheet
Below is a list of all configuration options for the core modules in Auto Archiver.
Configuration Options
| Option | Description | Default | Type |
|---|---|---|---|
| | Optional. CSV file name | db.csv | string |
| | Optional. download subtitles if available | True | bool |
| | Optional. download all comments if available, may lead to large metadata | False | bool |
| | Optional. if set, will download live streams, otherwise will skip them; see --max-filesize for more control | False | bool |
| | Optional. if set, will download live streams from their earliest available moment, otherwise starts now. | False | bool |
| | Optional. http/socks (https seems to not work atm) proxy to use for the webdriver, e.g. https://proxy-user:password@proxy-ip:port | | string |
| | Optional. if True, any archived content will mean a ‘success’, if False this archiver will not return a ‘success’ stage; this is useful for cases when yt-dlp will archive a video but ignore other types of content, like images or text-only pages, that the subsequent archivers can retrieve. | True | bool |
| | Optional. If True will also download playlists, set to False if the expectation is to download a single video. | False | bool |
| | Optional. Use to limit the number of videos to download when a channel or long page is being extracted. ‘inf’ means no limit. | inf | string |
| | Optional. hash algorithm to use | SHA-256 | string |
| | Optional. number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB | 16000000 | int |
| | Optional. if true, will group the thumbnails generated by the thumbnail enricher by id ‘thumbnail_00’ | True | string |
| | Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory. | flat | string |
| | Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash. | static | string |
| | Optional. folder where to save archived content | ./local_archive | string |
| | Optional. whether the path to the stored file is absolute or relative in the output result, including formatters (WARN: leaks the file structure) | False | string |
| | Optional. if true, will skip enriching when no media is archived | True | string |
| | Optional. how many thumbnails to generate per minute of video, can be limited by max_thumbnails | 60 | string |
| | Optional. limit the number of thumbnails to generate per video, 0 means no limit | 16 | string |
| | Required. API endpoint where calls are made to | | string |
| | Optional. API Bearer token. | None | string |
| | Optional. whether the URL should be publicly available via the API | False | bool |
| | Optional. which email to assign as author | None | string |
| | Optional. which group of users has access to the archive in case public=false | None | string |
| | Optional. if False then the API database will be queried prior to any archiving operations, and archiving will stop if the link has already been archived | True | bool |
| | Optional. when set, will send the results to the API database. | True | bool |
| | Optional. what tags to add to the archived URL | [] | string |
| | Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ | None | string |
| | Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. | | string |
| | Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ | | string |
| | Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. | | string |
| | Required. An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ | None | string |
| | Optional. The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. | | string |
| | Required. Path to the input file(s) to read the URLs from, comma separated. Input files should be formatted with one URL per line | None | valid_file |
| | Optional. Column number or name to read the URLs from, 0-indexed | None | string |
| | Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory. | url | string |
| | Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash. | static | string |
| | Required. root google drive folder ID to use as storage, found in URL: ‘https://drive.google.com/drive/folders/FOLDER_ID’ | | string |
| | Optional. JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token_filename over service_account. | None | string |
| | Optional. service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account. | secrets/service_account.json | string |
| | Optional. (CSV) only worksheets whose name is included in allow are included (overrides worksheet_block), leave empty so all are allowed | set() | string |
| | Optional. (CSV) explicitly block some worksheets from being processed | set() | string |
| | Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’ | True | bool |
| | Optional. name of the sheet to archive | None | string |
| | Optional. (alternative to sheet name) the id of the sheet to archive | None | string |
| | Optional. index of the header row (starts at 1) | 1 | int |
| | Optional. service account JSON file path | secrets/service_account.json | string |
| | Optional. names of columns in the google sheet (stringified JSON object) | {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} | auto_archiver.utils.json_loader |
| | Optional. (CSV) only worksheets whose name is included in allow are included (overrides worksheet_block), leave empty so all are allowed | set() | string |
| | Optional. (CSV) explicitly block some worksheets from being processed | set() | string |
| | Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’ | True | bool |
| | Optional. a valid instagrapi-api token | None | string |
| | Required. API endpoint to use | | string |
| | Optional. if true, will download all posts, tagged posts, stories, and highlights for a profile, if false, will only download the profile pic and information. | False | bool |
| | Optional. Use to limit the number of posts to download when full_profile is true. 0 means no limit. The limit is applied softly, since posts are fetched in batches, and once to each of: posts, tagged posts, and highlights | 0 | int |
| | Optional. if true, will remove empty values from the json output | True | bool |
| | Required. a valid Instagram username | | string |
| | Required. the corresponding Instagram account password | | string |
| | Optional. name of a folder to temporarily download content to | instaloader | string |
| | Optional. path to the instagram session which saves session credentials | secrets/instaloader.session | string |
| | Optional. telegram API_ID value, go to https://my.telegram.org/apps | None | string |
| | Optional. telegram API_HASH value, go to https://my.telegram.org/apps | None | string |
| | Optional. records the telegram login session for future usage, ‘.session’ will be appended to the provided value. | secrets/anon-insta | string |
| | Optional. timeout to fetch the instagram content in seconds. | 45 | int |
| | Optional. how to store the file in terms of directory structure: ‘flat’ sets to root; ‘url’ creates a directory based on the provided URL; ‘random’ creates a random directory. | flat | string |
| | Optional. how to name stored files: ‘random’ creates a random string; ‘static’ uses a replicable strategy such as a hash. | static | string |
| | Optional. S3 bucket name | None | string |
| | Optional. S3 region name | None | string |
| | Optional. S3 API key | None | string |
| | Optional. S3 API secret | None | string |
| | Optional. if set, it will override | False | bool |
| | Optional. S3 bucket endpoint, {region} is inserted at runtime | https://{region}.digitaloceanspaces.com | string |
| | Optional. S3 CDN url, {bucket}, {region} and {key} are inserted at runtime | https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key} | string |
| | Optional. if true S3 files will not be readable online | False | bool |
| | Optional. width of the screenshots | 1280 | string |
| | Optional. height of the screenshots | 720 | string |
| | Optional. timeout for taking the screenshot | 60 | string |
| | Optional. seconds to wait for the pages to load before taking screenshot | 4 | string |
| | Optional. http proxy to use for the webdriver, e.g. http://proxy-user:password@proxy-ip:port | | string |
| | Optional. save the page as pdf along with the screenshot. PDF saving options can be adjusted with the ‘print_options’ parameter | False | string |
| | Optional. options to pass to the pdf printer | {} | string |
| | Optional. telegram API_ID value, go to https://my.telegram.org/apps | None | string |
| | Optional. telegram API_HASH value, go to https://my.telegram.org/apps | None | string |
| | Optional. allows access to more content such as large videos, talk to @botfather | None | string |
| | Optional. records the telegram login session for future usage, ‘.session’ will be appended to the provided value. | secrets/anon | string |
| | Optional. disables the initial setup with channel_invites config, useful if you have a lot and get stuck | True | string |
| | Optional. (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup | {} | auto_archiver.utils.json_loader |
| | Optional. List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line. | ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa'] | string |
| | Optional. [deprecated: see bearer_tokens] twitter API bearer_token, which is enough for archiving; if not provided you will need consumer_key, consumer_secret, access_token, access_secret | None | string |
| | Optional. a list of twitter API bearer_tokens, which are enough for archiving; if not provided you will need consumer_key, consumer_secret, access_token, access_secret; if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line | [] | string |
| | Optional. twitter API consumer_key | None | string |
| | Optional. twitter API consumer_secret | None | string |
| | Optional. twitter API access_token | None | string |
| | Optional. twitter API access_secret | None | string |
| | Required. valid VKontakte username | | string |
| | Required. valid VKontakte password | | string |
| | Optional. path to the VKontakte session file | secrets/vk_config.v2.json | string |
| | Optional. browsertrix-profile (for profile generation see webrecorder/browsertrix-crawler). | None | string |
| | Optional. if a custom docker invocation is needed | None | string |
| | Optional. timeout for WACZ generation in seconds | 120 | string |
| | Optional. If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched. | False | string |
| | Optional. If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched. | True | string |
| | Optional. SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port, e.g. user:password@host | None | string |
| | Optional. SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host, e.g. 1234 | None | string |
| | Optional. SOCKS server proxy URL, in development | None | string |
| | Optional. seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually. | 15 | string |
| | Optional. only tell wayback to archive if no archive exists from within the specified number of seconds, use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA | None | string |
| | Required. wayback API key. to get credentials visit https://archive.org/account/s3.php | | string |
| | Required. wayback API secret. to get credentials visit https://archive.org/account/s3.php | | string |
| | Optional. http proxy to use for wayback requests, e.g. http://proxy-user:password@proxy-ip:port | None | string |
| | Optional. https proxy to use for wayback requests, e.g. https://proxy-user:password@proxy-ip:port | None | string |
| | Required. WhisperApi api endpoint, e.g. https://whisperbox-api.com/api/v1, a deployment of bellingcat/whisperbox-transcribe. | | string |
| | Required. WhisperApi api key for authentication | | string |
| | Optional. Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players). | False | string |
| | Optional. How many seconds to wait at most for a successful job completion. | 90 | string |
| | Optional. which Whisper operation to execute | translate | string |
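
Two groups of rows in the table describe how a value is used at runtime rather than only naming a setting: the hash rows (algorithm `SHA-256`, chunk size `16000000` bytes) and the S3 rows, whose endpoint and CDN URL defaults contain `{region}`, `{bucket}` and `{key}` placeholders. The sketch below illustrates both ideas in Python. It is not Auto Archiver's actual implementation: the function names `hash_file` and `fill_s3_urls`, the `HASHLIB_NAMES` mapping, and the sample bucket, region and key are assumptions made for illustration only.

```python
import hashlib

# Assumption: the configured algorithm names map to hashlib names like this.
HASHLIB_NAMES = {"SHA-256": "sha256", "SHA3-512": "sha3_512"}


def hash_file(path, algorithm="SHA-256", chunksize=16_000_000):
    """Hash a file while holding at most `chunksize` bytes in memory."""
    hasher = hashlib.new(HASHLIB_NAMES[algorithm])
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def fill_s3_urls(endpoint_url, cdn_url, bucket, region, key):
    """Substitute the {region}, {bucket} and {key} placeholders."""
    endpoint = endpoint_url.format(region=region)
    cdn = cdn_url.format(bucket=bucket, region=region, key=key)
    return endpoint, cdn


if __name__ == "__main__":
    # Hypothetical bucket, region and key; the templates are the table defaults.
    endpoint, cdn = fill_s3_urls(
        "https://{region}.digitaloceanspaces.com",
        "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}",
        bucket="my-bucket",
        region="fra1",
        key="folder/file.mp4",
    )
    print(endpoint)
    print(cdn)
```

Reading in fixed-size chunks keeps memory use bounded by the chunk size regardless of file size, which is why the table warns that a very large value can exhaust RAM. The digest is the same for any chunk size.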