Google Sheets Feeder
Module type
GsheetsFeeder: a Google Sheets-based feeder for the Auto Archiver.
It reads rows from a Google Sheet and filters them according to user-defined rules.
Each row that passes the filters is converted into a Metadata object.
Features
Validates the sheet structure and filters rows based on input configurations.
Processes only worksheets allowed by the allow_worksheets and block_worksheets configurations.
Ensures only rows with valid URLs and unprocessed statuses are included for archival.
Supports organizing stored files into folder paths based on sheet and worksheet names.
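The worksheet and row filtering described above can be sketched roughly as follows. This is a minimal illustration, not the feeder's actual code: the helper names `worksheet_allowed` and `row_needs_archiving` are hypothetical, and it assumes that a "valid URL" means an http(s) URL with a host and that an "unprocessed" row is one whose status cell is empty.

```python
from urllib.parse import urlparse


def worksheet_allowed(name, allow_worksheets, block_worksheets):
    """Hypothetical helper: decide whether a worksheet should be processed.

    A non-empty allow set takes precedence over the block set;
    an empty allow set means every non-blocked worksheet is processed.
    """
    if allow_worksheets:
        return name in allow_worksheets
    return name not in block_worksheets


def row_needs_archiving(url, status):
    """Hypothetical helper: keep rows with an http(s) URL and an empty status.

    Assumption: an empty status cell marks a row as unprocessed.
    """
    parsed = urlparse(url.strip())
    has_valid_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return has_valid_url and status.strip() == ""
```

With these helpers, a row such as `("https://example.com/post", "")` would be kept, while a row with free text in the URL cell or an already filled status cell would be skipped.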
Notes
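Requires a Google Service Account JSON file for authentication. The suggested location is secrets/gsheets_service_account.json; note that the service_account option defaults to secrets/service_account.json, so set it explicitly if you use the suggested path.
Create the sheet using the template provided in the docs.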
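As an illustration of how a service account file can be used to open a sheet by name or by id, here is a minimal sketch. It assumes the third-party `gspread` library is the client; this page does not state which client the feeder uses, and `open_spreadsheet` is a hypothetical helper name.

```python
def open_spreadsheet(service_account_path, sheet=None, sheet_id=None):
    """Hypothetical helper: open a spreadsheet by id (preferred) or by name."""
    if not sheet and not sheet_id:
        raise ValueError("provide either a sheet name or a sheet id")

    # Assumption: gspread is installed; imported lazily so the argument
    # check above works even without the dependency.
    import gspread

    client = gspread.service_account(filename=service_account_path)
    if sheet_id:
        return client.open_by_key(sheet_id)
    return client.open(sheet)
```

The spreadsheet must be shared with the service account's email address, otherwise the open call will fail with a permissions error.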
Configuration Options
YAML
gsheet_feeder:
  sheet:
  sheet_id:
  header: 1
  service_account: secrets/service_account.json
  columns:
    url: link
    status: archive status
    folder: destination folder
    archive: archive location
    date: archive date
    thumbnail: thumbnail
    timestamp: upload timestamp
    title: upload title
    text: text content
    screenshot: screenshot
    hash: hash
    pdq_hash: perceptual hashes
    wacz: wacz
    replaywebpage: replaywebpage
  allow_worksheets: !!set {}
  block_worksheets: !!set {}
  use_sheet_names_in_stored_paths: true
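To make the roles of `header` and `columns` concrete, the sketch below resolves each configured field to a column position in the header row. It is an illustration under stated assumptions rather than the feeder's implementation: `column_indexes` is a hypothetical helper, and the case-insensitive matching of header titles is an assumption.

```python
def column_indexes(rows, columns, header=1):
    """Hypothetical helper: map configured fields to 0-based column indexes.

    rows    -- all worksheet rows as lists of cell strings
    columns -- mapping of field name to column title, as in the config
    header  -- 1-based index of the header row, as in the config
    """
    header_cells = rows[header - 1]  # convert the 1-based setting to a list index
    lookup = {cell.strip().lower(): idx for idx, cell in enumerate(header_cells)}
    # Fields whose column title is absent from the sheet are left out.
    return {
        field: lookup[title.lower()]
        for field, title in columns.items()
        if title.lower() in lookup
    }


rows = [
    ["Link", "Archive Status", "Notes"],
    ["https://example.com/post", "", "first entry"],
]
columns = {"url": "link", "status": "archive status", "folder": "destination folder"}
indexes = column_indexes(rows, columns)
```

Here `indexes` contains positions for `url` and `status` only, because the example sheet has no "destination folder" column.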
Command Line:
| Option | Description | Default | Type |
|---|---|---|---|
| sheet | Optional. Name of the sheet to archive | None | string |
| sheet_id | Optional. (Alternative to the sheet name) the id of the sheet to archive | None | string |
| header | Optional. Index of the header row (starts at 1) | 1 | int |
| service_account | Optional. Service account JSON file path | secrets/service_account.json | string |
| columns | Optional. Names of columns in the Google Sheet (stringified JSON object) | {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} | auto_archiver.utils.json_loader |
| allow_worksheets | Optional. (CSV) Only worksheets whose names are listed here are processed (overrides block_worksheets); leave empty to allow all worksheets | set() | string |
| block_worksheets | Optional. (CSV) Explicitly block some worksheets from being processed | set() | string |
| use_sheet_names_in_stored_paths | Optional. If True, the stored files path will include 'workbook_name/worksheet_name/…' | True | bool |
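The table describes some values as a stringified JSON object or as CSV. The sketch below shows one way such strings could be turned into Python values, plus how a path prefix might be derived from the workbook and worksheet names. All three helper names are hypothetical, and the real feeder may normalise names differently (for example by slugifying them).

```python
import json


def parse_columns(value):
    """Hypothetical helper: parse the stringified JSON object for columns."""
    return json.loads(value)


def parse_worksheet_csv(value):
    """Hypothetical helper: turn a CSV string into a set of worksheet names.

    Surrounding whitespace is trimmed and empty entries are dropped,
    so an empty string yields an empty set.
    """
    return {name.strip() for name in value.split(",") if name.strip()}


def stored_path_prefix(workbook_name, worksheet_name,
                       use_sheet_names_in_stored_paths=True):
    """Hypothetical helper: build the 'workbook_name/worksheet_name' prefix."""
    if not use_sheet_names_in_stored_paths:
        return ""
    return f"{workbook_name}/{worksheet_name}"
```

Note that the JSON form requires double quotes, for example `'{"url": "link", "status": "archive status"}'`, unlike the Python-style default shown in the table.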