Google Sheets Feeder Database#

Module type

GsheetsFeederDatabase A Google Sheets-based feeder and optional database for the Auto Archiver.

This reads data from Google Sheets and filters rows based on user-defined rules. The filtered rows are processed into Metadata objects.

Features#

Validates the sheet structure and filters rows based on input configurations.
Processes only worksheets allowed by the allow_worksheets and block_worksheets configurations.
Ensures only rows with valid URLs and unprocessed statuses are included for archival.
Supports organizing stored files into folder paths based on sheet and worksheet names.
If the database is enabled, this updates the Google Sheet with the status of the archived URLs, including in progress, success or failure, and method used.
Saves metadata such as title, text, timestamp, hashes, screenshots, and media URLs to designated columns.
Formats media-specific metadata, such as thumbnails and PDQ hashes for the sheet.
Skips redundant updates for empty or invalid data fields.

Setup#

Requires a Google Service Account JSON file for authentication, which should be stored in secrets/gsheets_service_account.json. To set up a service account, follow the instructions here.
Define the sheet or sheet_id configuration to specify the sheet to archive.
Customize the column names in your Google sheet using the columns configuration.
The Google Sheet can be used soley as a feeder or as a feeder and database, but note you can’t currently feed into the database from an alternate feeder.

Configuration Options#

YAML#

# steps configuration
steps:
...
  feeders:
  - gsheet_feeder_db
  databases:
  - gsheet_feeder_db
...

# module configuration
...

gsheet_feeder_db:
  sheet:
  sheet_id:
  header: 1
  service_account: secrets/service_account.json
  columns:
    url: link
    status: archive status
    folder: destination folder
    archive: archive location
    date: archive date
    thumbnail: thumbnail
    timestamp: upload timestamp
    title: upload title
    text: text content
    screenshot: screenshot
    hash: hash
    pdq_hash: perceptual hashes
    wacz: wacz
    replaywebpage: replaywebpage
  allow_worksheets: !!set {}
  block_worksheets: !!set {}
  use_sheet_names_in_stored_paths: true

Command Line:#

Option	Description	Default	Type
`gsheet_feeder_db.sheet`	Optional. name of the sheet to archive	None	string
`gsheet_feeder_db.sheet_id`	Optional. the id of the sheet to archive (alternative to ‘sheet’ config)	None	string
`gsheet_feeder_db.header`	Optional. index of the header row (starts at 1)	1	int
`gsheet_feeder_db.service_account`	Required. service account JSON file path. Learn how to create one: https://gspread.readthedocs.io/en/latest/oauth2.html	secrets/service_account.json	string
`gsheet_feeder_db.columns`	Optional. Custom names for the columns in your Google sheet. If you don’t want to use the default column names, change them with this setting	{‘url’: ‘link’, ‘status’: ‘archive status’, ‘folder’: ‘destination folder’, ‘archive’: ‘archive location’, ‘date’: ‘archive date’, ‘thumbnail’: ‘thumbnail’, ‘timestamp’: ‘upload timestamp’, ‘title’: ‘upload title’, ‘text’: ‘text content’, ‘screenshot’: ‘screenshot’, ‘hash’: ‘hash’, ‘pdq_hash’: ‘perceptual hashes’, ‘wacz’: ‘wacz’, ‘replaywebpage’: ‘replaywebpage’}	json_loader
`gsheet_feeder_db.allow_worksheets`	Optional. A list of worksheet names that should be processed (overrides worksheet_block), leave empty so all are allowed	set()	string
`gsheet_feeder_db.block_worksheets`	Optional. A list of worksheet names for worksheets that should be explicitly blocked from being processed	set()	string
`gsheet_feeder_db.use_sheet_names_in_stored_paths`	Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’	True	bool

API Reference

Google Sheets Feeder Database

Contents

Google Sheets Feeder Database#

Features#

Setup#

Configuration Options#

YAML#

Command Line:#