Google Sheets Feeder Database

Google Sheets Feeder Database#

Module type

feeder, database

GsheetsFeederDatabase A Google Sheets-based feeder and optional database for the Auto Archiver.

This reads data from Google Sheets and filters rows based on user-defined rules. The filtered rows are processed into Metadata objects.

Features#

  • Validates the sheet structure and filters rows based on input configurations.

  • Processes only worksheets allowed by the allow_worksheets and block_worksheets configurations.

  • Ensures only rows with valid URLs and unprocessed statuses are included for archival.

  • Supports organizing stored files into folder paths based on sheet and worksheet names.

  • If the database is enabled, this updates the Google Sheet with the status of the archived URLs, including in progress, success or failure, and method used.

  • Saves metadata such as title, text, timestamp, hashes, screenshots, and media URLs to designated columns.

  • Formats media-specific metadata, such as thumbnails and PDQ hashes for the sheet.

  • Skips redundant updates for empty or invalid data fields.

Setup#

  • Requires a Google Service Account JSON file for authentication, which should be stored in secrets/gsheets_service_account.json. To set up a service account, follow the instructions here.

  • Define the sheet or sheet_id configuration to specify the sheet to archive.

  • Customize the column names in your Google sheet using the columns configuration.

  • The Google Sheet can be used soley as a feeder or as a feeder and database, but note you can’t currently feed into the database from an alternate feeder.

Configuration Options#

YAML#

# steps configuration
steps:
...
  feeders:
  - gsheet_feeder_db
  databases:
  - gsheet_feeder_db
...

# module configuration
...

gsheet_feeder_db:
  sheet:
  sheet_id:
  header: 1
  service_account: secrets/service_account.json
  columns:
    url: link
    status: archive status
    folder: destination folder
    archive: archive location
    date: archive date
    thumbnail: thumbnail
    timestamp: upload timestamp
    title: upload title
    text: text content
    screenshot: screenshot
    hash: hash
    pdq_hash: perceptual hashes
    wacz: wacz
    replaywebpage: replaywebpage
  allow_worksheets: !!set {}
  block_worksheets: !!set {}
  use_sheet_names_in_stored_paths: true

Command Line:#

Option

Description

Default

Type

gsheet_feeder_db.sheet

Optional. name of the sheet to archive

None

string

gsheet_feeder_db.sheet_id

Optional. the id of the sheet to archive (alternative to ‘sheet’ config)

None

string

gsheet_feeder_db.header

Optional. index of the header row (starts at 1)

1

int

gsheet_feeder_db.service_account

Required. service account JSON file path. Learn how to create one: https://gspread.readthedocs.io/en/latest/oauth2.html

secrets/service_account.json

string

gsheet_feeder_db.columns

Optional. Custom names for the columns in your Google sheet. If you don’t want to use the default column names, change them with this setting

{‘url’: ‘link’, ‘status’: ‘archive status’, ‘folder’: ‘destination folder’, ‘archive’: ‘archive location’, ‘date’: ‘archive date’, ‘thumbnail’: ‘thumbnail’, ‘timestamp’: ‘upload timestamp’, ‘title’: ‘upload title’, ‘text’: ‘text content’, ‘screenshot’: ‘screenshot’, ‘hash’: ‘hash’, ‘pdq_hash’: ‘perceptual hashes’, ‘wacz’: ‘wacz’, ‘replaywebpage’: ‘replaywebpage’}

json_loader

gsheet_feeder_db.allow_worksheets

Optional. A list of worksheet names that should be processed (overrides worksheet_block), leave empty so all are allowed

set()

string

gsheet_feeder_db.block_worksheets

Optional. A list of worksheet names for worksheets that should be explicitly blocked from being processed

set()

string

gsheet_feeder_db.use_sheet_names_in_stored_paths

Optional. if True the stored files path will include ‘workbook_name/worksheet_name/…’

True

bool

API Reference