Google Sheets Feeder

Module type: feeder

GsheetsFeeder: a Google Sheets-based feeder for the Auto Archiver.

The feeder reads rows from a Google Sheet, filters them according to user-defined rules, and converts each remaining row into a Metadata object.

Features

  • Validates the sheet structure and filters rows based on input configurations.

  • Processes only worksheets allowed by the allow_worksheets and block_worksheets configurations.

  • Ensures only rows with valid URLs and unprocessed statuses are included for archival.

  • Supports organizing stored files into folder paths based on sheet and worksheet names.
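
The worksheet and row filters listed above can be sketched as follows. This is a minimal illustration, not the Auto Archiver implementation: the function names are hypothetical, a non-empty allow set is assumed to take precedence over the block set (as the option description states), and an "unprocessed" row is assumed to be one whose status cell is empty.

```python
from urllib.parse import urlparse


def should_process_worksheet(name, allow_worksheets, block_worksheets):
    """Return True if a worksheet passes the allow/block configuration."""
    # Assumption: a non-empty allow set overrides the block set entirely.
    if allow_worksheets:
        return name in allow_worksheets
    return name not in block_worksheets


def is_archivable_row(url, status):
    """Keep rows that have an http(s) URL and an empty status cell."""
    parsed = urlparse(url.strip())
    has_valid_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # Assumption: "unprocessed" means nothing has been written to the status column.
    return has_valid_url and status.strip() == ""
```

With both sets empty every worksheet is processed, which matches the "leave empty so all are allowed" behaviour described for allow_worksheets.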

Notes

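  • Requires a Google Service Account JSON file for authentication. The suggested location is secrets/gsheets_service_account.json; because the default value of service_account is secrets/service_account.json, set that option explicitly if you use the suggested path.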

  • Create the sheet using the template provided in the docs.
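
Before running the feeder it can help to confirm that the service account file is present and looks like a service account key. The sketch below uses only the standard library; check_service_account_file is a hypothetical helper that is not part of Auto Archiver, and the three keys it checks are fields normally found in Google service account key files.

```python
import json
from pathlib import Path

# Fields expected in a Google service account key file (not an exhaustive list).
REQUIRED_KEYS = {"type", "client_email", "private_key"}


def check_service_account_file(path):
    """Return a list of problems found with a service account JSON file."""
    file = Path(path)
    if not file.is_file():
        return [f"file not found: {file}"]
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - data.keys())]
    if "type" in data and data["type"] != "service_account":
        problems.append('"type" is not "service_account"')
    return problems
```

An empty list means the file passed these basic checks; it does not prove that the account has been given access to the sheet.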

Configuration Options

YAML

gsheet_feeder:
  sheet:
  sheet_id:
  header: 1
  service_account: secrets/service_account.json
  columns:
    url: link
    status: archive status
    folder: destination folder
    archive: archive location
    date: archive date
    thumbnail: thumbnail
    timestamp: upload timestamp
    title: upload title
    text: text content
    screenshot: screenshot
    hash: hash
    pdq_hash: perceptual hashes
    wacz: wacz
    replaywebpage: replaywebpage
  allow_worksheets: !!set {}
  block_worksheets: !!set {}
  use_sheet_names_in_stored_paths: true
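
To illustrate how header and columns work together, the sketch below locates the header row (1-based, as the header option specifies) and maps each configured key to the position of its column. The helper names are hypothetical, and the case-insensitive, whitespace-trimmed matching is an assumption made for the example rather than documented behaviour.

```python
def get_header_row(rows, header=1):
    """Return the header row; `header` is 1-based like the config option."""
    return rows[header - 1]


def resolve_columns(header_row, columns):
    """Map each configured key to the 0-based index of its column, or None."""
    positions = {}
    for index, cell in enumerate(header_row):
        # Assumption: names are compared after trimming and lowercasing.
        # setdefault keeps the first column when a name appears twice.
        positions.setdefault(cell.strip().lower(), index)
    return {key: positions.get(name.strip().lower()) for key, name in columns.items()}
```

A key that resolves to None means the sheet has no column with the configured name, which is one way a sheet structure check could detect a mismatch with the template.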

Command Line

| Option | Description | Default | Type |
| --- | --- | --- | --- |
| gsheet_feeder.sheet | Optional. Name of the sheet to archive. | None | string |
| gsheet_feeder.sheet_id | Optional. ID of the sheet to archive (alternative to the sheet name). | None | string |
| gsheet_feeder.header | Optional. Index of the header row (starts at 1). | 1 | int |
| gsheet_feeder.service_account | Optional. Path to the service account JSON file. | secrets/service_account.json | string |
| gsheet_feeder.columns | Optional. Names of the columns in the Google Sheet (stringified JSON object). | {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} | auto_archiver.utils.json_loader |
| gsheet_feeder.allow_worksheets | Optional. (CSV) Only worksheets whose names appear in this list are processed (overrides block_worksheets). Leave empty to allow all worksheets. | set() | string |
| gsheet_feeder.block_worksheets | Optional. (CSV) Worksheets to explicitly exclude from processing. | set() | string |
| gsheet_feeder.use_sheet_names_in_stored_paths | Optional. If True, the stored file paths include 'workbook_name/worksheet_name/…'. | True | bool |
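
The command-line values are plain strings, so they need converting before use. The sketch below shows one way to do that with the standard library; the function names are hypothetical and this is not the project's own loader. Note that json.loads requires double-quoted keys and values, so a columns value must be written as valid JSON rather than copied from the single-quoted default shown above. The path helper simply joins the two names and omits any sanitising of characters.

```python
import json


def parse_columns(value):
    """Parse the stringified JSON object given for the columns option."""
    parsed = json.loads(value)
    if not isinstance(parsed, dict):
        raise ValueError("columns must be a JSON object")
    return parsed


def parse_worksheets(value):
    """Split a comma-separated list of worksheet names into a set."""
    return {name.strip() for name in value.split(",") if name.strip()}


def stored_folder(workbook, worksheet, use_sheet_names_in_stored_paths=True):
    """Return the folder prefix for stored files, or "" when disabled."""
    if not use_sheet_names_in_stored_paths:
        return ""
    # Assumption: names are joined as-is; a real implementation may clean them.
    return f"{workbook}/{worksheet}"
```

For example, parse_worksheets("Sheet1, Sheet2") yields a set containing "Sheet1" and "Sheet2", and an empty string yields an empty set, which corresponds to the set() default.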

API Reference