
# Google Sheets Feeder Database
```{admonition} Module type

<span style='color: #FFA500'>[feeder](/core_modules.md#feeder-modules)</a></span>, <span style='color: #FF00FF'>[database](/core_modules.md#database-modules)</a></span>
```

GsheetsFeederDatabase
A Google Sheets-based feeder and optional database for the Auto Archiver.

This reads data from Google Sheets and filters rows based on user-defined rules.
The filtered rows are processed into `Metadata` objects.

### Features
- Validates the sheet structure and filters rows based on input configurations.
- Processes only worksheets allowed by the `allow_worksheets` and `block_worksheets` configurations.
- Ensures only rows with valid URLs and unprocessed statuses are included for archival.
- Supports organizing stored files into folder paths based on sheet and worksheet names.
- If the database is enabled, this updates the Google Sheet with the status of the archived URLs, including in progress, success or failure, and method used.
- Saves metadata such as title, text, timestamp, hashes, screenshots, and media URLs to designated columns.
- Formats media-specific metadata, such as thumbnails and PDQ hashes for the sheet.
- Skips redundant updates for empty or invalid data fields.

### Setup
1. Requires a Google Service Account JSON file for authentication.
To set up a service account, follow the instructions in the [how to](https://auto-archiver.readthedocs.io/en/latest/how_to/gsheets_setup.html),
or use the script:
```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/bellingcat/auto-archiver/refs/heads/main/scripts/generate_google_services.sh)"
```
2. Create a Google sheet with the required column(s) and then define the `sheet` or `sheet_id` configuration to specify this sheet.
3. Customize the column names in your Google sheet using the `columns` configuration.
4. The Google Sheet can be used solely as a feeder or as a feeder and database, but note you can't currently feed into the database from an alternate feeder.


## Configuration Options

### YAML
```{code} yaml

# steps configuration
steps:
...
  feeders:
  - gsheet_feeder_db
  databases:
  - gsheet_feeder_db
...

# module configuration
...

gsheet_feeder_db:
  sheet:
  sheet_id:
  header: 1
  service_account: secrets/service_account.json
  columns:
    url: link
    status: archive status
    folder: destination folder
    archive: archive location
    date: archive date
    thumbnail: thumbnail
    timestamp: upload timestamp
    title: upload title
    text: text content
    screenshot: screenshot
    hash: hash
    pdq_hash: perceptual hashes
    wacz: wacz
    replaywebpage: replaywebpage
  allow_worksheets: !!set {}
  block_worksheets: !!set {}
  use_sheet_names_in_stored_paths: true



```

### Command Line:
| Option | Description | Default | Type|
| --- | --- | --- | --- |
| `gsheet_feeder_db.sheet` | Optional. name of the sheet to archive | None | string |
| `gsheet_feeder_db.sheet_id` | Optional. the id of the sheet to archive (alternative to 'sheet' config) | None | string |
| `gsheet_feeder_db.header` | Optional. index of the header row (starts at 1) | 1 | int |
| `gsheet_feeder_db.service_account` | **Required**. service account JSON file path. Learn how to create one: https://gspread.readthedocs.io/en/latest/oauth2.html | secrets/service_account.json | string |
| `gsheet_feeder_db.columns` | Optional. Custom names for the columns in your Google sheet. If you don't want to use the default column names, change them with this setting | {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} | json_loader |
| `gsheet_feeder_db.allow_worksheets` | Optional. A list of worksheet names that should be processed (overrides worksheet_block), leave empty so all are allowed | set() | string |
| `gsheet_feeder_db.block_worksheets` | Optional. A list of worksheet names for worksheets that should be explicitly blocked from being processed | set() | string |
| `gsheet_feeder_db.use_sheet_names_in_stored_paths` | Optional. if True the stored files path will include 'workbook_name/worksheet_name/...' | True | bool |

[API Reference](../../../autoapi/gsheet_feeder_db/index)
