core.orchestrator

Orchestrates all archiving steps: feeding items, archiving them with specific archivers, enrichment, storage, formatting, database operations, and cleanup.

Module Contents

class core.orchestrator.ArchivingOrchestrator
module_factory: core.module.ModuleFactory
setup_finished: bool
logger_id: int
feeders: List[Type[core.Feeder]]
extractors: List[Type[core.Extractor]]
enrichers: List[Type[core.Enricher]]
databases: List[Type[core.Database]]
storages: List[Type[core.Storage]]
formatters: List[Type[core.Formatter]]
setup_basic_parser()
check_steps(config)
setup_complete_parser(basic_config: dict, yaml_config: dict, unused_args: list[str]) -> None
add_modules_args(parser: argparse.ArgumentParser = None)
add_additional_args(parser: argparse.ArgumentParser = None)
add_individual_module_args(modules: list[core.module.LazyBaseModule] = None, parser: argparse.ArgumentParser = None) -> None
show_help(basic_config: dict)
setup_logging(config)
install_modules(modules_by_type)

Traverses all modules in ‘steps’ and loads them into the orchestrator, storing them in the orchestrator’s attributes (self.feeders, self.extractors etc.). If no modules of a certain type are loaded, the program will exit with an error message.
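
The exit-on-missing-type behaviour described above can be illustrated with a minimal sketch. The function name `install_modules_sketch`, the shape of the `steps` mapping, and the error text are assumptions made for illustration; only the six attribute names come from the class listing, and the real method loads module classes rather than plain strings.

```python
import sys

# Attribute names taken from the class listing above.
MODULE_TYPES = ["feeders", "extractors", "enrichers", "databases", "storages", "formatters"]


def install_modules_sketch(steps: dict[str, list[str]]) -> dict[str, list[str]]:
    """Collect module names per type and exit if any type has none.

    Hypothetical helper: `steps` is assumed to map a module type to a list
    of module names.
    """
    loaded: dict[str, list[str]] = {}
    for module_type in MODULE_TYPES:
        names = steps.get(module_type, [])
        if not names:
            # sys.exit raises SystemExit carrying this message.
            sys.exit(f"No {module_type} were configured; at least one is required.")
        loaded[module_type] = list(names)
    return loaded
```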

load_config(config_file: str) -> dict
setup_config(args: list) -> dict

Sets up the configuration file, merging the default config with the user’s config.

This function should only ever be run once.
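
The page does not state how the two configs are merged. A recursive merge in which user values take precedence is one common approach, sketched below; `merge_configs` is a hypothetical helper, not part of the documented API.

```python
from copy import deepcopy


def merge_configs(default: dict, user: dict) -> dict:
    """Return a new dict where `user` values override `default` values.

    Nested dicts are merged key by key; any other value is replaced
    wholesale. Neither input is modified.
    """
    merged = deepcopy(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With this approach a user config only needs to list the keys it changes; everything else falls back to the defaults.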

check_for_updates()
setup(args: list)

Configures the orchestrator: sets up the configuration and loads the modules.

This method should only ever be called once.
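
One way to enforce a call-once rule is a boolean flag such as the `setup_finished` attribute listed above. Whether the real class raises, logs, or silently returns on a second call is not stated here, so the `RuntimeError` below is an assumption, and `OnceOnlySetup` is a hypothetical stand-in class.

```python
class OnceOnlySetup:
    """Hypothetical stand-in showing a call-once guard."""

    def __init__(self) -> None:
        self.setup_finished = False
        self.args: list = []

    def setup(self, args: list) -> None:
        if self.setup_finished:
            # Assumed behaviour: refuse a second call outright.
            raise RuntimeError("setup() has already been called")
        self.args = list(args)
        self.setup_finished = True
```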

cleanup() -> None
feed() -> Generator[core.metadata.Metadata]
feed_item(item: core.metadata.Metadata) -> core.metadata.Metadata
Takes one item (URL) to archive and calls self.archive. Additionally, it:
  • catches keyboard interruptions to do a clean exit
  • catches any unexpected error, logs it, and does a clean exit
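
The two error-handling cases above can be sketched as follows. The helper name `feed_item_sketch`, the injected `archive` and `cleanup` callables, and the exit codes are assumptions for illustration; the page only says that both cases end in a clean exit.

```python
import logging

logger = logging.getLogger("orchestrator_sketch")


def feed_item_sketch(item, archive, cleanup):
    """Call `archive(item)`; on interruption or error, clean up and exit."""
    try:
        return archive(item)
    except KeyboardInterrupt:
        logger.warning("Interrupted by user, cleaning up")
        cleanup()
        raise SystemExit(0)  # assumed exit code
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("Unexpected error while archiving %r", item)
        cleanup()
        raise SystemExit(1)  # assumed exit code
```

`KeyboardInterrupt` is handled in its own clause because it derives from `BaseException`, so a plain `except Exception` would not catch it.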

archive(result: core.metadata.Metadata) -> core.metadata.Metadata | None

Runs the archiving process for a single URL:
  1. Each archiver can sanitize its own URLs
  2. Check for cached results in Databases, and signal start to the databases
  3. Call Archivers until one succeeds
  4. Call Enrichers
  5. Store all downloaded/generated media
  6. Call selected Formatter and store formatted if needed
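
The six steps above can be sketched as a simplified pipeline. Everything in this block is a stand-in: the result is a plain dict rather than a `Metadata` object, each stage is an injected callable, the "signal start" part of step 2 is omitted, and the names `archive_sketch`, `cache_lookup`, and `store` are hypothetical.

```python
def archive_sketch(url, sanitizers, cache_lookup, extractors, enrichers, store, formatter):
    """Simplified illustration of the six archiving steps."""
    # 1. Let each sanitizer clean up the URL.
    for sanitize in sanitizers:
        url = sanitize(url)

    # 2. Return a cached result if one exists.
    cached = cache_lookup(url)
    if cached is not None:
        return cached

    result = {"url": url, "media": [], "success": False}

    # 3. Try extractors in order and stop at the first one that succeeds.
    for extract in extractors:
        if extract(result):
            result["success"] = True
            break

    # 4. Run every enricher on the result.
    for enrich in enrichers:
        enrich(result)

    # 5. Store all downloaded or generated media.
    for media in result["media"]:
        store(media)

    # 6. Format the result and store the formatted output if there is any.
    formatted = formatter(result)
    if formatted is not None:
        store(formatted)
        result["formatted"] = formatted
    return result
```

Stopping at the first successful extractor in step 3 means the order of the configured extractors matters: more specific ones would normally come before generic fallbacks.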

setup_authentication(config: dict) -> dict

Sets up authentication for all modules that require it.

Strings that contain several comma-separated sites are split into one entry per site.
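
The comma-splitting step can be sketched as below, assuming the authentication config maps a site string to a credentials dict. That layout, the helper name `split_site_keys`, and the example keys are assumptions, not taken from this page.

```python
def split_site_keys(authentication: dict) -> dict:
    """Expand keys like "a.com, b.com" into one entry per site.

    Each site from a shared key points to the same credentials object.
    """
    expanded = {}
    for sites, credentials in authentication.items():
        for site in sites.split(","):
            site = site.strip()
            if site:  # skip empty pieces from stray commas
                expanded[site] = credentials
    return expanded
```

For example, a single entry keyed `"example.com, example.org"` would become two entries, one per site, that share the same credentials.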

property all_modules: List[Type[core.base_module.BaseModule]]