core#

Core modules that handle orchestration, metadata, and configs.

Submodules#

Package Contents#

class core.Metadata#

status: str = 'no archiver'#

metadata: Dict[str, Any]#

media: List[core.media.Media] = []#

merge(right: Metadata, overwrite_left=True) → Metadata#

Merges another Metadata instance into this one.

Conflicts are resolved based on the overwrite_left flag:

- If True, this instance’s values are overwritten by right.
- If False, the inverse applies.
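The overwrite rule can be illustrated with plain dictionaries. This is a minimal sketch, not the library's implementation: `merge_dicts` is a hypothetical helper that stands in for the key/value part of `Metadata.merge`, and it ignores media, status, and any other state the real method may combine.

```python
from typing import Any, Dict


def merge_dicts(left: Dict[str, Any], right: Dict[str, Any], overwrite_left: bool = True) -> Dict[str, Any]:
    """Hypothetical stand-in for the conflict rule described for Metadata.merge."""
    if overwrite_left:
        # Values from `right` win whenever both sides define the same key.
        return {**left, **right}
    # Values from `left` win; `right` only fills in keys that `left` lacks.
    return {**right, **left}


left = {"title": "old title", "url": "https://example.com"}
right = {"title": "new title", "author": "someone"}

merged_overwrite = merge_dicts(left, right, overwrite_left=True)
merged_keep = merge_dicts(left, right, overwrite_left=False)
```

In both cases the non-conflicting keys (`url`, `author`) survive; only the value chosen for `title` differs.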

store(storages=[])#

set(key: str, val: Any) → Metadata#

append(key: str, val: Any) → Metadata#

get(key: str, default: Any = None, create_if_missing=False) → Metadata | str#

success(context: str = None) → Metadata#

is_success() → bool#

is_empty() → bool#

property netloc: str#

set_url(url: str) → Metadata#

get_url() → str#

set_content(content: str) → Metadata#

set_title(title: str) → Metadata#

get_title() → str#

set_timestamp(timestamp: datetime.datetime) → Metadata#

get_timestamp(utc=True, iso=True) → datetime.datetime | str | None#

add_media(media: core.media.Media, id: str = None) → Metadata#

get_media_by_id(id: str, default=None) → core.media.Media#

remove_duplicate_media_by_hash() → None#

get_first_image(default=None) → core.media.Media#

set_final_media(final: core.media.Media) → Metadata#: The final media is a special media item: if only one item can be shown, this is the one. It is useful for some databases, such as GsheetDb.

get_final_media() → core.media.Media#

get_all_media() → List[core.media.Media]#

static choose_most_complete(results: List[Metadata]) → Metadata#

set_context(key: str, val: Any) → Metadata#

get_context(key: str, default: Any = None) → Any#

class core.Media#

Represents a media file with associated properties and storage details.

Attributes:

- filename: The file path of the media as saved locally (temporarily, before uploading to the storage).
- urls: A list of URLs where the media is stored or accessible.
- properties: Additional metadata or transformations for the media.
- _mimetype: The media’s mimetype (e.g., image/jpeg, video/mp4).

filename: str#

urls: List[str] = []#

properties: dict#

store(metadata: Any, url: str = 'url-not-available', storages: List[Any] = None) → None#

all_inner_media(include_self=False) → Iterator[Media]#: Retrieves all media, including nested media within properties or transformations on original media. This function returns a generator for all the inner media.
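The traversal described for all_inner_media can be sketched as a recursive generator. The code below is an illustration under assumptions, not the library's code: `SketchMedia` is a hypothetical stand-in for `core.Media`, and it assumes nested media live either directly in a property value or inside a list-valued property.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class SketchMedia:
    """Hypothetical stand-in for core.Media with only the fields used here."""

    filename: str
    properties: Dict[str, Any] = field(default_factory=dict)

    def all_inner_media(self, include_self: bool = False) -> Iterator["SketchMedia"]:
        if include_self:
            yield self
        for value in self.properties.values():
            # A property may hold a single media object or a list that contains some.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, SketchMedia):
                    # Recurse so media nested inside inner media are reached too.
                    yield from candidate.all_inner_media(include_self=True)


thumb = SketchMedia("thumb.jpg")
screenshot = SketchMedia("screenshot.png", properties={"thumbnails": [thumb]})
video = SketchMedia("video.mp4", properties={"screenshot": screenshot, "duration": 12})

inner_names = [m.filename for m in video.all_inner_media()]
all_names = [m.filename for m in video.all_inner_media(include_self=True)]
```

Because the method is a generator, callers can stop iterating early without walking the whole tree.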

is_stored(in_storage) → bool#

property key: str#

set(key: str, value: Any) → Media#

get(key: str, default: Any = None) → Any#

add_url(url: str) → None#

property mimetype: str#

is_video() → bool#

is_audio() → bool#

is_image() → bool#

is_valid_video() → bool#

class core.BaseModule#

Bases: abc.ABC

Base module class. All modules should inherit from this class.

The exact methods a class implements depend on the type of module it is. However, any module can have a .setup() method to run setup code (e.g. logging in to a site or spinning up a browser).

See consts.MODULE_TYPES for the types of modules you can create, noting that a subclass can be of multiple types. For example, a module that extracts data from a website and stores it in a database would be both an ‘extractor’ and a ‘database’ module.

Each module is a Python package and should have a __manifest__.py file in the same directory as the module file. The __manifest__.py file specifies module information such as name, author, version, and dependencies. See DEFAULT_MANIFEST for the default manifest structure.

MODULE_TYPES = ['feeder', 'extractor', 'enricher', 'database', 'storage', 'formatter']#

config: Mapping[str, Any]#

authentication: Mapping[str, Mapping[str, str]]#

name: str#

module_factory: core.module.ModuleFactory#

tmp_dir: tempfile.TemporaryDirectory = None#

property storages: list#

config_setup(config: dict)#

setup()#

auth_for_site(site: str, extract_cookies=True) → Mapping[str, Any]#

Returns the authentication information for a given site. This is used to authenticate with a site before extracting data. The site argument should be the site’s domain, e.g. ‘twitter.com’.

Parameters:

site – the domain of the site to get authentication information for
extract_cookies – whether or not to extract cookies from the given browser/file and return the cookie jar (disabling this can speed up processing if you don’t actually need the cookie jar).

Returns:

authdict dict -> { “username”: str, “password”: str, “api_key”: str, “api_secret”: str, “cookie”: str, “cookies_file”: str, “cookies_from_browser”: str, “cookies_jar”: CookieJar }

Global options:

cookies_from_browser: str - the name of the browser to extract cookies from (e.g. ‘chrome’, ‘firefox’); uses yt-dlp under the hood to extract the cookies
cookies_file: str - the path to a cookies file to use for login

Currently, each site’s dict can have the following keys:

username: str - the username to use for login
password: str - the password to use for login
api_key: str - the API key to use for login
api_secret: str - the API secret to use for login
cookie: str - a cookie string to use for login (specific to this site)
cookies_file: str - the path to a cookies file to use for login (specific to this site)
cookies_from_browser: str - the name of the browser to extract cookies from (specific to this site)
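To make the shape of the per-site authentication mapping concrete, here is a minimal sketch of a domain lookup. `lookup_auth` is a hypothetical helper, not part of the library: it only returns the matching per-site entry and does not handle the global options or cookie extraction that auth_for_site describes. Stripping a leading "www." is an assumption made for the example.

```python
from typing import Any, Dict, Mapping


def lookup_auth(authentication: Mapping[str, Mapping[str, str]], site: str) -> Dict[str, Any]:
    """Hypothetical helper: return the auth entry for `site`, or {} if none exists."""
    domain = site.lower()
    # Assumption: entries are keyed by bare domain, so a leading "www." is dropped.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # Copy the entry so callers cannot mutate the shared configuration.
    return dict(authentication.get(domain, {}))


# Placeholder credentials for illustration only.
authentication = {
    "twitter.com": {"username": "alice", "password": "not-a-real-password"},
    "example.org": {"api_key": "KEY", "api_secret": "SECRET"},
}

twitter_auth = lookup_auth(authentication, "www.twitter.com")
missing_auth = lookup_auth(authentication, "unknown.example")
```

Returning an empty dict for unknown sites lets callers use `auth.get("username")` without first checking whether the site is configured.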

repr()#

class core.Database#

Bases: auto_archiver.core.BaseModule

Base class for implementing database modules in the media archiving framework.

Subclasses must implement the fetch and done methods to define platform-specific behavior.

started(item: auto_archiver.core.Metadata) → None#: signals the DB that the given item archival has started

failed(item: auto_archiver.core.Metadata, reason: str) → None#: update DB accordingly for failure

aborted(item: auto_archiver.core.Metadata) → None#: abort notification if user cancelled after start

fetch(item: auto_archiver.core.Metadata) → auto_archiver.core.Metadata | bool#: checks whether the given item has already been archived and fetches it if so; each database should handle its own caching and configuration mechanisms

abstract done(item: auto_archiver.core.Metadata, cached: bool = False) → None#: archival result ready - should be saved to DB
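The lifecycle hooks above can be sketched with an in-memory store. Everything here is hypothetical and self-contained: `SketchItem` stands in for `core.Metadata`, `InMemoryDatabase` does not inherit from the real base class, and skipping the save for cached results is an assumption about how `cached` might be used.

```python
from typing import Dict, List, Tuple, Union


class SketchItem:
    """Hypothetical stand-in for core.Metadata, reduced to a URL."""

    def __init__(self, url: str) -> None:
        self.url = url

    def get_url(self) -> str:
        return self.url


class InMemoryDatabase:
    """Hypothetical database module that records results in a dict."""

    def __init__(self) -> None:
        self.archived: Dict[str, SketchItem] = {}
        self.events: List[Tuple[str, str]] = []

    def started(self, item: SketchItem) -> None:
        self.events.append(("started", item.get_url()))

    def failed(self, item: SketchItem, reason: str) -> None:
        self.events.append(("failed", f"{item.get_url()}: {reason}"))

    def aborted(self, item: SketchItem) -> None:
        self.events.append(("aborted", item.get_url()))

    def fetch(self, item: SketchItem) -> Union[SketchItem, bool]:
        # Return the stored result if this URL was archived before, else False.
        return self.archived.get(item.get_url(), False)

    def done(self, item: SketchItem, cached: bool = False) -> None:
        # Assumption: a cached result is already stored, so it is not saved again.
        if not cached:
            self.archived[item.get_url()] = item
        self.events.append(("done", item.get_url()))


db = InMemoryDatabase()
item = SketchItem("https://example.com/post/1")

first_lookup = db.fetch(item)   # nothing archived yet
db.started(item)
db.done(item)
second_lookup = db.fetch(item)  # now served from the in-memory store
```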

class core.Enricher#

Bases: auto_archiver.core.BaseModule

Base classes and utilities for enrichers in the Auto Archiver system.

Enricher modules must implement the enrich method to define their behavior.

abstract enrich(to_enrich: auto_archiver.core.Metadata) → None#

Enriches a Metadata object with additional information or context.

Takes the metadata object to enrich as an argument and modifies it in place, returning None.
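The in-place contract of enrich can be sketched as follows. The classes are hypothetical stand-ins so that the example runs on its own: `SketchMetadata` mimics only the `set`/`get` methods of `core.Metadata`, and `SketchEnricher` mimics only the abstract `enrich` method of the base class.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchMetadata:
    """Hypothetical stand-in for core.Metadata with only set/get."""

    def __init__(self) -> None:
        self.metadata: Dict[str, Any] = {}

    def set(self, key: str, val: Any) -> "SketchMetadata":
        self.metadata[key] = val
        return self

    def get(self, key: str, default: Any = None) -> Any:
        return self.metadata.get(key, default)


class SketchEnricher(ABC):
    """Hypothetical stand-in for the Enricher base class."""

    @abstractmethod
    def enrich(self, to_enrich: SketchMetadata) -> None:
        ...


class TitleLengthEnricher(SketchEnricher):
    """Example enricher that records the length of the title."""

    def enrich(self, to_enrich: SketchMetadata) -> None:
        title = to_enrich.get("title", "")
        # Modify the object in place; nothing is returned.
        to_enrich.set("title_length", len(title))


item = SketchMetadata().set("title", "Hello world")
result = TitleLengthEnricher().enrich(item)
```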

class core.Feeder#

Bases: auto_archiver.core.BaseModule

Base class for implementing feeders in the media archiving framework.

Subclasses must implement the __iter__ method to define platform-specific behavior.
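A feeder's __iter__ method can be sketched as a generator that yields one item per URL. This is an illustration under assumptions: `SketchMetadata` and `ListFeeder` are hypothetical names, and yielding metadata objects with the URL already set is assumed rather than confirmed by the text above.

```python
from typing import Iterator, List


class SketchMetadata:
    """Hypothetical stand-in for core.Metadata with only URL accessors."""

    def __init__(self) -> None:
        self.url = ""

    def set_url(self, url: str) -> "SketchMetadata":
        self.url = url
        return self

    def get_url(self) -> str:
        return self.url


class ListFeeder:
    """Hypothetical feeder that reads URLs from an in-memory list."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = urls

    def __iter__(self) -> Iterator[SketchMetadata]:
        for url in self.urls:
            cleaned = url.strip()
            # Skip blank entries instead of yielding empty items.
            if cleaned:
                yield SketchMetadata().set_url(cleaned)


feeder = ListFeeder(["https://example.com/a", "  ", "https://example.com/b"])
fed_urls = [item.get_url() for item in feeder]
```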

class core.Storage#

Bases: auto_archiver.core.BaseModule

Base class for implementing storage modules in the media archiving framework.

Subclasses must implement the get_cdn_url and uploadf methods to define their behavior.

store(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata = None) → None#

abstract get_cdn_url(media: auto_archiver.core.Media) → str#: Returns the URL of the media object stored in the CDN.

abstract uploadf(file: IO[bytes], key: str, **kwargs: dict) → bool#

Uploads (or saves) a file to the storage service/location.

This method should not be called directly, but instead through the ‘store’ method, which sets up the media for storage.

upload(media: auto_archiver.core.Media, **kwargs) → bool#

Uploads a media object to the storage service.

This method should not be called directly, but instead be called through the ‘store’ method, which sets up the media for storage.

set_key(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata) → None#: takes the media and optionally item info and generates a key
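The relationship between store, set_key, upload, and uploadf can be sketched with a storage that keeps bytes in a dict. This is a guess at the call order based on the descriptions above, not the library's implementation: `SketchMedia` and `DictStorage` are hypothetical, the key scheme (a short hash of the source URL plus the file extension) is invented for the example, and `memory://` is a made-up URL scheme.

```python
import hashlib
import os
import tempfile
from typing import IO, Any, Dict, List, Optional


class SketchMedia:
    """Hypothetical stand-in for core.Media."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.key: Optional[str] = None
        self.urls: List[str] = []


class DictStorage:
    """Hypothetical storage module that keeps uploaded bytes in memory."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get_cdn_url(self, media: SketchMedia) -> str:
        return f"memory://{media.key}"

    def uploadf(self, file: IO[bytes], key: str, **kwargs: Any) -> bool:
        self.blobs[key] = file.read()
        return True

    def set_key(self, media: SketchMedia, url: str, metadata: Any = None) -> None:
        # Assumption: the key is a short hash of the source URL plus the extension.
        ext = os.path.splitext(media.filename)[1]
        media.key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12] + ext

    def upload(self, media: SketchMedia, **kwargs: Any) -> bool:
        with open(media.filename, "rb") as f:
            return self.uploadf(f, media.key, **kwargs)

    def store(self, media: SketchMedia, url: str, metadata: Any = None) -> None:
        # store() prepares the media (key), uploads it, then records its URL.
        self.set_key(media, url, metadata)
        if self.upload(media):
            media.urls.append(self.get_cdn_url(media))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clip.mp4")
    with open(path, "wb") as f:
        f.write(b"fake video bytes")
    media = SketchMedia(path)
    storage = DictStorage()
    storage.store(media, url="https://example.com/post/1")
```

Callers only touch `store`; `upload` and `uploadf` depend on the key that `store` sets first, which is consistent with the note that they should not be called directly.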

class core.Extractor#

Bases: auto_archiver.core.BaseModule

Base class for implementing extractors in the media archiving framework. Subclasses must implement the download method to define platform-specific behavior.

valid_url: re.Pattern = None#

cleanup() → None#: Called when extractors are done or upon errors; cleans up any resources

sanitize_url(url: str) → str#: Used to clean unnecessary URL parameters OR unfurl redirect links

match_link(url: str) → re.Match#

Returns a match object if the given URL matches the valid_url pattern or False/None if not.

Normally used in the suitable method to check if the URL is supported by this extractor.

suitable(url: str) → bool#

Returns True if this extractor can handle the given URL.

Should be overridden by subclasses.

download_from_url(url: str, to_filename: str = None, verbose=True) → str#: downloads a URL to the provided filename, or to one inferred from the URL, and returns the local filename

abstract download(item: auto_archiver.core.Metadata) → auto_archiver.core.Metadata | False#

Downloads the media from the given URL and returns a Metadata object with the downloaded media.

If the URL is not supported or the download fails, this method should return False.
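How valid_url, match_link, suitable, and download fit together can be sketched as follows. The classes are hypothetical stand-ins that do not import the library: `SketchMetadata` mimics a few `core.Metadata` methods (the status string written by `success` is an assumption), and `ExampleExtractor` performs no network access; it only extracts a post ID from the URL.

```python
import re
from typing import Any, Dict, Optional, Union


class SketchMetadata:
    """Hypothetical stand-in for core.Metadata."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.status = "no archiver"
        self.metadata: Dict[str, Any] = {}

    def get_url(self) -> str:
        return self.url

    def set(self, key: str, val: Any) -> "SketchMetadata":
        self.metadata[key] = val
        return self

    def success(self, context: Optional[str] = None) -> "SketchMetadata":
        # Assumption: success is recorded in the status string.
        self.status = f"{context}: success" if context else "success"
        return self


class ExampleExtractor:
    """Hypothetical extractor for URLs of the form example.com/post/<id>."""

    valid_url = re.compile(r"https?://(?:www\.)?example\.com/post/(\d+)")

    def match_link(self, url: str) -> Optional[re.Match]:
        return self.valid_url.match(url)

    def suitable(self, url: str) -> bool:
        return self.match_link(url) is not None

    def download(self, item: SketchMetadata) -> Union[SketchMetadata, bool]:
        match = self.match_link(item.get_url())
        if not match:
            # Unsupported URL: signal failure with False.
            return False
        item.set("post_id", match.group(1))
        return item.success("example")


extractor = ExampleExtractor()
ok = extractor.download(SketchMetadata("https://example.com/post/42"))
rejected = extractor.download(SketchMetadata("https://other.example/post/42"))
```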

class core.Formatter#

Bases: auto_archiver.core.BaseModule

Base class for implementing formatters in the media archiving framework.

Subclasses must implement the format method to define their behavior.

abstract format(item: auto_archiver.core.Metadata) → auto_archiver.core.Media#: Formats a Metadata object into a user-viewable format (e.g. HTML) and stores it if needed.
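A formatter that renders a minimal HTML page can be sketched as follows. The names `SketchMetadata`, `SketchMedia`, and `HtmlFormatter` are hypothetical stand-ins, and writing a single `index.html` into a given directory is an assumption made to keep the example self-contained.

```python
import html
import os
import tempfile


class SketchMetadata:
    """Hypothetical stand-in for core.Metadata with a title and URL."""

    def __init__(self, title: str, url: str) -> None:
        self.title = title
        self.url = url


class SketchMedia:
    """Hypothetical stand-in for core.Media."""

    def __init__(self, filename: str) -> None:
        self.filename = filename


class HtmlFormatter:
    """Hypothetical formatter that writes one HTML file per item."""

    def __init__(self, out_dir: str) -> None:
        self.out_dir = out_dir

    def format(self, item: SketchMetadata) -> SketchMedia:
        # Escape user-controlled values before embedding them in HTML.
        title = html.escape(item.title)
        url = html.escape(item.url, quote=True)
        page = f'<html><body><h1>{title}</h1><p><a href="{url}">{url}</a></p></body></html>'
        path = os.path.join(self.out_dir, "index.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(page)
        # The rendered page is returned as a media object of its own.
        return SketchMedia(path)


with tempfile.TemporaryDirectory() as out_dir:
    item = SketchMetadata("Cats & dogs", "https://example.com/post/1")
    media = HtmlFormatter(out_dir).format(item)
    with open(media.filename, encoding="utf-8") as f:
        page_text = f.read()
```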