core#

Core modules for handling orchestration, metadata, and configuration.

Submodules#

Package Contents#

class core.Metadata#
status: str = 'no archiver'#
metadata: Dict[str, Any]#
media: List[core.media.Media] = []#
merge(right: Metadata, overwrite_left=True) Metadata#

Merges another Metadata instance into this one.

Conflicts are resolved based on the overwrite_left flag:

- If True, this instance’s values are overwritten by right.
- If False, the inverse applies.
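
As a rough illustration of the overwrite_left rule, the sketch below merges two plain dictionaries that stand in for the metadata field. The helper name merge_dicts is hypothetical, and the real merge may also combine other fields (such as media), which this sketch ignores.

```python
from typing import Any, Dict


def merge_dicts(left: Dict[str, Any], right: Dict[str, Any], overwrite_left: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: merge two dicts following the overwrite_left rule."""
    if overwrite_left:
        # Values from `right` win on conflicting keys.
        return {**left, **right}
    # Values from `left` win on conflicting keys.
    return {**right, **left}


left = {"title": "old title", "url": "https://example.com"}
right = {"title": "new title", "author": "someone"}

merged_overwrite = merge_dicts(left, right, overwrite_left=True)
merged_keep = merge_dicts(left, right, overwrite_left=False)
```

In both cases, keys that appear on only one side are kept; the flag only decides which side wins a conflict.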

store(storages=[])#
set(key: str, val: Any) Metadata#
append(key: str, val: Any) Metadata#
get(key: str, default: Any = None, create_if_missing=False) Metadata | str#
success(context: str = None) Metadata#
is_success() bool#
is_empty() bool#
property netloc: str#
set_url(url: str) Metadata#
get_url() str#
set_content(content: str) Metadata#
set_title(title: str) Metadata#
get_title() str#
set_timestamp(timestamp: datetime.datetime) Metadata#
get_timestamp(utc=True, iso=True) datetime.datetime#
add_media(media: core.media.Media, id: str = None) Metadata#
get_media_by_id(id: str, default=None) core.media.Media#
remove_duplicate_media_by_hash() None#
get_first_image(default=None) core.media.Media#
set_final_media(final: core.media.Media) Metadata#

Final media is a special kind of media: if only one media item can be shown, this is it. It is useful for some databases, such as GsheetDb.

get_final_media() core.media.Media#
get_all_media() List[core.media.Media]#
static choose_most_complete(results: List[Metadata]) Metadata#
set_context(key: str, val: Any) Metadata#
get_context(key: str, default: Any = None) Any#
class core.Media#

Represents a media file with associated properties and storage details.

Attributes:

- filename: The file path of the media.
- key: An optional identifier for the media.
- urls: A list of URLs where the media is stored or accessible.
- properties: Additional metadata or transformations for the media.
- _mimetype: The media’s mimetype (e.g., image/jpeg, video/mp4).

filename: str#
key: str = None#
urls: List[str] = []#
properties: dict#
store(metadata: Any, url: str = 'url-not-available', storages: List[Any] = None) None#
all_inner_media(include_self=False)#

Retrieves all media, including nested media within properties or transformations on original media. This function returns a generator for all the inner media.
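
One way to picture the traversal is a recursive generator over the properties dictionary. The sketch below uses a stand-in class, MediaSketch, and assumes that nested media appear either directly as property values or inside lists; the actual implementation may handle more cases.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class MediaSketch:
    """Stand-in for core.Media with only the fields this sketch needs."""

    filename: str
    properties: Dict[str, Any] = field(default_factory=dict)

    def all_inner_media(self, include_self: bool = False) -> Iterator["MediaSketch"]:
        if include_self:
            yield self
        for value in self.properties.values():
            if isinstance(value, MediaSketch):
                # A nested media object: yield it and anything nested inside it.
                yield from value.all_inner_media(include_self=True)
            elif isinstance(value, list):
                # A list may hold several nested media objects (assumption).
                for entry in value:
                    if isinstance(entry, MediaSketch):
                        yield from entry.all_inner_media(include_self=True)


thumb = MediaSketch("thumb.jpg")
frames = [MediaSketch("frame1.jpg"), MediaSketch("frame2.jpg")]
video = MediaSketch("video.mp4", properties={"thumbnail": thumb, "frames": frames, "duration": 12})

inner = [m.filename for m in video.all_inner_media()]
with_self = [m.filename for m in video.all_inner_media(include_self=True)]
```

Because the result is a generator, callers that need to iterate more than once should collect it into a list first.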

is_stored(in_storage) bool#
set(key: str, value: Any) Media#
get(key: str, default: Any = None) Any#
add_url(url: str) None#
property mimetype: str#
is_video() bool#
is_audio() bool#
is_image() bool#
is_valid_video() bool#
class core.BaseModule#

Bases: abc.ABC

Base module class. All modules should inherit from this class.

The exact methods a class implements depend on the type of module it is. However, any module can have a .setup() method to run setup code (e.g. logging in to a site or spinning up a browser).

See BaseModule.MODULE_TYPES for the types of modules you can create, noting that a subclass can be of multiple types. For example, a module that extracts data from a website and stores it in a database would be both an ‘extractor’ and a ‘database’ module.

Each module is a Python package and should have a __manifest__.py file in the same directory as the module file. The __manifest__.py specifies module information such as name, author, version, and dependencies. See BaseModule._DEFAULT_MANIFEST for the default manifest structure.

MODULE_TYPES = ['feeder', 'extractor', 'enricher', 'database', 'storage', 'formatter']#
config: Mapping[str, Any]#
authentication: Mapping[str, Mapping[str, str]]#
name: str#
tmp_dir: tempfile.TemporaryDirectory = None#
property storages: list#
config_setup(config: dict)#
setup()#
auth_for_site(site: str, extract_cookies=True) Mapping[str, Any]#

Returns the authentication information for a given site. This is used to authenticate with a site before extracting data. The site should be the domain of the site, e.g. ‘twitter.com’

extract_cookies: bool - whether or not to extract cookies from the given browser and return the cookie jar. Disabling this can speed up processing if you don’t actually need the cookie jar.

Currently, the dict can have keys of the following types:

- username: str - the username to use for login
- password: str - the password to use for login
- api_key: str - the API key to use for login
- api_secret: str - the API secret to use for login
- cookie: str - a cookie string to use for login (specific to this site)
- cookies_jar: YoutubeDLCookieJar | http.cookiejar.MozillaCookieJar - a cookie jar compatible with requests (e.g. requests.get(cookies=cookie_jar))
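
As a sketch of how a module might consume this mapping, the hypothetical helper below converts the cookie-related keys into keyword arguments suitable for requests.get. The helper name and the preference for cookies_jar over cookie are assumptions; how the other keys are used depends on the site.

```python
from typing import Any, Dict, Mapping


def request_kwargs_from_auth(auth: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for requests.get from an auth mapping."""
    kwargs: Dict[str, Any] = {}
    if auth.get("cookies_jar") is not None:
        # A cookie jar can be passed directly, e.g. requests.get(url, cookies=jar).
        kwargs["cookies"] = auth["cookies_jar"]
    elif auth.get("cookie"):
        # A raw cookie string is sent as the Cookie header.
        kwargs["headers"] = {"Cookie": auth["cookie"]}
    return kwargs


# In a real module, the mapping would come from self.auth_for_site("example.com").
example_auth = {"cookie": "sessionid=abc123"}
example_kwargs = request_kwargs_from_auth(example_auth)
```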

repr()#
class core.Database#

Bases: auto_archiver.core.BaseModule

Base class for implementing database modules in the media archiving framework.

Subclasses must implement the fetch and done methods to define platform-specific behavior.

started(item: auto_archiver.core.Metadata) None#

Signals the DB that archival of the given item has started.

failed(item: auto_archiver.core.Metadata, reason: str) None#

Updates the DB to record that archival of the item failed.

aborted(item: auto_archiver.core.Metadata) None#

Notifies the DB that archival was aborted because the user cancelled after it started.

fetch(item: auto_archiver.core.Metadata) auto_archiver.core.Metadata | bool#

Checks whether the given item has already been archived and fetches it if so. Each database should handle its own caching and configuration mechanisms.

abstract done(item: auto_archiver.core.Metadata, cached: bool = False) None#

Called when the archival result is ready; the result should be saved to the DB.
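
The fetch and done hooks can be sketched with an in-memory database. The classes below are stand-ins (MetadataSketch, DatabaseSketch) that only mirror the documented signatures; keying the cache by URL and skipping the save for cached results are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Union


class MetadataSketch:
    """Stand-in for core.Metadata with only get_url()."""

    def __init__(self, url: str) -> None:
        self._url = url

    def get_url(self) -> str:
        return self._url


class DatabaseSketch(ABC):
    """Stand-in for core.Database."""

    def fetch(self, item: MetadataSketch) -> Union[MetadataSketch, bool]:
        return False

    @abstractmethod
    def done(self, item: MetadataSketch, cached: bool = False) -> None:
        ...


class InMemoryDb(DatabaseSketch):
    """Hypothetical database that remembers archived items by URL."""

    def __init__(self) -> None:
        self.archived: Dict[str, MetadataSketch] = {}

    def fetch(self, item: MetadataSketch) -> Union[MetadataSketch, bool]:
        # Return the earlier result if this URL was archived before, else False.
        return self.archived.get(item.get_url(), False)

    def done(self, item: MetadataSketch, cached: bool = False) -> None:
        # Results that came from the cache are already stored (assumption).
        if not cached:
            self.archived[item.get_url()] = item


db = InMemoryDb()
post = MetadataSketch("https://example.com/post/1")
before = db.fetch(post)
db.done(post)
after = db.fetch(post)
```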

class core.Enricher#

Bases: auto_archiver.core.BaseModule

Base classes and utilities for enrichers in the Auto-Archiver system.

Enricher modules must implement the enrich method to define their behavior.

abstract enrich(to_enrich: auto_archiver.core.Metadata) None#

Enriches a Metadata object with additional information or context.

Takes the metadata object to enrich as an argument and modifies it in place, returning None.
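
A minimal sketch of the in-place contract follows. MetadataSketch and EnricherSketch are stand-ins that mirror only the documented signatures, and TitleLengthEnricher is a hypothetical example, not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class MetadataSketch:
    """Stand-in for core.Metadata with only set() and get()."""

    def __init__(self) -> None:
        self.metadata: Dict[str, Any] = {}

    def set(self, key: str, val: Any) -> "MetadataSketch":
        self.metadata[key] = val
        return self

    def get(self, key: str, default: Any = None) -> Any:
        return self.metadata.get(key, default)


class EnricherSketch(ABC):
    """Stand-in for core.Enricher."""

    @abstractmethod
    def enrich(self, to_enrich: MetadataSketch) -> None:
        ...


class TitleLengthEnricher(EnricherSketch):
    """Hypothetical enricher that records the length of the title."""

    def enrich(self, to_enrich: MetadataSketch) -> None:
        title = to_enrich.get("title", "")
        # Modify the object in place; nothing is returned.
        to_enrich.set("title_length", len(title))


item = MetadataSketch().set("title", "hello")
result = TitleLengthEnricher().enrich(item)
```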

class core.Feeder#

Bases: auto_archiver.core.BaseModule

Base class for implementing feeders in the media archiving framework.

Subclasses must implement the __iter__ method to define platform-specific behavior.
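
The sketch below shows one possible shape for __iter__. It assumes that a feeder yields one metadata item per URL, which this page does not state explicitly; MetadataSketch, FeederSketch, and ListFeeder are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class MetadataSketch:
    """Stand-in for core.Metadata with only set_url() and get_url()."""

    def __init__(self) -> None:
        self._url = ""

    def set_url(self, url: str) -> "MetadataSketch":
        self._url = url
        return self

    def get_url(self) -> str:
        return self._url


class FeederSketch(ABC):
    """Stand-in for core.Feeder."""

    @abstractmethod
    def __iter__(self) -> Iterator[MetadataSketch]:
        ...


class ListFeeder(FeederSketch):
    """Hypothetical feeder that reads URLs from an in-memory list."""

    def __init__(self, urls: Iterable[str]) -> None:
        self.urls: List[str] = list(urls)

    def __iter__(self) -> Iterator[MetadataSketch]:
        for url in self.urls:
            # Assumption: each yielded item carries the URL to archive.
            yield MetadataSketch().set_url(url)


feeder = ListFeeder(["https://example.com/a", "https://example.com/b"])
fed_urls = [item.get_url() for item in feeder]
```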

class core.Storage#

Bases: auto_archiver.core.BaseModule

Base class for implementing storage modules in the media archiving framework.

Subclasses must implement the get_cdn_url and uploadf methods to define their behavior.

store(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata = None) None#
abstract get_cdn_url(media: auto_archiver.core.Media) str#

Returns the URL of the media object stored in the CDN.

abstract uploadf(file: IO[bytes], key: str, **kwargs: dict) bool#

Uploads (or saves) a file to the storage service/location.

upload(media: auto_archiver.core.Media, **kwargs) bool#
set_key(media: auto_archiver.core.Media, url, metadata: auto_archiver.core.Metadata) None#

Takes the media and, optionally, item information, and generates a key.
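
To illustrate the two abstract methods, here is a sketch of a storage that keeps uploaded bytes in a dictionary. MediaSketch and StorageSketch are stand-ins, and the memory:// URL scheme is invented for the example.

```python
import io
from abc import ABC, abstractmethod
from typing import IO, Dict, Optional


class MediaSketch:
    """Stand-in for core.Media with only filename and key."""

    def __init__(self, filename: str, key: Optional[str] = None) -> None:
        self.filename = filename
        self.key = key


class StorageSketch(ABC):
    """Stand-in for core.Storage."""

    @abstractmethod
    def get_cdn_url(self, media: MediaSketch) -> str:
        ...

    @abstractmethod
    def uploadf(self, file: IO[bytes], key: str, **kwargs) -> bool:
        ...


class InMemoryStorage(StorageSketch):
    """Hypothetical storage that keeps file contents in a dict."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get_cdn_url(self, media: MediaSketch) -> str:
        # The URL scheme here is made up for illustration.
        return f"memory://{media.key}"

    def uploadf(self, file: IO[bytes], key: str, **kwargs) -> bool:
        # Read the whole stream and keep it under the given key.
        self.blobs[key] = file.read()
        return True


storage = InMemoryStorage()
media = MediaSketch("report.txt", key="2024/report.txt")
uploaded = storage.uploadf(io.BytesIO(b"hello"), media.key)
cdn_url = storage.get_cdn_url(media)
```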

class core.Extractor#

Bases: auto_archiver.core.BaseModule

Base class for implementing extractors in the media archiving framework. Subclasses must implement the download method to define platform-specific behavior.

valid_url: re.Pattern = None#
cleanup() None#

Called when extractors are done, or upon errors, to clean up any resources.

sanitize_url(url: str) str#

Used to clean unnecessary URL parameters OR unfurl redirect links

Returns a match object if the given URL matches the valid_url pattern or False/None if not.

Normally used in the suitable method to check if the URL is supported by this extractor.

suitable(url: str) bool#

Returns True if this extractor can handle the given URL

Should be overridden by subclasses

download_from_url(url: str, to_filename: str = None, verbose=True) str#

Downloads a URL to the provided filename, or to a filename inferred from the URL, and returns the local filename.

abstract download(item: auto_archiver.core.Metadata) auto_archiver.core.Metadata | False#

Downloads the media from the given URL and returns a Metadata object with the downloaded media.

If the URL is not supported or the download fails, this method should return False.
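
The following sketch shows how valid_url, suitable, and download might fit together. All classes are stand-ins that mirror only the documented signatures; ExampleExtractor, its URL pattern, and the post_id key are hypothetical, and no network request is made.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Union


class MetadataSketch:
    """Stand-in for core.Metadata with a few of its documented methods."""

    def __init__(self, url: str) -> None:
        self._url = url
        self.status = "no archiver"
        self.metadata: Dict[str, Any] = {}

    def get_url(self) -> str:
        return self._url

    def set(self, key: str, val: Any) -> "MetadataSketch":
        self.metadata[key] = val
        return self

    def success(self, context: Optional[str] = None) -> "MetadataSketch":
        # Stand-in behaviour: just mark the item as successful.
        self.status = "success"
        return self


class ExtractorSketch(ABC):
    """Stand-in for core.Extractor."""

    valid_url: Optional[re.Pattern] = None

    def suitable(self, url: str) -> bool:
        return True

    @abstractmethod
    def download(self, item: MetadataSketch) -> Union[MetadataSketch, bool]:
        ...


class ExampleExtractor(ExtractorSketch):
    """Hypothetical extractor for URLs like https://example.com/post/123."""

    valid_url = re.compile(r"https?://(www\.)?example\.com/post/(?P<id>\d+)")

    def suitable(self, url: str) -> bool:
        return bool(self.valid_url.match(url))

    def download(self, item: MetadataSketch) -> Union[MetadataSketch, bool]:
        match = self.valid_url.match(item.get_url())
        if not match:
            # Unsupported URL: signal failure with False.
            return False
        result = MetadataSketch(item.get_url())
        result.set("post_id", match.group("id"))
        return result.success("example")


extractor = ExampleExtractor()
ok = extractor.download(MetadataSketch("https://example.com/post/42"))
bad = extractor.download(MetadataSketch("https://other.example.org/x"))
```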

class core.Formatter#

Bases: auto_archiver.core.BaseModule

Base class for implementing formatters in the media archiving framework.

Subclasses must implement the format method to define their behavior.

abstract format(item: auto_archiver.core.Metadata) auto_archiver.core.Media#

Formats a Metadata object into a user-viewable format (e.g. HTML) and stores it if needed.
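
As an illustration, the sketch below renders a tiny HTML page to a temporary file and returns it wrapped in a media object. MetadataSketch, MediaSketch, and HtmlFormatterSketch are stand-ins, and the page layout is invented for the example.

```python
import html
import os
import tempfile


class MetadataSketch:
    """Stand-in for core.Metadata with only get_title() and get_url()."""

    def __init__(self, title: str, url: str) -> None:
        self._title = title
        self._url = url

    def get_title(self) -> str:
        return self._title

    def get_url(self) -> str:
        return self._url


class MediaSketch:
    """Stand-in for core.Media with only filename."""

    def __init__(self, filename: str) -> None:
        self.filename = filename


class HtmlFormatterSketch:
    """Hypothetical formatter that writes a minimal HTML summary."""

    def format(self, item: MetadataSketch) -> MediaSketch:
        # Escape user-controlled text before embedding it in HTML.
        title = html.escape(item.get_title())
        url = html.escape(item.get_url())
        page = f'<html><body><h1>{title}</h1><a href="{url}">{url}</a></body></html>'
        fd, path = tempfile.mkstemp(suffix=".html")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(page)
        return MediaSketch(filename=path)


report = HtmlFormatterSketch().format(MetadataSketch("A <b>bold</b> title", "https://example.com/post/1"))
```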