core
====

.. py:module:: core

.. autoapi-nested-parse::

   Core modules that handle orchestration, metadata, and configuration.



Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/core/base_module/index
   /autoapi/core/config/index
   /autoapi/core/consts/index
   /autoapi/core/database/index
   /autoapi/core/enricher/index
   /autoapi/core/extractor/index
   /autoapi/core/feeder/index
   /autoapi/core/formatter/index
   /autoapi/core/media/index
   /autoapi/core/metadata/index
   /autoapi/core/module/index
   /autoapi/core/orchestrator/index
   /autoapi/core/storage/index
   /autoapi/core/validators/index




Package Contents
----------------

.. py:class:: Metadata

   .. py:attribute:: status
      :type:  str
      :value: 'no archiver'



   .. py:attribute:: metadata
      :type:  Dict[str, Any]


   .. py:attribute:: media
      :type:  List[core.media.Media]
      :value: []



   .. py:method:: merge(right: Metadata, overwrite_left=True) -> Metadata

      Merges another `Metadata` instance into this one.

      Conflicts are resolved based on the `overwrite_left` flag:

      - If `True`, this instance's values are overwritten by those from `right`.
      - If `False`, the inverse applies.



   .. py:method:: store(storages=[])


   .. py:method:: set(key: str, val: Any) -> Metadata


   .. py:method:: append(key: str, val: Any) -> Metadata


   .. py:method:: get(key: str, default: Any = None, create_if_missing=False) -> Union[Metadata, str]


   .. py:method:: success(context: str = None) -> Metadata


   .. py:method:: is_success() -> bool


   .. py:method:: is_empty() -> bool


   .. py:property:: netloc
      :type: str



   .. py:method:: set_url(url: str) -> Metadata


   .. py:method:: get_url() -> str


   .. py:method:: set_content(content: str) -> Metadata


   .. py:method:: set_title(title: str) -> Metadata


   .. py:method:: get_title() -> str


   .. py:method:: set_timestamp(timestamp: datetime.datetime) -> Metadata


   .. py:method:: get_timestamp(utc=True, iso=True) -> datetime.datetime | str | None


   .. py:method:: add_media(media: core.media.Media, id: str = None) -> Metadata


   .. py:method:: get_media_by_id(id: str, default=None) -> core.media.Media


   .. py:method:: remove_duplicate_media_by_hash() -> None


   .. py:method:: get_first_image(default=None) -> core.media.Media


   .. py:method:: set_final_media(final: core.media.Media) -> Metadata

      Sets the final media, a special type of media: if only one media item can be shown, this is the one. It is useful for some databases, such as GsheetDb.



   .. py:method:: get_final_media() -> core.media.Media


   .. py:method:: get_all_media() -> List[core.media.Media]


   .. py:method:: choose_most_complete(results: List[Metadata]) -> Metadata
      :staticmethod:



   .. py:method:: set_context(key: str, val: Any) -> Metadata


   .. py:method:: get_context(key: str, default: Any = None) -> Any
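
The conflict handling described for `merge` can be illustrated with a minimal sketch. `MiniMetadata` is a hypothetical stand-in written for this example, not the library's implementation; it models only the key/value store, and it assumes that "the inverse" means this instance's values win when `overwrite_left` is `False`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MiniMetadata:
    """Hypothetical stand-in for Metadata; it models only the key/value store."""

    metadata: Dict[str, Any] = field(default_factory=dict)

    def set(self, key: str, val: Any) -> "MiniMetadata":
        self.metadata[key] = val
        return self  # returning self allows chained calls

    def merge(self, right: "MiniMetadata", overwrite_left: bool = True) -> "MiniMetadata":
        if overwrite_left:
            # On conflicting keys, values from `right` win.
            self.metadata = {**self.metadata, **right.metadata}
        else:
            # Assumed reading of "the inverse": on conflicts, this instance's values win.
            self.metadata = {**right.metadata, **self.metadata}
        return self


right = MiniMetadata().set("title", "right title").set("author", "someone")

overwritten = MiniMetadata().set("title", "left title").merge(right, overwrite_left=True)
kept = MiniMetadata().set("title", "left title").merge(right, overwrite_left=False)

print(overwritten.metadata["title"])  # right title
print(kept.metadata["title"])         # left title
```

In both cases the non-conflicting `author` key from `right` ends up in the merged result; only the treatment of the shared `title` key differs.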


.. py:class:: Media

   Represents a media file with associated properties and storage details.

   Attributes:

   - filename: The file path of the media as saved locally (temporarily, before uploading to the storage).
   - urls: A list of URLs where the media is stored or accessible.
   - properties: Additional metadata or transformations for the media.
   - _mimetype: The media's mimetype (e.g., image/jpeg, video/mp4).


   .. py:attribute:: filename
      :type:  str


   .. py:attribute:: urls
      :type:  List[str]
      :value: []



   .. py:attribute:: properties
      :type:  dict


   .. py:method:: store(metadata: Any, url: str = 'url-not-available', storages: List[Any] = None) -> None


   .. py:method:: all_inner_media(include_self=False) -> Iterator[Media]

      Retrieves all media, including nested media within properties or transformations on original media.
      This function returns a generator for all the inner media.




   .. py:method:: is_stored(in_storage) -> bool


   .. py:property:: key
      :type: str



   .. py:method:: set(key: str, value: Any) -> Media


   .. py:method:: get(key: str, default: Any = None) -> Any


   .. py:method:: add_url(url: str) -> None


   .. py:property:: mimetype
      :type: str



   .. py:method:: is_video() -> bool


   .. py:method:: is_audio() -> bool


   .. py:method:: is_image() -> bool


   .. py:method:: is_valid_video() -> bool
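
The generator behaviour described for `all_inner_media` can be sketched as a recursive walk over `properties`. `MiniMedia` is a hypothetical stand-in, and the rule used here (nested media are property values that are either a media object or a list of them) is an assumption made for illustration, not the library's confirmed logic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class MiniMedia:
    """Hypothetical stand-in for Media with only a filename and properties."""

    filename: str
    properties: Dict[str, Any] = field(default_factory=dict)

    def all_inner_media(self, include_self: bool = False) -> Iterator["MiniMedia"]:
        if include_self:
            yield self
        for value in self.properties.values():
            # Assumption: nested media appear directly or inside lists.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, MiniMedia):
                    yield from candidate.all_inner_media(include_self=True)


thumbnail = MiniMedia("thumb.jpg")
frames = [MiniMedia("frame_0.jpg"), MiniMedia("frame_1.jpg")]
video = MiniMedia(
    "video.mp4",
    {"thumbnail": thumbnail, "frames": frames, "duration": 12.5},
)

inner_names = [m.filename for m in video.all_inner_media()]
all_names = [m.filename for m in video.all_inner_media(include_self=True)]
```

Because the method is a generator, nothing is traversed until the caller iterates, which keeps the walk cheap when only the first few items are needed.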


.. py:class:: BaseModule

   Bases: :py:obj:`abc.ABC`


   Base module class. All modules should inherit from this class.

   The exact methods a class implements depend on the type of module it is.
   Any module can, however, have a .setup() method to run setup code
   (e.g. logging in to a site or spinning up a browser).

   See consts.MODULE_TYPES for the types of modules you can create, noting that
   a subclass can be of multiple types. For example, a module that extracts data from
   a website and stores it in a database would be both an 'extractor' and a 'database' module.

   Each module is a Python package, and should have a __manifest__.py file in the
   same directory as the module file. The __manifest__.py specifies module information
   such as name, author, version and dependencies. See DEFAULT_MANIFEST for the
   default manifest structure.



   .. py:attribute:: MODULE_TYPES
      :value: ['feeder', 'extractor', 'enricher', 'database', 'storage', 'formatter']



   .. py:attribute:: config
      :type:  Mapping[str, Any]


   .. py:attribute:: authentication
      :type:  Mapping[str, Mapping[str, str]]


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: module_factory
      :type:  core.module.ModuleFactory


   .. py:attribute:: tmp_dir
      :type:  tempfile.TemporaryDirectory
      :value: None



   .. py:property:: storages
      :type: list



   .. py:method:: config_setup(config: dict)


   .. py:method:: setup()


   .. py:method:: auth_for_site(site: str, extract_cookies=True) -> Mapping[str, Any]

      Returns the authentication information for a given site, which is used to authenticate
      with the site before extracting data. `site` should be the site's domain, e.g. 'twitter.com'.

      :param site: the domain of the site to get authentication information for
      :param extract_cookies: whether or not to extract cookies from the given browser/file and return the cookie jar (disabling this can speed up processing if you don't actually need the cookie jar).

      :returns: authdict dict -> {
          "username": str,
          "password": str,
          "api_key": str,
          "api_secret": str,
          "cookie": str,
          "cookies_file": str,
          "cookies_from_browser": str,
          "cookies_jar": CookieJar
      }

      **Global options:**

      * cookies_from_browser: str - the name of the browser to extract cookies from (e.g. 'chrome', 'firefox'); uses yt-dlp under the hood to extract them

      * cookies_file: str - the path to a cookies file to use for login


      **Currently, the sites dict can have keys of the following types:**

      * username: str - the username to use for login

      * password: str - the password to use for login

      * api_key: str - the API key to use for login

      * api_secret: str - the API secret to use for login

      * cookie: str - a cookie string to use for login (specific to this site)

      * cookies_file: str - the path to a cookies file to use for login (specific to this site)

      * cookies_from_browser: str - the name of the browser to extract cookies from (specific to this site)





   .. py:method:: repr()
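
The relationship between the global options and the per-site keys listed for `auth_for_site` can be sketched as a dictionary lookup. Everything here is an assumption for illustration: `lookup_auth` is a hypothetical helper, the layout of the `authentication` mapping is guessed from the docstring, and the rule that site-specific values take precedence over global ones is not confirmed by the source.

```python
from typing import Any, Dict, Mapping

# The two global options named in the docstring above.
GLOBAL_KEYS = ("cookies_from_browser", "cookies_file")


def lookup_auth(authentication: Mapping[str, Any], site: str) -> Dict[str, Any]:
    """Hypothetical helper: combine global cookie options with one site's entry."""
    # Start from the global options, if present at the top level.
    result = {key: authentication[key] for key in GLOBAL_KEYS if key in authentication}
    # Assumption: site-specific values take precedence over global ones.
    result.update(authentication.get(site, {}))
    return result


authentication = {
    "cookies_file": "global_cookies.txt",
    "twitter.com": {
        "username": "example-user",
        "password": "example-password",
        "cookies_file": "twitter_cookies.txt",
    },
}

twitter_auth = lookup_auth(authentication, "twitter.com")
other_auth = lookup_auth(authentication, "example.org")
```

A site without its own entry falls back to the global cookie options only, which is one plausible way to read the split between the two lists above.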


.. py:class:: Database

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing database modules in the media archiving framework.

   Subclasses must implement the `fetch` and `done` methods to define platform-specific behavior.


   .. py:method:: started(item: auto_archiver.core.Metadata) -> None

      Signals to the database that archival of the given item has started.



   .. py:method:: failed(item: auto_archiver.core.Metadata, reason: str) -> None

      Updates the database to reflect that archival of the given item failed.



   .. py:method:: aborted(item: auto_archiver.core.Metadata) -> None

      Notifies the database that archival was aborted, i.e. the user cancelled after it had started.



   .. py:method:: fetch(item: auto_archiver.core.Metadata) -> Union[auto_archiver.core.Metadata, bool]

      Checks whether the given item has already been archived and, if so, fetches it. Each database should handle its own caching and configuration mechanisms.



   .. py:method:: done(item: auto_archiver.core.Metadata, cached: bool = False) -> None
      :abstractmethod:


      Called when the archival result is ready; the result should be saved to the database.
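
The lifecycle implied by these methods (`fetch`, then `started`, then one of `done`, `failed` or `aborted`) can be sketched with an in-memory stand-in. `InMemoryDb` is hypothetical and does not inherit from the real base class; items are plain dicts keyed by a `"url"` entry, which is an assumption made to keep the example self-contained.

```python
from typing import Dict, Union


class InMemoryDb:
    """Hypothetical database module; items are plain dicts keyed by their "url"."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.archived: Dict[str, dict] = {}

    def started(self, item: dict) -> None:
        self.status[item["url"]] = "started"

    def failed(self, item: dict, reason: str) -> None:
        self.status[item["url"]] = f"failed: {reason}"

    def aborted(self, item: dict) -> None:
        self.status[item["url"]] = "aborted"

    def fetch(self, item: dict) -> Union[dict, bool]:
        # Return the previously archived result, or False when there is none.
        return self.archived.get(item["url"], False)

    def done(self, item: dict, cached: bool = False) -> None:
        self.status[item["url"]] = "done (cached)" if cached else "done"
        self.archived[item["url"]] = item


db = InMemoryDb()
item = {"url": "https://example.com/post/1", "title": "Example"}

first_lookup = db.fetch(item)   # nothing archived yet
db.started(item)
db.done(item)
second_lookup = db.fetch(item)  # now served from the in-memory cache
```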



.. py:class:: Enricher

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing enrichers in the Auto Archiver system.

   Enricher modules must implement the `enrich` method to define their behavior.


   .. py:method:: enrich(to_enrich: auto_archiver.core.Metadata) -> None
      :abstractmethod:


      Enriches a Metadata object with additional information or context.

      Takes the metadata object to enrich as an argument and modifies it in place, returning None.
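
The in-place contract of `enrich` can be sketched as follows. `MiniEnricher` and `WordCountEnricher` are hypothetical names, and a plain dict stands in for the Metadata object; the point is only that the method mutates its argument and returns `None`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class MiniEnricher(ABC):
    """Hypothetical stand-in for the Enricher base class."""

    @abstractmethod
    def enrich(self, to_enrich: Dict[str, Any]) -> None:
        """Modify `to_enrich` in place and return None."""


class WordCountEnricher(MiniEnricher):
    def enrich(self, to_enrich: Dict[str, Any]) -> None:
        content = to_enrich.get("content", "")
        # Add derived information directly to the object that was passed in.
        to_enrich["word_count"] = len(content.split())


item = {"url": "https://example.com/post/1", "content": "three short words"}
result = WordCountEnricher().enrich(item)
```

Since nothing is returned, callers keep using the object they passed in, which lets several enrichers be applied to the same item one after another.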



.. py:class:: Feeder

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing feeders in the media archiving framework.

   Subclasses must implement the `__iter__` method to define platform-specific behavior.
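
A feeder built around `__iter__` might look like the sketch below. `ListFeeder` is a hypothetical class that does not inherit from the real base class, and it yields plain dicts with a `"url"` key as an assumption; the source does not state what type a real feeder yields.

```python
from typing import Dict, Iterator, List


class ListFeeder:
    """Hypothetical feeder that yields one item per URL in a list."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = urls

    def __iter__(self) -> Iterator[Dict[str, str]]:
        for url in self.urls:
            url = url.strip()
            if url:  # skip blank entries
                yield {"url": url}


feeder = ListFeeder(["https://example.com/a", "  ", "https://example.com/b "])
items = list(feeder)
```

Implementing `__iter__` as a generator means the feeder can be consumed with an ordinary `for` loop and items are produced one at a time.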


.. py:class:: Storage

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing storage modules in the media archiving framework.

   Subclasses must implement the `get_cdn_url` and `uploadf` methods to define their behavior.


   .. py:method:: store(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata = None) -> None


   .. py:method:: get_cdn_url(media: auto_archiver.core.Media) -> str
      :abstractmethod:


      Returns the URL of the media object stored in the CDN.



   .. py:method:: uploadf(file: IO[bytes], key: str, **kwargs: dict) -> bool
      :abstractmethod:


      Uploads (or saves) a file to the storage service/location.

      This method should not be called directly, but instead through the 'store' method,
      which sets up the media for storage.



   .. py:method:: upload(media: auto_archiver.core.Media, **kwargs) -> bool

      Uploads a media object to the storage service.

      This method should not be called directly, but instead be called through the 'store' method,
      which sets up the media for storage.



   .. py:method:: set_key(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata) -> None

      Takes the media and, optionally, item information, and generates a key for the media.
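
The division of labour between `store`, `set_key`, `upload` and the two abstract methods can be sketched with stand-in classes. `MiniStorage` and `InMemoryStorage` are hypothetical, media objects are plain dicts, and both the call order inside `store` and the key scheme are assumptions for illustration rather than the library's confirmed behaviour.

```python
import hashlib
import os
import tempfile
from abc import ABC, abstractmethod
from typing import IO, Dict


class MiniStorage(ABC):
    """Hypothetical stand-in for Storage; media objects are plain dicts."""

    def store(self, media: dict, url: str) -> None:
        # Assumed order: pick a key, upload the file, record where it ended up.
        self.set_key(media, url)
        self.upload(media)
        media["urls"].append(self.get_cdn_url(media))

    def set_key(self, media: dict, url: str) -> None:
        # Assumed key scheme: short hash of the source URL plus the file name.
        prefix = hashlib.sha256(url.encode()).hexdigest()[:8]
        media["key"] = f"{prefix}/{os.path.basename(media['filename'])}"

    def upload(self, media: dict) -> bool:
        with open(media["filename"], "rb") as f:
            return self.uploadf(f, media["key"])

    @abstractmethod
    def get_cdn_url(self, media: dict) -> str: ...

    @abstractmethod
    def uploadf(self, file: IO[bytes], key: str, **kwargs) -> bool: ...


class InMemoryStorage(MiniStorage):
    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get_cdn_url(self, media: dict) -> str:
        return f"https://cdn.example.com/{media['key']}"

    def uploadf(self, file: IO[bytes], key: str, **kwargs) -> bool:
        self.blobs[key] = file.read()
        return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photo.jpg")
    with open(path, "wb") as f:
        f.write(b"fake image bytes")

    media = {"filename": path, "urls": []}
    storage = InMemoryStorage()
    storage.store(media, url="https://example.com/post/1")
```

A concrete storage only supplies `get_cdn_url` and `uploadf`; callers go through `store`, which matches the note above that the upload methods should not be called directly.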



.. py:class:: Extractor

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing extractors in the media archiving framework.
   Subclasses must implement the `download` method to define platform-specific behavior.


   .. py:attribute:: valid_url
      :type:  re.Pattern
      :value: None



   .. py:method:: cleanup() -> None

      Called when extractors are done, or upon errors, to clean up any resources.



   .. py:method:: sanitize_url(url: str) -> str

      Used to strip unnecessary URL parameters or to unfurl redirect links.



   .. py:method:: match_link(url: str) -> re.Match

      Returns a match object if the given URL matches the valid_url pattern, or False/None if it does not.

      Normally used in the `suitable` method to check if the URL is supported by this extractor.




   .. py:method:: suitable(url: str) -> bool

      Returns True if this extractor can handle the given URL.

      Should be overridden by subclasses.




   .. py:method:: download_from_url(url: str, to_filename: str = None, verbose=True) -> str

      Downloads a URL to the provided filename, or to a filename inferred from the URL, and returns the local filename.



   .. py:method:: download(item: auto_archiver.core.Metadata) -> auto_archiver.core.Metadata | False
      :abstractmethod:


      Downloads the media from the given URL and returns a Metadata object with the downloaded media.

      If the URL is not supported or the download fails, this method should return False.
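
How `valid_url`, `match_link`, `suitable` and `download` fit together can be sketched with a made-up site. `ExamplePostExtractor` is hypothetical and does not inherit from the real base class; items are plain dicts, the URL pattern is invented, and no network request is made.

```python
import re
from typing import Optional, Union


class ExamplePostExtractor:
    """Hypothetical extractor for a made-up site; items are plain dicts."""

    valid_url = re.compile(r"https?://(?:www\.)?example\.com/posts/(?P<post_id>\d+)")

    def sanitize_url(self, url: str) -> str:
        # Drop the query string and fragment.
        return url.split("?", 1)[0].split("#", 1)[0]

    def match_link(self, url: str) -> Optional[re.Match]:
        return self.valid_url.match(url)

    def suitable(self, url: str) -> bool:
        return self.match_link(url) is not None

    def download(self, item: dict) -> Union[dict, bool]:
        match = self.match_link(self.sanitize_url(item["url"]))
        if match is None:
            return False  # unsupported URL
        # A real extractor would fetch media here; this sketch only records the ID.
        item["post_id"] = match.group("post_id")
        item["status"] = "success"
        return item


extractor = ExamplePostExtractor()
supported = extractor.download({"url": "https://www.example.com/posts/42?utm_source=feed"})
unsupported = extractor.download({"url": "https://example.org/other"})
```

Returning `False` for an unsupported URL mirrors the contract described for `download` above and lets a caller move on to the next extractor.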




.. py:class:: Formatter

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing formatters in the media archiving framework.

   Subclasses must implement the `format` method to define their behavior.


   .. py:method:: format(item: auto_archiver.core.Metadata) -> auto_archiver.core.Media
      :abstractmethod:


      Formats a Metadata object into a user-viewable format (e.g. HTML) and stores it if needed.
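
The idea of turning an item into a user-viewable page can be sketched with the standard library alone. `render_html` is a hypothetical helper that takes a plain dict and returns an HTML string; a real formatter would return a Media object and may store the result, which this sketch does not attempt.

```python
import html
from typing import Any, Dict


def render_html(item: Dict[str, Any]) -> str:
    """Hypothetical helper: render a dict-based item as a small HTML page."""
    # Escape all user-provided values so they cannot inject markup.
    title = html.escape(item.get("title", "untitled"))
    url = html.escape(item.get("url", ""), quote=True)
    media_items = "".join(
        f"<li>{html.escape(name)}</li>" for name in item.get("media", [])
    )
    return (
        f"<html><head><title>{title}</title></head><body>"
        f'<h1>{title}</h1><p><a href="{url}">{url}</a></p>'
        f"<ul>{media_items}</ul></body></html>"
    )


page = render_html(
    {
        "title": "Cats & <dogs>",
        "url": "https://example.com/post/1",
        "media": ["photo.jpg"],
    }
)
```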



