core
====

.. py:module:: core

.. autoapi-nested-parse::

   Core modules that handle orchestration, metadata, and configuration.



Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/core/base_module/index
   /autoapi/core/config/index
   /autoapi/core/database/index
   /autoapi/core/enricher/index
   /autoapi/core/extractor/index
   /autoapi/core/feeder/index
   /autoapi/core/formatter/index
   /autoapi/core/media/index
   /autoapi/core/metadata/index
   /autoapi/core/module/index
   /autoapi/core/orchestrator/index
   /autoapi/core/storage/index
   /autoapi/core/validators/index




Package Contents
----------------

.. py:class:: Metadata

   .. py:attribute:: status
      :type:  str
      :value: 'no archiver'



   .. py:attribute:: metadata
      :type:  Dict[str, Any]


   .. py:attribute:: media
      :type:  List[core.media.Media]
      :value: []



   .. py:method:: merge(right: Metadata, overwrite_left=True) -> Metadata

      Merges another `Metadata` instance into this one.

      Conflicts are resolved based on the `overwrite_left` flag:

      - If `True`, this instance's values are overwritten by those of `right`.
      - If `False`, the inverse applies: this instance's values take precedence over those of `right`.



   .. py:method:: store(storages=[])


   .. py:method:: set(key: str, val: Any) -> Metadata


   .. py:method:: append(key: str, val: Any) -> Metadata


   .. py:method:: get(key: str, default: Any = None, create_if_missing=False) -> Union[Metadata, str]


   .. py:method:: success(context: str = None) -> Metadata


   .. py:method:: is_success() -> bool


   .. py:method:: is_empty() -> bool


   .. py:property:: netloc
      :type: str



   .. py:method:: set_url(url: str) -> Metadata


   .. py:method:: get_url() -> str


   .. py:method:: set_content(content: str) -> Metadata


   .. py:method:: set_title(title: str) -> Metadata


   .. py:method:: get_title() -> str


   .. py:method:: set_timestamp(timestamp: datetime.datetime) -> Metadata


   .. py:method:: get_timestamp(utc=True, iso=True) -> datetime.datetime


   .. py:method:: add_media(media: core.media.Media, id: str = None) -> Metadata


   .. py:method:: get_media_by_id(id: str, default=None) -> core.media.Media


   .. py:method:: remove_duplicate_media_by_hash() -> None


   .. py:method:: get_first_image(default=None) -> core.media.Media


   .. py:method:: set_final_media(final: core.media.Media) -> Metadata

      The final media is a special media item: the single one to display when only one can be shown. It is useful for some databases, such as GsheetDb.



   .. py:method:: get_final_media() -> core.media.Media


   .. py:method:: get_all_media() -> List[core.media.Media]


   .. py:method:: choose_most_complete(results: List[Metadata]) -> Metadata
      :staticmethod:



   .. py:method:: set_context(key: str, val: Any) -> Metadata


   .. py:method:: get_context(key: str, default: Any = None) -> Any
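
The conflict rule documented for `Metadata.merge` can be illustrated with plain dictionaries. This is a minimal sketch, not the library's implementation: `merge_values` is a hypothetical helper, it covers only key/value pairs, and it makes no claim about how the real method combines `media` or `status`.

```python
from typing import Any, Dict


def merge_values(
    left: Dict[str, Any], right: Dict[str, Any], overwrite_left: bool = True
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented `overwrite_left` rule."""
    if overwrite_left:
        # Values from `right` replace conflicting values from `left`.
        return {**left, **right}
    # Inverse: values from `left` win on conflict.
    return {**right, **left}


left = {"title": "old title", "url": "https://example.com"}
right = {"title": "new title", "author": "someone"}

merged_overwrite = merge_values(left, right, overwrite_left=True)
merged_keep = merge_values(left, right, overwrite_left=False)
```

In both cases the non-conflicting keys (`url`, `author`) survive; only the value of `title` depends on the flag.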


.. py:class:: Media

   Represents a media file with associated properties and storage details.

   Attributes:

   - filename: The file path of the media.
   - key: An optional identifier for the media.
   - urls: A list of URLs where the media is stored or accessible.
   - properties: Additional metadata or transformations for the media.
   - _mimetype: The media's mimetype (e.g., image/jpeg, video/mp4).


   .. py:attribute:: filename
      :type:  str


   .. py:attribute:: key
      :type:  str
      :value: None



   .. py:attribute:: urls
      :type:  List[str]
      :value: []



   .. py:attribute:: properties
      :type:  dict


   .. py:method:: store(metadata: Any, url: str = 'url-not-available', storages: List[Any] = None) -> None


   .. py:method:: all_inner_media(include_self=False)

      Retrieves all media, including nested media within properties or transformations on original media.
      This function returns a generator for all the inner media.




   .. py:method:: is_stored(in_storage) -> bool


   .. py:method:: set(key: str, value: Any) -> Media


   .. py:method:: get(key: str, default: Any = None) -> Any


   .. py:method:: add_url(url: str) -> None


   .. py:property:: mimetype
      :type: str



   .. py:method:: is_video() -> bool


   .. py:method:: is_audio() -> bool


   .. py:method:: is_image() -> bool


   .. py:method:: is_valid_video() -> bool
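
The `is_video`, `is_audio` and `is_image` helpers suggest a check on the top-level part of the mimetype. The following is a minimal sketch of that idea using only the standard library `mimetypes` module. `media_kind` is a hypothetical function, and the real `Media` class may determine its mimetype differently.

```python
import mimetypes


def media_kind(filename: str) -> str:
    """Return the top-level mimetype part ('video', 'audio', 'image', ...)
    guessed from the file extension, or 'unknown' if nothing can be guessed."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    if mimetype is None:
        return "unknown"
    return mimetype.split("/", 1)[0]


kinds = {
    name: media_kind(name)
    for name in ("clip.mp4", "photo.jpg", "sound.mp3", "no_extension")
}
```

Guessing from the extension says nothing about whether the file content is valid, which is presumably why a separate `is_valid_video` check exists.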


.. py:class:: BaseModule

   Bases: :py:obj:`abc.ABC`


   Base module class. All modules should inherit from this class.

   The exact methods a class implements depend on the type of module it is.
   However, any module can have a .setup() method to run setup code
   (e.g. logging in to a site or spinning up a browser).

   See BaseModule.MODULE_TYPES for the types of modules you can create, noting that
   a subclass can be of multiple types. For example, a module that extracts data from
   a website and stores it in a database would be both an 'extractor' and a 'database' module.

   Each module is a Python package, and should have a __manifest__.py file in the
   same directory as the module file. The __manifest__.py specifies the module information
   like name, author, version, dependencies etc. See BaseModule._DEFAULT_MANIFEST for the
   default manifest structure.



   .. py:attribute:: MODULE_TYPES
      :value: ['feeder', 'extractor', 'enricher', 'database', 'storage', 'formatter']



   .. py:attribute:: config
      :type:  Mapping[str, Any]


   .. py:attribute:: authentication
      :type:  Mapping[str, Mapping[str, str]]


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: tmp_dir
      :type:  tempfile.TemporaryDirectory
      :value: None



   .. py:property:: storages
      :type: list



   .. py:method:: config_setup(config: dict)


   .. py:method:: setup()


   .. py:method:: auth_for_site(site: str, extract_cookies=True) -> Mapping[str, Any]

      Returns the authentication information for a given site. This is used to authenticate
      with a site before extracting data. The site should be the domain of the site, e.g. 'twitter.com'.

      extract_cookies: bool - whether to extract cookies from the given browser and return the
      cookie jar. Disabling this can speed up processing if you don't actually need the cookie jar.

      Currently, the returned dict can have the following keys:

      - username: str - the username to use for login
      - password: str - the password to use for login
      - api_key: str - the API key to use for login
      - api_secret: str - the API secret to use for login
      - cookie: str - a cookie string to use for login (specific to this site)
      - cookies_jar: YoutubeDLCookieJar | http.cookiejar.MozillaCookieJar - a cookie jar compatible with requests (e.g. `requests.get(cookies=cookie_jar)`)



   .. py:method:: repr()
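
The `authentication` attribute maps a site domain to a mapping of credential fields, and `auth_for_site` returns the entry for one site. A minimal sketch of that lookup follows. `lookup_auth` and the sample credentials are hypothetical, and the real method may do more (for example, extracting browser cookies).

```python
from typing import Dict, Mapping

# Hypothetical configuration in the shape of the `authentication` attribute:
# site domain -> credential fields.
AUTHENTICATION: Mapping[str, Mapping[str, str]] = {
    "twitter.com": {"username": "alice", "password": "not-a-real-password"},
    "example.com": {"api_key": "abc123", "api_secret": "xyz789"},
}


def lookup_auth(
    authentication: Mapping[str, Mapping[str, str]], site: str
) -> Dict[str, str]:
    """Return a copy of the credentials for `site`, or an empty dict."""
    return dict(authentication.get(site, {}))


twitter_auth = lookup_auth(AUTHENTICATION, "twitter.com")
unknown_auth = lookup_auth(AUTHENTICATION, "unknown.org")
```

Returning an empty dict for unknown sites lets callers use `auth.get("api_key")` without first checking whether the site is configured.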


.. py:class:: Database

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing database modules in the media archiving framework.

   Subclasses must implement the `fetch` and `done` methods to define platform-specific behavior.


   .. py:method:: started(item: auto_archiver.core.Metadata) -> None

      Signals to the database that archival of the given item has started.



   .. py:method:: failed(item: auto_archiver.core.Metadata, reason: str) -> None

      Updates the database to reflect that archival of the given item failed.



   .. py:method:: aborted(item: auto_archiver.core.Metadata) -> None

      Notifies the database that the user cancelled the archival after it had started.



   .. py:method:: fetch(item: auto_archiver.core.Metadata) -> Union[auto_archiver.core.Metadata, bool]

      Checks whether the given item has already been archived and, if so, fetches it. Each database should handle its own caching and configuration mechanisms.



   .. py:method:: done(item: auto_archiver.core.Metadata, cached: bool = False) -> None
      :abstractmethod:


      Called when the archival result is ready; the result should be saved to the database.
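
The lifecycle hooks above (`started`, `fetch`, `done`, `failed`, `aborted`) can be sketched with an in-memory stand-in. This is an illustration under assumptions, not a real subclass: `InMemoryDatabaseSketch` does not inherit from `Database`, and items are plain dicts with a `url` key instead of `Metadata` objects.

```python
from typing import Any, Dict, Union

Item = Dict[str, Any]  # stand-in for Metadata in this sketch


class InMemoryDatabaseSketch:
    """Hypothetical database that keeps everything in dictionaries."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.archived: Dict[str, Item] = {}

    def started(self, item: Item) -> None:
        self.status[item["url"]] = "started"

    def failed(self, item: Item, reason: str) -> None:
        self.status[item["url"]] = f"failed: {reason}"

    def aborted(self, item: Item) -> None:
        self.status[item["url"]] = "aborted"

    def fetch(self, item: Item) -> Union[Item, bool]:
        # Return the previously archived item, or False if there is none.
        return self.archived.get(item["url"], False)

    def done(self, item: Item, cached: bool = False) -> None:
        self.archived[item["url"]] = item
        self.status[item["url"]] = "done (cached)" if cached else "done"


db = InMemoryDatabaseSketch()
item = {"url": "https://example.com/post/1"}

first_lookup = db.fetch(item)   # nothing archived yet
db.started(item)
db.done(item)
second_lookup = db.fetch(item)  # now returns the stored item
```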



.. py:class:: Enricher

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing enrichers in the Auto-Archiver system.

   Enricher modules must implement the `enrich` method to define their behavior.


   .. py:method:: enrich(to_enrich: auto_archiver.core.Metadata) -> None
      :abstractmethod:


      Enriches a Metadata object with additional information or context.

      Takes the metadata object to enrich as an argument and modifies it in place, returning None.
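
The in-place contract of `enrich` (mutate the argument, return `None`) can be sketched as follows. `ContentLengthEnricherSketch` is a hypothetical stand-in that does not inherit from `Enricher`, and it operates on a plain dict instead of a `Metadata` object.

```python
from typing import Any, Dict


class ContentLengthEnricherSketch:
    """Hypothetical enricher that records the length of the item's content."""

    def enrich(self, to_enrich: Dict[str, Any]) -> None:
        content = to_enrich.get("content", "")
        # Modify the item in place; nothing is returned.
        to_enrich["content_length"] = len(content)


item = {"url": "https://example.com/post/1", "content": "hello world"}
result = ContentLengthEnricherSketch().enrich(item)
```

Because the item is modified in place, the caller keeps using the same object after the call, and `result` is `None`.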



.. py:class:: Feeder

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing feeders in the media archiving framework.

   Subclasses must implement the `__iter__` method to define platform-specific behavior.
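
A feeder is consumed by iterating over it. The sketch below assumes that `__iter__` yields one item per URL to archive; the source does not state the yielded type, so plain dicts stand in for it here. `ListFeederSketch` is hypothetical and does not inherit from `Feeder`.

```python
from typing import Dict, Iterator, List


class ListFeederSketch:
    """Hypothetical feeder that yields one item per URL in a fixed list."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = urls

    def __iter__(self) -> Iterator[Dict[str, str]]:
        for url in self.urls:
            # Assumption: each yielded item carries at least the URL.
            yield {"url": url}


feeder = ListFeederSketch(
    ["https://example.com/post/1", "https://example.com/post/2"]
)
items = list(feeder)
```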


.. py:class:: Storage

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing storage modules in the media archiving framework.

   Subclasses must implement the `get_cdn_url` and `uploadf` methods to define their behavior.


   .. py:method:: store(media: auto_archiver.core.Media, url: str, metadata: auto_archiver.core.Metadata = None) -> None


   .. py:method:: get_cdn_url(media: auto_archiver.core.Media) -> str
      :abstractmethod:


      Returns the URL of the media object stored in the CDN.



   .. py:method:: uploadf(file: IO[bytes], key: str, **kwargs: dict) -> bool
      :abstractmethod:


      Uploads (or saves) a file to the storage service/location.



   .. py:method:: upload(media: auto_archiver.core.Media, **kwargs) -> bool


   .. py:method:: set_key(media: auto_archiver.core.Media, url, metadata: auto_archiver.core.Metadata) -> None

      Takes the media and, optionally, item information, and generates a key.
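
The two abstract methods, `uploadf` and `get_cdn_url`, can be sketched with an in-memory stand-in. This is an illustration under assumptions: `InMemoryStorageSketch` does not inherit from `Storage`, the base URL is made up, and `get_cdn_url` takes a key string here whereas the real method takes a `Media` object.

```python
import io
from typing import IO, Dict


class InMemoryStorageSketch:
    """Hypothetical storage that keeps uploaded bytes in a dictionary."""

    def __init__(self, base_url: str = "https://cdn.example.com") -> None:
        self.base_url = base_url
        self.files: Dict[str, bytes] = {}

    def uploadf(self, file: IO[bytes], key: str, **kwargs) -> bool:
        # Read the whole file object and store it under the given key.
        self.files[key] = file.read()
        return True

    def get_cdn_url(self, key: str) -> str:
        # Simplification: build the URL from the key alone.
        return f"{self.base_url}/{key}"


storage = InMemoryStorageSketch()
uploaded = storage.uploadf(io.BytesIO(b"fake image bytes"), key="media/photo.jpg")
cdn_url = storage.get_cdn_url("media/photo.jpg")
```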



.. py:class:: Extractor

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing extractors in the media archiving framework.
   Subclasses must implement the `download` method to define platform-specific behavior.


   .. py:attribute:: valid_url
      :type:  re.Pattern
      :value: None



   .. py:method:: cleanup() -> None

      Called when extractors are done, or upon errors, to clean up any resources.



   .. py:method:: sanitize_url(url: str) -> str

      Used to strip unnecessary URL parameters or to unfurl redirect links.



   .. py:method:: match_link(url: str) -> re.Match

      Returns a match object if the given URL matches the valid_url pattern, or False/None if it does not.

      Normally used in the `suitable` method to check if the URL is supported by this extractor.




   .. py:method:: suitable(url: str) -> bool

      Returns True if this extractor can handle the given URL.

      Should be overridden by subclasses.




   .. py:method:: download_from_url(url: str, to_filename: str = None, verbose=True) -> str

      Downloads a URL to the provided filename, or to one inferred from the URL, and returns the local filename.



   .. py:method:: download(item: auto_archiver.core.Metadata) -> auto_archiver.core.Metadata | False
      :abstractmethod:


      Downloads the media from the given URL and returns a Metadata object with the downloaded media.

      If the URL is not supported or the download fails, this method should return False.
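
The interplay of `valid_url`, `match_link`, `suitable` and `download` can be sketched as follows. `ExamplePostExtractorSketch`, its URL pattern and the `post_id` field are hypothetical; the class does not inherit from `Extractor`, and plain dicts stand in for `Metadata`.

```python
import re
from typing import Any, Dict, Optional, Union


class ExamplePostExtractorSketch:
    """Hypothetical extractor for URLs like https://example.com/post/<id>."""

    valid_url = re.compile(r"https?://(?:www\.)?example\.com/post/(\d+)")

    def match_link(self, url: str) -> Optional[re.Match]:
        # A match object if the URL fits the pattern, otherwise None.
        return self.valid_url.match(url)

    def suitable(self, url: str) -> bool:
        return self.match_link(url) is not None

    def download(self, item: Dict[str, Any]) -> Union[Dict[str, Any], bool]:
        match = self.match_link(item["url"])
        if match is None:
            return False  # unsupported URL
        # No network access in this sketch: only record the parsed post id.
        item["post_id"] = match.group(1)
        return item


extractor = ExamplePostExtractorSketch()
supported = extractor.suitable("https://example.com/post/42")
unsupported = extractor.suitable("https://other.example.org/video/7")
downloaded = extractor.download({"url": "https://www.example.com/post/42"})
rejected = extractor.download({"url": "https://other.example.org/video/7"})
```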




.. py:class:: Formatter

   Bases: :py:obj:`auto_archiver.core.BaseModule`


   Base class for implementing formatters in the media archiving framework.

   Subclasses must implement the `format` method to define their behavior.


   .. py:method:: format(item: auto_archiver.core.Metadata) -> auto_archiver.core.Media
      :abstractmethod:


      Formats a Metadata object into a user-viewable format (e.g. HTML) and stores it if needed.
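
As a rough illustration of formatting an item into a user-viewable form, the sketch below renders a small HTML fragment. `format_as_html` is a hypothetical function: it takes a plain dict instead of a `Metadata` object and returns a string, whereas the real `format` method returns a `Media` object.

```python
import html
from typing import Any, Dict


def format_as_html(item: Dict[str, Any]) -> str:
    """Render a title and a link as an HTML fragment, escaping user data."""
    title = html.escape(str(item.get("title", "untitled")))
    url = html.escape(str(item.get("url", "")), quote=True)
    return f'<article><h1>{title}</h1><a href="{url}">{url}</a></article>'


page = format_as_html(
    {"title": "Fish & chips", "url": "https://example.com/post/1"}
)
```

Escaping with `html.escape` keeps characters such as `&` in titles from breaking the generated markup.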



