core.base_module

Module Contents

class core.base_module.BaseModule

Bases: abc.ABC

Base module class. All modules should inherit from this class.

The exact methods a class implements depend on the type of module it is. Any module can also define a .setup() method to run setup code (e.g. logging in to a site or spinning up a browser).
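As a minimal sketch of the .setup() hook described above: the BaseModule class below is a local stand-in so the example runs on its own, and ExampleSiteModule and its session_ready attribute are hypothetical names, not part of the documented API.

```python
from abc import ABC


class BaseModule(ABC):
    """Local stand-in for core.base_module.BaseModule (assumption, not the real class)."""

    def setup(self):
        """Optional hook; this stand-in does nothing by default."""


class ExampleSiteModule(BaseModule):
    """Hypothetical module that needs one-off preparation before use."""

    def setup(self):
        # One-off preparation, e.g. logging in to a site or starting a browser.
        # Here a flag is set so the effect of calling setup() is visible.
        self.session_ready = True


module = ExampleSiteModule()
module.setup()
```

In a real module the body of setup() would hold the expensive or stateful work, so that constructing the module stays cheap.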

See consts.MODULE_TYPES for the types of modules you can create, noting that a subclass can be of multiple types. For example, a module that extracts data from a website and stores it in a database would be both an ‘extractor’ and a ‘database’ module.

Each module is a Python package and should have a __manifest__.py file in the same directory as the module file. The __manifest__.py specifies module information such as name, author, version and dependencies. See DEFAULT_MANIFEST for the default manifest structure.
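For illustration, a manifest might carry the four fields named above. The sketch below is an assumption about the shape only: the exact keys, value types and file layout are defined by DEFAULT_MANIFEST, which is not reproduced here, and every value is a placeholder.

```python
# Hypothetical contents for a __manifest__.py file.
# Key names follow the fields mentioned in the text; values are placeholders,
# and the shape of "dependencies" in particular is an assumption.
manifest = {
    "name": "Example Site Extractor",
    "author": "Jane Doe",
    "version": "0.1.0",
    "dependencies": ["requests"],
}
```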

MODULE_TYPES = ['feeder', 'extractor', 'enricher', 'database', 'storage', 'formatter']
config: Mapping[str, Any]
authentication: Mapping[str, Mapping[str, str]]
name: str
module_factory: core.module.ModuleFactory
tmp_dir: tempfile.TemporaryDirectory = None
property storages: list
config_setup(config: dict)
setup()
auth_for_site(site: str, extract_cookies=True) → Mapping[str, Any]

Returns the authentication information for a given site, which is used to authenticate with the site before extracting data. The site argument should be the domain of the site, e.g. ‘twitter.com’.

Parameters:
  • site – the domain of the site to get authentication information for

  • extract_cookies – whether to extract cookies from the given browser/file and return the cookie jar (disabling this can speed up processing if you don’t actually need the cookie jar).

Returns:

authdict dict -> { “username”: str, “password”: str, “api_key”: str, “api_secret”: str, “cookie”: str, “cookies_file”: str, “cookies_from_browser”: str, “cookies_jar”: CookieJar }

Global options:

  • cookies_from_browser: str - the name of the browser to extract cookies from (e.g. ‘chrome’, ‘firefox’); uses yt-dlp under the hood to extract them

  • cookies_file: str - the path to a cookies file to use for login

Currently, the dict for each site can have the following keys:

  • username: str - the username to use for login

  • password: str - the password to use for login

  • api_key: str - the API key to use for login

  • api_secret: str - the API secret to use for login

  • cookie: str - a cookie string to use for login (specific to this site)

  • cookies_file: str - the path to a cookies file to use for login (specific to this site)

  • cookies_from_browser: str - the name of the browser to extract cookies from (specific to this site)
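The per-site keys listed above could be used along the following lines. This is a sketch only: lookup_auth is a hypothetical, simplified stand-in for auth_for_site (exact domain match, no cookie extraction), and all domains and credentials are placeholders.

```python
from typing import Any, Mapping

# Hypothetical authentication mapping: site domain -> credentials for that site.
authentication: Mapping[str, Mapping[str, str]] = {
    "twitter.com": {"username": "alice", "password": "not-a-real-password"},
    "example.org": {"api_key": "KEY", "api_secret": "SECRET"},
}


def lookup_auth(site: str) -> Mapping[str, Any]:
    """Simplified stand-in for auth_for_site; returns {} for unknown sites."""
    return dict(authentication.get(site, {}))


# A caller can branch on whichever keys are present for the site.
creds = lookup_auth("twitter.com")
if "username" in creds and "password" in creds:
    login_method = "password"
elif "api_key" in creds:
    login_method = "api_key"
else:
    login_method = "anonymous"
```

Checking for keys rather than assuming them keeps a module usable when a site has only some of the optional credentials configured.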

repr()