antibot_extractor_enricher.dropin

Module Contents

class antibot_extractor_enricher.dropin.Dropin(sb: seleniumbase.SB, extractor: auto_archiver.core.Extractor)

Base class for the drop-ins of the antibot extractor enricher module. Subclass it to add handling for a specific website.

static documentation() → Mapping[str, str]

Each Dropin should auto-document itself with this method. The returned dictionary can include:

- 'name': A string representing the name of the dropin.
- 'description': A string describing the functionality of the dropin.
- 'site': A string representing the site this dropin is for.
- 'authentication': A dictionary with an authentication example for the site.
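As a rough illustration of the keys listed above, a dropin for a hypothetical site might document itself as follows. `ExampleDropin`, the site `example.com`, and the shape of the authentication example are all assumptions made for this sketch; a real dropin would subclass `Dropin`.

```python
from typing import Mapping


class ExampleDropin:
    """Hypothetical stand-in; a real dropin would subclass Dropin."""

    @staticmethod
    def documentation() -> Mapping[str, str]:
        # The annotation mirrors the documented signature, even though
        # 'authentication' holds a dictionary rather than a string.
        return {
            "name": "Example Dropin",
            "description": "Handles posts hosted on example.com.",
            "site": "example.com",
            "authentication": {
                # Assumed layout of an authentication example.
                "example.com": {"username": "my_user", "password": "my_password"},
            },
        }
```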

sb: seleniumbase.SB

extractor: auto_archiver.core.Extractor

abstract static suitable(url: str) → bool

Check if the URL is suitable for processing with this dropin.

Parameters: url: The URL to check.
Returns: True if the URL is suitable for processing, False otherwise.
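One possible way to implement this check is to compare the URL's hostname against the site the dropin targets. The sketch below assumes a hypothetical `ExampleDropin` for `example.com`; it is not taken from any existing dropin.

```python
from urllib.parse import urlparse


class ExampleDropin:
    """Hypothetical stand-in; a real dropin would subclass Dropin."""

    @staticmethod
    def suitable(url: str) -> bool:
        # Accept example.com itself and any of its subdomains.
        host = (urlparse(url).hostname or "").lower()
        return host == "example.com" or host.endswith(".example.com")
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting look-alike domains such as `notexample.com`.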

static sanitize_url(url: str) → str: Used to clean URLs before processing them.
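What "cleaning" means depends on the site. As one assumed interpretation, the sketch below drops a few common tracking parameters and the fragment while keeping everything else. The `TRACKING_PARAMS` set and the standalone function form are illustrative choices, not part of the module.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters to discard; adjust per site.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def sanitize_url(url: str) -> str:
    parts = urlsplit(url)
    # Keep every query parameter that is not in the tracking list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # Rebuild the URL without the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```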

static images_selectors() → str: CSS selector to find images in the HTML page.

static video_selectors() → str: CSS selector to find videos in the HTML page.
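For illustration, a dropin could return plain CSS selector strings from both methods. The selectors and the `ExampleDropin` class below are hypothetical and would need to match the markup of the target site.

```python
class ExampleDropin:
    """Hypothetical stand-in; a real dropin would subclass Dropin."""

    @staticmethod
    def images_selectors() -> str:
        # A comma-separated selector list matches any of its parts.
        return "article img, div.gallery img"

    @staticmethod
    def video_selectors() -> str:
        return "article video"
```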

js_for_image_css_selectors() → str

A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Image elements matching that selector.

You can override this instead of images_selectors for more control over scraped images.

js_for_video_css_selectors() → str

A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Video elements matching that selector.

You can override this instead of video_selectors for more control over scraped videos.

abstract open_page(url) → bool

Make sure the page is opened, even if it requires authentication, captcha solving, etc.

Parameters: url: The URL to open.
Returns: True if success, False otherwise.

add_extra_media(to_enrich: auto_archiver.core.Metadata) → tuple[int, int]: Extract image and/or video data from the currently open post with SeleniumBase. Media is added to the to_enrich Metadata object. Returns a tuple (number of images added, number of videos added).

hit_auth_wall() → bool: Custom check for whether the current page is behind an authentication wall. If it returns True, the default global auth wall detector is used instead. If it returns False, no auth wall is detected and the page is considered open.
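The return values described above can be sketched as follows: return False when the dropin can confirm the content is visible, and True to fall back to the default global detector. `FakeSB` is a tiny stand-in written only so the sketch runs without a browser, and the `article.post` selector and `ExampleDropin` class are hypothetical. With a real `seleniumbase.SB` instance, `is_element_visible` would be called on `self.sb` in the same way.

```python
class FakeSB:
    """Minimal stand-in for the SeleniumBase driver, for this sketch only."""

    def __init__(self, visible_selectors):
        self._visible = set(visible_selectors)

    def is_element_visible(self, selector: str) -> bool:
        return selector in self._visible


class ExampleDropin:
    """Hypothetical stand-in; a real dropin would subclass Dropin."""

    def __init__(self, sb):
        self.sb = sb

    def hit_auth_wall(self) -> bool:
        # The post body is rendered, so treat the page as open.
        if self.sb.is_element_visible("article.post"):
            return False
        # Otherwise defer to the default global auth wall detector.
        return True
```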