antibot_extractor_enricher.dropin#

Module Contents#

class antibot_extractor_enricher.dropin.Dropin(sb: seleniumbase.SB, extractor: auto_archiver.core.Extractor)#

Base class for drop-ins of the antibot extractor enricher module. Each drop-in subclass handles a specific website.

static documentation() → Mapping[str, str]#

Each Dropin should document itself through this method. The returned dictionary can include:

- ‘name’: A string representing the name of the dropin.
- ‘description’: A string describing the functionality of the dropin.
- ‘site’: A string representing the site this dropin is for.
- ‘authentication’: A dictionary with an authentication example for the site.
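As an illustration of the keys listed above, here is a minimal sketch of a self-documenting drop-in. `ExampleDropin`, the `example.com` site, and the shape of the authentication entry are all hypothetical; in a real module the class would subclass `Dropin` rather than stand alone.

```python
from typing import Mapping


class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    @staticmethod
    def documentation() -> Mapping[str, str]:
        # Keys follow the list in the Dropin documentation.
        return {
            "name": "Example Dropin",
            "description": "Handles posts on example.com, including login walls.",
            "site": "example.com",
            # The layout of this authentication example is an assumption.
            "authentication": {
                "example.com": {
                    "username": "your-username",
                    "password": "your-password",
                }
            },
        }
```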

sb: seleniumbase.SB#

extractor: auto_archiver.core.Extractor#

static suitable(url: str) → bool#

Abstract method. Check if the URL is suitable for processing with this dropin.

:param url: The URL to check.
:return: True if the URL is suitable for processing, False otherwise.
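One way to implement this check is to compare the URL's hostname against the site the drop-in targets. The sketch below assumes a hypothetical drop-in for `example.com`; the class name and host are illustrative only, and a real drop-in would subclass `Dropin`.

```python
from urllib.parse import urlparse


class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    @staticmethod
    def suitable(url: str) -> bool:
        # hostname is None for strings that do not parse as absolute URLs.
        host = (urlparse(url).hostname or "").lower()
        # Accept the bare domain and any of its subdomains.
        return host == "example.com" or host.endswith(".example.com")
```

Matching on the parsed hostname rather than a substring avoids accepting look-alike domains such as `notexample.com`.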

static sanitize_url(url: str) → str#: Used to clean URLs before processing them.
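What "cleaning" means is up to each drop-in. As one possible interpretation, the sketch below drops common tracking parameters and the fragment; the list of parameters and the `ExampleDropin` class are assumptions for illustration, not behaviour taken from the module.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters to discard.
TRACKING_KEYS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    @staticmethod
    def sanitize_url(url: str) -> str:
        parts = urlsplit(url)
        # Keep every query parameter that is not a known tracking key.
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
        ]
        # Rebuild the URL without the fragment.
        return urlunsplit(
            (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
        )
```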

static images_selectors() → str#: CSS selector to find images in the HTML page.

static video_selectors() → str#: CSS selector to find videos in the HTML page.
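Both selector methods return a single CSS selector string, so several patterns can be combined with commas. A minimal sketch, assuming a hypothetical page layout where media sits inside an `article` element:

```python
class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    @staticmethod
    def images_selectors() -> str:
        # Comma-separated selectors match any of the listed patterns.
        return "article img, div.gallery img"

    @staticmethod
    def video_selectors() -> str:
        return "article video, article video source"
```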

js_for_image_css_selectors() → str#

A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Image elements according to the selection.

You can override this instead of images_selectors for more control over scraped images.
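A sketch of what such an override might look like follows. How the module passes the selector into the script and consumes its result is not documented here, so the string interpolation, the top-level `return`, and the 200-pixel width filter are all assumptions; `ExampleDropin` is hypothetical.

```python
import json


class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    @staticmethod
    def images_selectors() -> str:
        return "article img"

    def js_for_image_css_selectors(self) -> str:
        # json.dumps yields a safely quoted JavaScript string literal.
        selector = json.dumps(self.images_selectors())
        # Assumed contract: the script returns an array of image elements.
        # The extra filter skips small images such as icons and avatars.
        return f"""
            return Array.from(document.querySelectorAll({selector}))
                .filter(img => img.naturalWidth >= 200);
        """
```

Overriding the script rather than the selector is useful when the filter depends on properties that CSS cannot express, such as the rendered size of an image.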

js_for_video_css_selectors() → str#

A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Video elements according to the selection.

You can override this instead of video_selectors for more control over scraped videos.

abstract open_page(url) → bool#

Make sure the page is opened, even if it requires authentication, captcha solving, etc.

:param url: The URL to open.
:return: True on success, False otherwise.
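The sketch below shows one possible shape for this method: open the URL, dismiss a consent dialog if one appears, then report whether the main content is visible. `FakeSB` is a stand-in so the example runs without a browser; it mimics only the three SeleniumBase methods used here (`open`, `click_if_visible`, `is_element_visible`). The selectors and the `ExampleDropin` class are hypothetical.

```python
class FakeSB:
    """Stand-in for seleniumbase.SB so the sketch runs without a browser."""

    def __init__(self, visible=()):
        self.visible = set(visible)  # selectors currently "on screen"
        self.opened = []             # URLs passed to open()

    def open(self, url):
        self.opened.append(url)

    def is_element_visible(self, selector):
        return selector in self.visible

    def click_if_visible(self, selector):
        # Clicking the dialog button makes it disappear.
        self.visible.discard(selector)


class ExampleDropin:  # hypothetical; a real drop-in would subclass Dropin
    def __init__(self, sb, extractor=None):
        self.sb = sb
        self.extractor = extractor

    def open_page(self, url) -> bool:
        self.sb.open(url)
        # Dismiss a cookie banner if present (selector is an assumption).
        self.sb.click_if_visible("button#accept-cookies")
        # Treat a visible article element as proof the page loaded.
        return self.sb.is_element_visible("article")
```

With a real browser session, the same method body would run against the `sb` attribute that the `Dropin` constructor receives.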

add_extra_media(to_enrich: auto_archiver.core.Metadata) → tuple[int, int]#

Extract image and/or video data from the currently open post with SeleniumBase. Media is added to the to_enrich Metadata object.

:return: A tuple (number of images added, number of videos added).