antibot_extractor_enricher.dropin
Module Contents
- class antibot_extractor_enricher.dropin.Dropin(sb: seleniumbase.SB, extractor: auto_archiver.core.Extractor)
Base class for the drop-ins of the antibot extractor enricher module. Each drop-in subclasses it to handle a specific website.
- static documentation() -> Mapping[str, str]
Each Dropin should document itself through this method. The returned dictionary can include:
  - ‘name’: A string representing the name of the dropin.
  - ‘description’: A string describing the functionality of the dropin.
  - ‘site’: A string representing the site this dropin is for.
  - ‘authentication’: A dictionary with an authentication example for the site.
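As a minimal sketch of such a self-description, the hypothetical class below returns the four keys listed above. The class name, the site, and every value are invented for illustration; a real drop-in would define the method on a subclass of Dropin. The shape of the ‘authentication’ entry is an assumption, which is why the sketch annotates the return value as `Mapping[str, Any]`.

```python
from typing import Any, Mapping


class ExampleDropin:
    """Hypothetical stand-in; a real drop-in would subclass Dropin."""

    @staticmethod
    def documentation() -> Mapping[str, Any]:
        # All values are illustrative placeholders, not a real site configuration.
        return {
            "name": "Example Dropin",
            "description": "Opens example.com posts and collects their images and videos.",
            "site": "example.com",
            # Assumed shape: the reference only says this is a dictionary
            # holding an authentication example for the site.
            "authentication": {
                "example.com": {"username": "my_username", "password": "my_password"},
            },
        }
```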
- sb: seleniumbase.SB
- extractor: auto_archiver.core.Extractor
- abstract static suitable(url: str) -> bool
Check if the URL is suitable for processing with this dropin.
:param url: The URL to check.
:return: True if the URL is suitable for processing, False otherwise.
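One way to implement such a check is to compare the hostname of the URL against the site the drop-in targets. The sketch below assumes a hypothetical drop-in for `example.com`; the class name and the matching rule are illustrative and not taken from the module.

```python
from urllib.parse import urlparse


class ExampleDropin:
    """Hypothetical stand-in; a real drop-in would subclass Dropin."""

    @staticmethod
    def suitable(url: str) -> bool:
        # Compare the hostname rather than searching the raw string, so that
        # "https://other.org/?next=example.com" is not matched by accident.
        host = (urlparse(url).hostname or "").lower()
        return host == "example.com" or host.endswith(".example.com")
```

Matching on the parsed hostname also rejects strings that are not URLs at all, because `urlparse` then yields no hostname.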
- static sanitize_url(url: str) -> str
Used to clean URLs before processing them.
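What counts as cleaning depends on the site. As one possible sketch, the hypothetical implementation below drops the fragment and a few common tracking parameters while keeping every other query parameter. The parameter list and the class name are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of query parameters that carry no content information.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


class ExampleDropin:
    """Hypothetical stand-in; a real drop-in would subclass Dropin."""

    @staticmethod
    def sanitize_url(url: str) -> str:
        parts = urlsplit(url)
        # Keep the remaining parameters in their original order.
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in TRACKING_PARAMS
        ]
        # An empty last component drops the "#fragment" part.
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```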
- static images_selectors() -> str
CSS selector to find images in the HTML page.
- static video_selectors() -> str
CSS selector to find videos in the HTML page.
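Both selector methods return a single string, so several patterns can be combined into one comma-separated CSS selector group. The sketch below shows this for a hypothetical drop-in; the class name and the selectors are invented and would need to match the markup of the target site.

```python
class ExampleDropin:
    """Hypothetical stand-in; a real drop-in would subclass Dropin."""

    @staticmethod
    def images_selectors() -> str:
        # A comma-separated selector group matches elements that satisfy any part.
        return "article img, div.gallery img"

    @staticmethod
    def video_selectors() -> str:
        return "article video, video[src]"
```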
- js_for_image_css_selectors() -> str
A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Image elements according to the selection.
You can override this instead of images_selectors for more control over scraped images.
- js_for_video_css_selectors() -> str
A configurable JS script that receives a CSS selector from the dropin itself and returns an array of Video elements according to the selection.
You can override this instead of video_selectors for more control over scraped videos.
- abstract open_page(url) -> bool
Make sure the page is opened, even if it requires authentication, captcha solving, etc.
:param url: The URL to open.
:return: True if success, False otherwise.
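The simplest form of this contract is to open the URL and report whether the expected content is visible. The sketch below assumes a hypothetical drop-in whose posts live inside an `article` element. `Browser` is a stand-in protocol covering only the two SeleniumBase methods the sketch uses (`open` and `is_element_visible`), and `FakeBrowser` exists only so the example runs without a real browser. Login and captcha handling are left out.

```python
from typing import Protocol


class Browser(Protocol):
    """The subset of the SeleniumBase driver interface used by this sketch."""

    def open(self, url: str) -> None: ...
    def is_element_visible(self, selector: str) -> bool: ...


class ExampleDropin:
    """Hypothetical stand-in; a real drop-in would subclass Dropin."""

    def __init__(self, sb: Browser) -> None:
        self.sb = sb

    def open_page(self, url: str) -> bool:
        self.sb.open(url)
        # Assumption: a visible <article> element means the post loaded.
        return self.sb.is_element_visible("article")


class FakeBrowser:
    """Test double that records opened URLs instead of driving a browser."""

    def __init__(self, has_article: bool) -> None:
        self.has_article = has_article
        self.opened: list[str] = []

    def open(self, url: str) -> None:
        self.opened.append(url)

    def is_element_visible(self, selector: str) -> bool:
        return self.has_article and selector == "article"
```

Passing the browser object in through the constructor mirrors the `sb` attribute of the class and keeps the method testable without launching a browser.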
- add_extra_media(to_enrich: auto_archiver.core.Metadata) -> tuple[int, int]
Extract image and/or video data from the currently open post with SeleniumBase. Media is added to the to_enrich Metadata object.
:return: A tuple (number of Images added, number of Videos added).