utils.deletion_detection#
Deletion Detection Utilities
Provides best-effort detection of deleted, missing, or unavailable content across various social media platforms, based on the presence of expected keywords.
This module identifies removed content in order to:
- Document content that existed but was deleted
- Track patterns of content removal
- Preserve metadata about missing content
Module Contents#
- class utils.deletion_detection.DeletionIndicators#
Platform-specific indicators that content has been deleted or is unavailable, alongside generic indicators.
- TWITTER = ["Hmm...this page doesn't exist", 'Try searching for something else', 'This Tweet is...#
- FACEBOOK = ["This content isn't available", "Sorry, this content isn't available", 'This content is no...#
- INSTAGRAM = ["Sorry, this page isn't available", 'The link you followed may be broken', 'Media not found or...#
- TIKTOK = ["Couldn't find this account", 'This video is no longer available', 'This video is currently...#
- YOUTUBE = ["This video isn't available anymore", 'Video unavailable', 'This video has been removed', 'This...#
- REDDIT = ['this post has been removed', 'this comment has been removed', '[removed]', '[deleted]', 'page...#
- VK = ['Post deleted', 'Page not found', 'Content unavailable', 'Access denied']#
- TELEGRAM = ['Message not found', 'Deleted message', 'Channel is private']#
- GENERIC = ['has been removed', 'no longer available', 'content removed', 'access denied', 'page not found']#
- classmethod all_indicators() → List[str]#
Returns all deletion indicators from all platforms.
- classmethod for_url(url: str) → List[str]#
Returns platform-specific indicators based on URL domain.
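The following is a minimal sketch of how the two class methods could behave, not the module's actual implementation. `SketchIndicators` and its `_DOMAINS` mapping are hypothetical names; the sketch assumes that `for_url` matches on the URL's hostname and that generic indicators are appended to the platform-specific ones. Only the three lists shown in full above are reproduced.

```python
from typing import Dict, List
from urllib.parse import urlparse


class SketchIndicators:
    """Hypothetical stand-in for DeletionIndicators with a reduced set of lists."""

    VK = ["Post deleted", "Page not found", "Content unavailable", "Access denied"]
    TELEGRAM = ["Message not found", "Deleted message", "Channel is private"]
    GENERIC = ["has been removed", "no longer available", "content removed",
               "access denied", "page not found"]

    # Assumed domain-to-attribute mapping; the real module may match differently.
    _DOMAINS: Dict[str, str] = {"vk.com": "VK", "t.me": "TELEGRAM"}

    @classmethod
    def all_indicators(cls) -> List[str]:
        # Concatenate every list defined on the class.
        return cls.VK + cls.TELEGRAM + cls.GENERIC

    @classmethod
    def for_url(cls, url: str) -> List[str]:
        # Match the hostname exactly or as a subdomain (e.g. m.vk.com).
        host = (urlparse(url).hostname or "").lower()
        for domain, attr in cls._DOMAINS.items():
            if host == domain or host.endswith("." + domain):
                # Assumption: generic indicators are appended to platform ones.
                return getattr(cls, attr) + cls.GENERIC
        # Assumption: unknown domains fall back to the generic list only.
        return list(cls.GENERIC)
```

Matching on the parsed hostname rather than on a substring of the URL avoids false matches when a platform name appears in a path or query string.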
- utils.deletion_detection.detect_deletion(html_content: str = None, page_title: str = None, error_message: str = None, url: str = None, video_data: dict = None) → Dict[str, any] | None#
Best-effort deletion detection across multiple signals.
Checks HTML content, page titles, error messages, and video metadata for indicators that content has been deleted or is unavailable.
- Parameters:
html_content – Raw HTML source of the page
page_title – Browser page title
error_message – Any error message from the extractor
url – The URL being archived (for platform-specific detection)
video_data – Video metadata from yt-dlp or other extractors
- Returns:
Dictionary with deletion details if detected, None otherwise. Format:
{
    "is_deleted": True,
    "indicator": "specific text that was found",
    "source": "html|title|error|metadata",
    "platform": "twitter|facebook|etc"
}
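As an illustration of the multi-signal check described above, here is a simplified sketch rather than the real function. `sketch_detect` is a hypothetical name; the sketch assumes case-insensitive substring matching, an error, title, html check order, and a `"generic"` platform label. It uses only the GENERIC list and omits the `url` and `video_data` signals.

```python
from typing import Any, Dict, Optional

GENERIC = ["has been removed", "no longer available", "content removed",
           "access denied", "page not found"]


def sketch_detect(html_content: Optional[str] = None,
                  page_title: Optional[str] = None,
                  error_message: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Return a result dict for the first signal containing an indicator, else None."""
    # Assumption: signals are checked in this order; the real order is not documented.
    signals = [("error", error_message), ("title", page_title), ("html", html_content)]
    for source, text in signals:
        if not text:
            continue  # Skip signals that were not supplied.
        lowered = text.lower()
        for indicator in GENERIC:
            if indicator.lower() in lowered:
                return {"is_deleted": True, "indicator": indicator,
                        "source": source, "platform": "generic"}
    return None


# A page title alone is enough to produce a result in this sketch.
result = sketch_detect(page_title="Page Not Found | Example")
```

Because the function returns None when nothing matches, callers can test the result for truthiness before reading its keys.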
- utils.deletion_detection.flag_as_deleted(metadata, deletion_info: Dict[str, any]) → None#
Flags a metadata object as deleted or unavailable by adding tentative deletion information to it.
- Parameters:
metadata – Metadata object to update
deletion_info – Dictionary from detect_deletion()
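The sketch below shows one way a detection result could be copied onto a metadata object. The metadata class is not described in this section, so `SketchMetadata`, its `set` method, `sketch_flag_as_deleted`, and all key names written here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchMetadata:
    """Hypothetical metadata container; the real class is not described here."""
    fields: Dict[str, Any] = field(default_factory=dict)

    def set(self, key: str, value: Any) -> None:
        self.fields[key] = value


def sketch_flag_as_deleted(metadata: SketchMetadata, deletion_info: Dict[str, Any]) -> None:
    # Key names are illustrative assumptions, not the module's real keys.
    metadata.set("possibly_deleted", True)
    metadata.set("deletion_indicator", deletion_info.get("indicator"))
    metadata.set("deletion_source", deletion_info.get("source"))
    metadata.set("deletion_platform", deletion_info.get("platform"))


# Example: flag a metadata object using a result in the documented format.
meta = SketchMetadata()
info = {"is_deleted": True, "indicator": "Post deleted", "source": "html", "platform": "vk"}
sketch_flag_as_deleted(meta, info)
```

The function mutates the object in place and returns nothing, matching the documented `None` return type.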