utils.url
Module Contents
- utils.url.AUTHWALL_URLS
- utils.url.domain_for_url(url: str) → str
SECURITY: parses the domain with urllib to avoid potential security issues.
- utils.url.clean(url: str) → str
- utils.url.is_auth_wall(url: str) → bool
Checks whether the URL is behind an authentication wall, meaning that steps such as wayback, wacz, … may not work.
- utils.url.remove_get_parameters(url: str) → str
- utils.url.is_relevant_url(url: str) → bool
Detects whether a discovered media URL is recurring and therefore irrelevant to a specific archive. Useful, for example, when enumerating the media files in WARC files, which include profile pictures, favicons, etc.
- utils.url.twitter_best_quality_url(url: str) → str
Some Twitter image URLs point to a lower-quality version of the image; this returns the URL pointing to the highest (original) quality.
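
The listing above gives signatures but not bodies, so the following is only a minimal sketch of how helpers with these signatures might behave, not the module's actual implementation. Every `*_sketch` name and `EXAMPLE_AUTHWALL_PATTERNS` is hypothetical; the real contents of `AUTHWALL_URLS` are not shown here, and the `name=orig` rewrite assumes the common `pbs.twimg.com` query-string form of Twitter image URLs.

```python
import re
from urllib.parse import urlparse, urlunparse

# Hypothetical stand-in for utils.url.AUTHWALL_URLS; the real patterns
# are not part of this listing.
EXAMPLE_AUTHWALL_PATTERNS = [
    re.compile(r"^https?://([\w-]+\.)?example\.com/private/"),
]


def domain_for_url_sketch(url: str) -> str:
    """Return the network location (host and optional port) as parsed by urllib."""
    # Relying on urllib instead of manual string splitting avoids
    # mis-parsing unusual or malicious URLs.
    return urlparse(url).netloc


def remove_get_parameters_sketch(url: str) -> str:
    """Drop the query string while keeping scheme, host, path and fragment."""
    parts = urlparse(url)
    return urlunparse(parts._replace(query=""))


def is_auth_wall_sketch(url: str) -> bool:
    """Return True if the URL matches any of the example auth-wall patterns."""
    return any(pattern.match(url) for pattern in EXAMPLE_AUTHWALL_PATTERNS)


def twitter_best_quality_url_sketch(url: str) -> str:
    """Rewrite the first `name=<size>` query value to `name=orig`."""
    # Assumption: the image size is selected through a `name` query parameter.
    return re.sub(r"name=(\w+)", "name=orig", url, count=1)
```

In this sketch, `domain_for_url_sketch("https://www.example.org:8080/a?b=1")` returns `"www.example.org:8080"`, and `remove_get_parameters_sketch` keeps the fragment while discarding the query. A caller could use `is_auth_wall_sketch` to skip steps that cannot see behind a login, as the `is_auth_wall` description suggests.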