utils.url

Module Contents

utils.url.AUTHWALL_URLS
utils.url.check_url_or_raise(url: str) → bool | ValueError

Blocks localhost, private, reserved, and link-local IP addresses, as well as every scheme other than http and https.
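The checks described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the name sketch_check_url is hypothetical, and the sketch assumes that hostnames which are not IP literals are accepted without DNS resolution.

```python
import ipaddress
from urllib.parse import urlparse


def sketch_check_url(url: str) -> bool:
    """Hypothetical stand-in: return True for an allowed URL, else raise ValueError."""
    parsed = urlparse(url)

    # Only http and https are allowed.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Blocked scheme: {parsed.scheme!r}")

    host = parsed.hostname  # lowercased, without port or credentials
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost":
        raise ValueError("Blocked host: localhost")

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: this sketch does no DNS lookup.
        return True

    if ip.is_loopback or ip.is_private or ip.is_reserved or ip.is_link_local:
        raise ValueError(f"Blocked IP address: {ip}")
    return True
```

A caller would typically wrap the call in try/except ValueError and skip any URL that is rejected.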

utils.url.domain_for_url(url: str) → str

SECURITY: Parses the domain with urllib to avoid potential security issues.
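To illustrate why a parser is safer than substring matching, here is a small sketch using urllib.parse. The function name sketch_domain is hypothetical, and whether the real function returns the host with or without the port is not stated here; this sketch returns the bare hostname.

```python
from urllib.parse import urlparse


def sketch_domain(url: str) -> str:
    """Hypothetical stand-in: return the lowercased hostname of a URL."""
    # urlparse separates credentials, host, port, path, and query, so a
    # domain name that only appears in the path or query is never
    # mistaken for the host.
    return urlparse(url).hostname or ""


# A naive substring check would treat this as a twitter.com URL.
tricky = "https://evil.example/?next=twitter.com"
print("twitter.com" in tricky)   # True (misleading)
print(sketch_domain(tricky))     # evil.example
```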

utils.url.clean(url: str) → str
utils.url.is_auth_wall(url: str) → bool

Checks whether the URL is behind an authentication wall, meaning steps like wayback, wacz, … may not work.
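One plausible way to implement such a check is to match the URL against a list of known patterns. The sketch below is an assumption about the approach: the pattern list and the name sketch_is_auth_wall are hypothetical placeholders and do not reflect the actual contents of AUTHWALL_URLS.

```python
import re

# Hypothetical patterns for illustration only; not the real AUTHWALL_URLS.
EXAMPLE_AUTHWALL_PATTERNS = [
    re.compile(r"^https?://members\.example\.com/"),
    re.compile(r"^https?://(www\.)?example\.org/private/"),
]


def sketch_is_auth_wall(url: str) -> bool:
    """Hypothetical stand-in: True if the URL matches any known auth-wall pattern."""
    return any(pattern.match(url) for pattern in EXAMPLE_AUTHWALL_PATTERNS)
```

A pipeline could use the result to skip steps that fetch the page anonymously, since those would only capture a login screen.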

utils.url.remove_get_parameters(url: str) → str
utils.url.is_relevant_url(url: str) → bool

Determines whether a detected media URL is recurring and therefore irrelevant to a specific archive. Useful, for example, when enumerating the media files in WARC files, which include profile pictures, favicons, etc.
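A filter of this kind can be sketched as a lookup against fragments that tend to appear in recurring assets. The fragment list and the name sketch_is_relevant are hypothetical; the rules the module actually applies are not documented here.

```python
from urllib.parse import urlparse

# Hypothetical fragments that mark recurring assets; illustration only.
EXAMPLE_RECURRING_FRAGMENTS = ("favicon", "/profile_images/", "/emoji/")


def sketch_is_relevant(url: str) -> bool:
    """Hypothetical stand-in: False for URLs that look like recurring assets."""
    parsed = urlparse(url)
    # Only inspect host and path so query strings cannot cause false matches.
    target = (parsed.netloc + parsed.path).lower()
    return not any(fragment in target for fragment in EXAMPLE_RECURRING_FRAGMENTS)


media_urls = [
    "https://example.com/favicon.ico",
    "https://example.com/media/photo123.jpg",
]
relevant = [u for u in media_urls if sketch_is_relevant(u)]
```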

utils.url.twitter_best_quality_url(url: str) → str

Some Twitter image URLs point to a lower-than-best quality version; this returns the URL pointing to the highest (original) quality.
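Twitter image URLs commonly carry a name query parameter (for example name=small), and name=orig requests the original file. The sketch below rewrites that parameter with urllib.parse. It is an illustration under that assumption, not the module's implementation, and the name sketch_best_quality is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def sketch_best_quality(url: str) -> str:
    """Hypothetical stand-in: set the name= size parameter to 'orig'."""
    parts = urlparse(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    # Leave URLs without a size parameter untouched.
    if not any(key == "name" for key, _ in query):
        return url

    query = [(key, "orig" if key == "name" else value) for key, value in query]
    return urlunparse(parts._replace(query=urlencode(query)))
```

URLs that have no name parameter are returned unchanged, so the function is safe to apply to every media URL in a batch.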