utils.url
=========

.. py:module:: utils.url






Module Contents
---------------

.. py:data:: AUTHWALL_URLS

.. py:function:: check_url_or_raise(url: str) -> bool | ValueError

   Rejects URLs that point to localhost or to private, reserved, or link-local IP addresses, as well as any URL whose scheme is not http or https. A rejected URL raises ``ValueError``.
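
The checks described for ``check_url_or_raise`` can be sketched as follows. This is a minimal illustration, not the module's implementation: the name ``check_url_sketch``, the error messages, and the order of the checks are assumptions, and the sketch only inspects IP literals without resolving domain names.

```python
from ipaddress import ip_address
from urllib.parse import urlparse


def check_url_sketch(url: str) -> bool:
    """Hypothetical stand-in for check_url_or_raise (assumed behaviour)."""
    parsed = urlparse(url)

    # Only http and https are allowed.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme in {url!r}")

    host = parsed.hostname  # lowercased, without port or credentials
    if not host:
        raise ValueError(f"missing hostname in {url!r}")
    if host == "localhost":
        raise ValueError(f"localhost is blocked: {url!r}")

    try:
        ip = ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a regular domain name.
        return True

    if ip.is_loopback or ip.is_private or ip.is_reserved or ip.is_link_local:
        raise ValueError(f"blocked IP address in {url!r}")
    return True
```

Under these assumptions, ``check_url_sketch("https://example.com/page")`` returns ``True``, while ``http://127.0.0.1``, ``http://10.0.0.5/x``, and ``ftp://example.com`` each raise ``ValueError``.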


.. py:function:: domain_for_url(url: str) -> str

   SECURITY: parses the domain with ``urllib`` to avoid potential security issues.
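
To illustrate why parsing with ``urllib`` is safer than string handling, here is a small sketch. The helper name ``domain_sketch`` is hypothetical, and returning the ``hostname`` attribute (rather than, say, ``netloc``) is an assumption; which component ``domain_for_url`` actually returns is not stated here.

```python
from urllib.parse import urlparse


def domain_sketch(url: str) -> str:
    """Hypothetical helper: return the host part of a URL via urllib."""
    # hostname drops any credentials and port and is lowercased;
    # it is None when the URL has no network location.
    return urlparse(url).hostname or ""


# A naive prefix check would treat this URL as belonging to example.com,
# but "example.com" is only the userinfo part; the real host is evil.test.
tricky = "https://example.com@evil.test/login"
naive_ok = tricky.startswith("https://example.com")
parsed_host = domain_sketch(tricky)
```

Here ``naive_ok`` is ``True`` while ``parsed_host`` is ``"evil.test"``, which is the kind of mismatch that parsing with ``urllib`` avoids.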


.. py:function:: clean(url: str) -> str

.. py:function:: is_auth_wall(url: str) -> bool

   Checks whether the URL is behind an authentication wall, meaning that steps such as wayback, wacz, ... may not work.
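
One plausible shape for such a check is a list of URL patterns tested in turn. The sketch below assumes that approach; the pattern list ``AUTHWALL_PATTERNS_SKETCH``, its example domains, and the function name are all hypothetical, since the contents of ``AUTHWALL_URLS`` are not shown in this reference.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the real
# contents of AUTHWALL_URLS.
AUTHWALL_PATTERNS_SKETCH = [
    re.compile(r"https?://(www\.)?members\.example\.com/"),
    re.compile(r"https?://private\.example\.org/c/\d+"),
]


def is_auth_wall_sketch(url: str) -> bool:
    """Return True when the URL matches any known auth-wall pattern."""
    return any(pattern.match(url) for pattern in AUTHWALL_PATTERNS_SKETCH)
```

A caller could use a result of ``True`` to skip or warn about archiving steps that cannot log in.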


.. py:function:: remove_get_parameters(url: str) -> str
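
As the name of ``remove_get_parameters`` suggests, the query string is stripped from the URL. A minimal sketch using ``urllib`` is shown below; the helper name is hypothetical, and whether the real function keeps the fragment is an assumption.

```python
from urllib.parse import urlparse, urlunparse


def remove_get_parameters_sketch(url: str) -> str:
    """Hypothetical helper: drop the query string, keep everything else."""
    parsed = urlparse(url)
    # ParseResult is a named tuple, so _replace returns a modified copy.
    return urlunparse(parsed._replace(query=""))
```

With this sketch, ``https://example.com/a/b?x=1&y=2`` becomes ``https://example.com/a/b``.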

.. py:function:: is_relevant_url(url: str) -> bool

   Detects whether a discovered media URL is a recurring asset and therefore irrelevant to a specific archive. Useful, for example, when enumerating the media files in WARC files, which include profile pictures, favicons, etc.
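
A filter of this kind might look like the sketch below. The marker list, the function name ``is_relevant_url_sketch``, and the substring-matching strategy are assumptions made for illustration; the actual rules used by ``is_relevant_url`` are not documented here.

```python
from urllib.parse import urlparse

# Hypothetical markers of recurring assets (favicons, profile pictures, ...).
IRRELEVANT_MARKERS_SKETCH = ("favicon", "/profile_images/", "/emoji/")


def is_relevant_url_sketch(url: str) -> bool:
    """Return False for URLs whose path looks like a recurring asset."""
    path = urlparse(url).path.lower()
    return not any(marker in path for marker in IRRELEVANT_MARKERS_SKETCH)


# Example: keep only the media URLs worth attaching to an archive.
found = [
    "https://example.com/favicon.ico",
    "https://example.com/media/photo123.jpg",
    "https://example.com/profile_images/1/avatar.jpg",
]
relevant = [u for u in found if is_relevant_url_sketch(u)]
```

In this example only the ``photo123.jpg`` URL remains in ``relevant``.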


.. py:function:: twitter_best_quality_url(url: str) -> str

   Some Twitter image URLs point to a lower-than-best-quality version of the image;
   this returns the URL pointing to the highest (original) quality.
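
The rewrite can be sketched as a query-parameter substitution. This assumes that the image size is selected by a ``name=<size>`` query parameter and that ``name=orig`` requests the original; both the helper name and that URL convention are assumptions rather than something this reference confirms.

```python
import re


def twitter_best_quality_sketch(url: str) -> str:
    """Hypothetical helper: request the original-size image variant."""
    # Replace the first name=<size> query parameter with name=orig.
    # The lookbehind ensures we match a whole parameter called "name",
    # not the tail of another one such as "username".
    return re.sub(r"(?<=[?&])name=\w+", "name=orig", url, count=1)
```

For instance, ``https://pbs.twimg.com/media/ABC123?format=jpg&name=small`` becomes ``https://pbs.twimg.com/media/ABC123?format=jpg&name=orig``, and URLs without a ``name`` parameter are returned unchanged.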


