API Docs

domain_utils.domain_utils module

domain_utils.domain_utils.get_etld1(url, **kwargs)[source]

Returns the eTLD+1 (aka PS+1) of the url.

Parameters
  • url (string) – The url from which to extract the eTLD+1 / PS+1

  • extractor (tldextract::TLDExtract, optional) – A tldextract::TLDExtract instance can be passed with the keyword extractor; otherwise one is created and updated automatically.

  • kwargs – The method preprocesses the url with stem_url before extracting the domain. You can pass in stem_url parameters if you wish to change the behavior in some specific way.

Returns

The eTLD+1 / PS+1 of the url passed in. If no eTLD+1 is detectable, an empty string will be returned. Returns an IP address if the hostname of the url is a valid IP address.

Return type

string
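
To make the return rules above concrete, here is a minimal standard-library sketch of eTLD+1 extraction. It is not the library's implementation: the name toy_etld1 and the three-entry TOY_SUFFIXES set are hypothetical stand-ins for tldextract and the real Public Suffix List, and the sketch assumes the url carries a scheme.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical toy stand-in for the Public Suffix List.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def toy_etld1(url):
    """Return the eTLD+1 of url, an IP address, or an empty string."""
    host = urlparse(url).hostname or ""
    try:
        # IP addresses are returned unchanged.
        ipaddress.ip_address(host)
        return host
    except ValueError:
        pass
    labels = host.split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            # Keep the suffix plus one label to its left.
            return ".".join(labels[i - 1:])
    return ""


print(toy_etld1("https://www.example.co.uk/page"))
print(toy_etld1("http://192.168.0.1:8080/"))
```

With this toy suffix set, a host such as localhost has no detectable eTLD+1 and yields an empty string.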

domain_utils.domain_utils.get_port(url, extractor=None)[source]

Given a URL, extract its port if present.

Parameters
  • url (string) – The URL from which to extract the port

  • extractor (tldextract::TLDExtract, optional) – A tldextract::TLDExtract instance can be passed with the keyword extractor; otherwise one is created and updated automatically.

Returns

Returns the port in the url. If no port is found, returns None.

Return type

int
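
The int-or-None contract above can be illustrated with urllib.parse alone. This is only a sketch: port_of is a hypothetical helper, and the library itself may handle scheme-less or malformed input differently.

```python
from urllib.parse import urlparse


def port_of(url):
    """Return the explicit port in url as an int, or None if absent."""
    # urlparse only fills netloc when the url has a scheme or starts
    # with "//", so prepend "//" for input such as "example.com:8080/x".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).port


print(port_of("https://example.com:8443/path"))
print(port_of("http://example.com/path"))
```

Note that a default port implied by the scheme (for example 443 for https) is not reported; only a port written in the url is.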

domain_utils.domain_utils.get_ps_plus_1(url, **kwargs)[source]

An alias for get_etld1.

domain_utils.domain_utils.get_scheme(url, no_scheme='no_scheme')[source]

Given a URL, extract its scheme.

Parameters
  • url (string) – The URL from which to extract the scheme

  • no_scheme (any) – The value to use if no scheme is detected. Default is 'no_scheme'.

Returns

Returns the scheme, or the value of no_scheme if no scheme is detected.

Return type

string
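
The fallback behaviour can be sketched with the standard library as follows. scheme_of is a hypothetical helper, not the library's code, and it deliberately ignores ambiguous input such as a bare host:port, which urlparse may interpret as a scheme.

```python
from urllib.parse import urlparse


def scheme_of(url, no_scheme="no_scheme"):
    """Return the url's scheme, or no_scheme when none is present."""
    scheme = urlparse(url).scheme
    # urlparse yields an empty string when it finds no scheme.
    return scheme if scheme else no_scheme


print(scheme_of("https://example.com/path"))
print(scheme_of("//example.com/path"))
print(scheme_of("example.com/path", no_scheme=None))
```

Because no_scheme accepts any value, callers can pass None or a sentinel of their own to tell a missing scheme apart from a real one.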

domain_utils.domain_utils.get_stripped_url(url, **kwargs)[source]

Alias for stem_url.

domain_utils.domain_utils.hostname_subparts(url, include_ps=False, **kwargs)[source]

Returns a list of slices of a url’s hostname down to the eTLD+1 / PS+1.

Parameters
  • url (string) – The url from which to extract the hostname parts

  • include_ps (boolean, optional) –

    If include_ps is set, the hostname slices will include the public suffix. For example, http://a.b.c.d.com/path?query#frag would yield:

    • ["a.b.c.d.com", "b.c.d.com", "c.d.com", "d.com"] if include_ps == False

    • ["a.b.c.d.com", "b.c.d.com", "c.d.com", "d.com", "com"] if include_ps == True

  • kwargs – All kwargs accepted by get_etld1 can also be passed to this method.

Returns

List of slices of a url’s hostname down to the eTLD+1 / PS+1.

Return type

list (string)
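
The slicing described above can be sketched in a few lines. This is an illustration rather than the library's implementation: hostname_slices is a hypothetical helper that takes the hostname and its public suffix as explicit arguments instead of deriving them from a url.

```python
def hostname_slices(hostname, public_suffix, include_ps=False):
    """Slice hostname down to its eTLD+1, optionally adding the suffix."""
    labels = hostname.split(".")
    suffix_len = len(public_suffix.split("."))
    # The eTLD+1 is the public suffix plus one more label to its left.
    last_start = len(labels) - (suffix_len + 1)
    slices = [".".join(labels[i:]) for i in range(last_start + 1)]
    if include_ps:
        slices.append(public_suffix)
    return slices


print(hostname_slices("a.b.c.d.com", "com"))
print(hostname_slices("a.b.c.d.com", "com", include_ps=True))
```

For a multi-label suffix such as co.uk, the slices stop at example.co.uk rather than continuing down to co.uk, unless include_ps is set.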

domain_utils.domain_utils.is_ip_address(hostname)[source]

Check if the given string is a valid IP address.
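
A check of this kind can be sketched with the standard ipaddress module. looks_like_ip is a hypothetical name, and the library's own check may differ in detail (for example in how it treats bracketed IPv6 literals).

```python
import ipaddress


def looks_like_ip(hostname):
    """Return True if hostname parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        # Raised for anything that is not a well-formed address.
        return False


print(looks_like_ip("192.168.0.1"))
print(looks_like_ip("example.com"))
```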

domain_utils.domain_utils.stem_url(url, return_unparsed=True, scheme_default='http', parse_ws=True, scheme=False, path=True, use_netloc=True, extractor=None)[source]

Returns a url stripped to just the beginning and end.

More formally it returns (scheme)?+(netloc|hostname)+(path)?.

For example, https://my.domain.net/a/path/to/a/file.html#anchor?a=1 becomes my.domain.net/a/path/to/a/file.html.

URL parsing is done using the standard library’s urllib.parse.urlparse.

A url is parsed if it has a qualifying scheme. The qualifying schemes are http, https, ws and wss. Websocket schemes can be omitted using the parse_ws parameter. Additionally, the scheme_default parameter provides a scheme where the url doesn’t contain one. The default is http, so urls without a scheme will, by default, be considered http and therefore parsed.

What is returned for unparsed urls is determined by the return_unparsed parameter.

Parameters
  • url (string) – The URL to be parsed

  • return_unparsed (boolean, optional) – Action to take if the url is not parsed, e.g. file: or about:blank. If False, the result for unparsed urls is an empty string. If True, the result is the original url, e.g. about:blank -> about:blank, even if scheme=False. See the method description to understand whether a URL is parsed or not. Default is True.

  • scheme_default (string, optional) – This parameter is passed to the scheme parameter of urllib.parse.urlparse, so urls without a scheme are treated as having this scheme. Default is http.

  • parse_ws (boolean, optional) – If True, then ws and wss urls are parsed. Default is True.

  • scheme (boolean, optional) – If True, scheme will be prepended in parsed result. Default is False.

  • path (boolean, optional) – If True, path will be included in parsed result. Default is True.

  • use_netloc (boolean, optional) – If True, urlparse’s netloc is used. If False, urlparse’s hostname is used. Using netloc means that a port, if present in the url, is included. Default is True.

  • extractor (tldextract::TLDExtract, optional) – A tldextract::TLDExtract instance can be passed with the keyword extractor; otherwise one is created and updated automatically.

Returns

Returns a url stripped to (scheme)?+(netloc|hostname)+(path)?. Returns empty string if appropriate.

Return type

string
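
The interaction of scheme, path, use_netloc and return_unparsed can be sketched with urllib.parse. This is a simplified stand-in, not the library's implementation: stem is a hypothetical name, the sketch assumes the url already carries a scheme, and it omits scheme_default, parse_ws and extractor.

```python
from urllib.parse import urlparse

# Schemes that qualify a url for parsing, per the description above.
PARSED_SCHEMES = {"http", "https", "ws", "wss"}


def stem(url, scheme=False, path=True, use_netloc=True, return_unparsed=True):
    """Strip url to (scheme)?+(netloc|hostname)+(path)?."""
    parts = urlparse(url)
    if parts.scheme not in PARSED_SCHEMES:
        # Unparsed urls come back unchanged or as an empty string.
        return url if return_unparsed else ""
    # netloc keeps an explicit port; hostname drops it.
    host = parts.netloc if use_netloc else (parts.hostname or "")
    result = host + (parts.path if path else "")
    if scheme:
        result = parts.scheme + "://" + result
    return result


print(stem("https://my.domain.net/a/path/to/a/file.html#anchor?a=1"))
print(stem("https://example.com:8080/x?q=1", use_netloc=False))
print(stem("about:blank", return_unparsed=False))
```

Query strings and fragments are dropped in every parsed case, since only the scheme, host and path components are ever reassembled.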