API Docs
domain_utils.domain_utils module
-
domain_utils.domain_utils.get_etld1(url, **kwargs)
Returns the eTLD+1 (aka PS+1) of the url.
- Parameters
url (string) – The url from which to extract the eTLD+1 / PS+1
extractor (tldextract::TLDExtract, optional) – An (optional) tldextract::TLDExtract instance can be passed with keyword extractor, otherwise we create and update one automatically.
kwargs – The method preprocesses the url with stem_url before extracting the domain. You can pass in stem_url parameters if you wish to change the behavior in some specific way.
- Returns
The eTLD+1 / PS+1 of the url passed in. If no eTLD+1 is detectable, an empty string will be returned. Returns an IP address if the hostname of the url is a valid IP address.
- Return type
string
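To make the eTLD+1 idea concrete, here is a minimal, standard-library-only sketch. It is not the library's implementation: the real function relies on tldextract and the full Public Suffix List, whereas this sketch assumes a tiny hard-coded suffix set (TOY_SUFFIXES) and a hypothetical helper name (sketch_etld1), and it only handles urls that carry a scheme.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption: a toy stand-in for the Public Suffix List, for illustration only.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def sketch_etld1(url):
    """Return the public suffix plus one label, an IP address, or ''."""
    hostname = urlparse(url).hostname or ""

    # An IP address has no public suffix, so it is returned unchanged.
    try:
        ipaddress.ip_address(hostname)
        return hostname
    except ValueError:
        pass

    labels = hostname.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            if i == 0:
                return ""  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return ""  # no known suffix, so no eTLD+1 is detectable


print(sketch_etld1("https://a.b.example.co.uk/path"))
print(sketch_etld1("http://192.168.0.1:8080/"))
```

The empty-string and IP-address branches mirror the return behavior described above.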
-
domain_utils.domain_utils.get_port(url, extractor=None)
Given a url, extracts its port if present.
- Parameters
url (string) – The URL from which to extract the port
extractor (tldextract::TLDExtract, optional) – An (optional) tldextract::TLDExtract instance can be passed with keyword extractor, otherwise we create and update one automatically.
- Returns
Returns the port in the url. If no port is found, returns None.
- Return type
int
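The port lookup can be approximated with the standard library alone. The helper below (sketch_get_port, a hypothetical name) is only a sketch of the documented behavior, assuming the url carries a scheme; it ignores the extractor argument and may differ from what the library does internally.

```python
from urllib.parse import urlparse


def sketch_get_port(url):
    """Return the port in `url` as an int, or None if no port is present."""
    # urlparse only populates netloc (and therefore port) when the url has a
    # scheme or starts with "//", so this sketch assumes such urls.
    return urlparse(url).port


print(sketch_get_port("https://example.com:8443/path"))
print(sketch_get_port("https://example.com/path"))
```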
-
domain_utils.domain_utils.get_scheme(url, no_scheme='no_scheme')
Given a url, extracts its scheme.
- Parameters
url (string) – The URL from which to extract the scheme
no_scheme (any) – The value to use if no scheme is detected. Default is 'no_scheme'.
- Returns
Returns the scheme, or the value of no_scheme if no scheme is detected.
- Return type
string
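The fallback behavior of no_scheme is easiest to see in a small sketch. The helper below (sketch_get_scheme, a hypothetical name) uses only urllib.parse and assumes that "no scheme detected" means urlparse reports an empty scheme; the library's own detection rules may be stricter.

```python
from urllib.parse import urlparse


def sketch_get_scheme(url, no_scheme="no_scheme"):
    """Return the url's scheme, or `no_scheme` when none is present."""
    scheme = urlparse(url).scheme
    return scheme if scheme else no_scheme


print(sketch_get_scheme("https://example.com/"))
print(sketch_get_scheme("example.com/path"))
print(sketch_get_scheme("example.com/path", no_scheme=None))
```

Because no_scheme accepts any value, callers can pass None or a sentinel object rather than a string.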
-
domain_utils.domain_utils.hostname_subparts(url, include_ps=False, **kwargs)
Returns a list of slices of a url’s hostname down to the eTLD+1 / PS+1.
- Parameters
url (string) – The url from which to extract the hostname parts
include_ps (boolean, optional) – If include_ps is set, the hostname slices will include the public suffix. For example, http://a.b.c.d.com/path?query#frag would yield:
["a.b.c.d.com", "b.c.d.com", "c.d.com", "d.com"] if include_ps == False
["a.b.c.d.com", "b.c.d.com", "c.d.com", "d.com", "com"] if include_ps == True
kwargs – Additionally, all kwargs for get_etld1 can be passed to this method.
- Returns
List of slices of a url’s hostname down to the eTLD+1 / PS+1.
- Return type
list (string)
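The slicing itself can be sketched without any dependency if the hostname and its eTLD+1 are already known. The helper below (sketch_hostname_slices, a hypothetical name) takes both as arguments, which is an assumption made to keep the example self-contained; the library derives them from the url instead.

```python
def sketch_hostname_slices(hostname, etld1, include_ps=False):
    """Slice `hostname` label by label down to `etld1` (and optionally the suffix)."""
    if hostname != etld1 and not hostname.endswith("." + etld1):
        raise ValueError("etld1 must be a suffix of hostname on a label boundary")

    # Labels that sit in front of the eTLD+1, e.g. ["a", "b", "c"].
    prefix = hostname[: len(hostname) - len(etld1)].rstrip(".")
    labels = prefix.split(".") if prefix else []

    # Drop one leading label at a time until only the eTLD+1 remains.
    slices = [".".join(labels[i:] + [etld1]) for i in range(len(labels))]
    slices.append(etld1)

    # The public suffix is the eTLD+1 without its first label.
    if include_ps and "." in etld1:
        slices.append(etld1.split(".", 1)[1])
    return slices


print(sketch_hostname_slices("a.b.c.d.com", "d.com"))
print(sketch_hostname_slices("a.b.c.d.com", "d.com", include_ps=True))
```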
-
domain_utils.domain_utils.is_ip_address(hostname)
Checks if the given string is a valid IP address.
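One common way to perform this kind of check is with the standard library's ipaddress module. The helper below (sketch_is_ip_address, a hypothetical name) is a sketch under that assumption and is not necessarily how the library validates addresses.

```python
import ipaddress


def sketch_is_ip_address(hostname):
    """Return True if `hostname` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
    except ValueError:
        return False
    return True


print(sketch_is_ip_address("192.168.0.1"))
print(sketch_is_ip_address("example.com"))
```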
-
domain_utils.domain_utils.stem_url(url, return_unparsed=True, scheme_default='http', parse_ws=True, scheme=False, path=True, use_netloc=True, extractor=None)
Returns a url stripped to just the beginning and end.
More formally, it returns (scheme)?+(netloc|hostname)+(path)?.
For example, https://my.domain.net/a/path/to/a/file.html#anchor?a=1 becomes my.domain.net/a/path/to/a/file.html
URL parsing is done using the standard library's urllib.parse.urlparse.
A url is parsed if it has a qualifying scheme. The qualifying schemes are http, https, ws and wss. Websocket schemes can be excluded using the parse_ws parameter. Additionally, the scheme_default parameter provides a scheme where the url doesn’t contain one. The default is http, so urls without a scheme will, by default, be treated as http and therefore parsed.
What is returned for unparsed urls is determined by the return_unparsed parameter.
- Parameters
url (string) – The URL to be parsed
return_unparsed (boolean, optional) – Action to take if the url is not parsed, e.g. file: or about:blank. If False, the result for non-parsed urls will be an empty string. If True, the result will be the original url, e.g. about:blank -> about:blank, even if scheme=False. See the method description to understand whether a URL is parsed or not. Default is True.
scheme_default (string, optional) – This parameter is passed to the scheme parameter of urllib.parse.urlparse. This causes urls without a scheme to be given the default scheme. Default is http.
parse_ws (boolean, optional) – If True, then ws and wss urls are parsed. Default is True.
scheme (boolean, optional) – If True, the scheme will be prepended to the parsed result. Default is False.
path (boolean, optional) – If True, the path will be included in the parsed result. Default is True.
use_netloc (boolean, optional) – If True, urlparse’s netloc will be used. If False, urlparse’s hostname will be used. Using netloc means that a port, if present in the url, is included. Default is True.
extractor (tldextract::TLDExtract, optional) – An (optional) tldextract::TLDExtract instance can be passed with keyword extractor, otherwise we create and update one automatically.
- Returns
Returns a url stripped to (scheme)?+(netloc|hostname)+(path)?. Returns empty string if appropriate.
- Return type
string
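The interplay of the flags is easier to follow in code. The sketch below (sketch_stem_url, a hypothetical name) reimplements the documented rules with urllib.parse only. It omits the extractor argument, assumes the input url carries an explicit scheme, and is meant as an illustration of the description above rather than the library's implementation.

```python
from urllib.parse import urlparse


def sketch_stem_url(url, return_unparsed=True, scheme_default="http",
                    parse_ws=True, scheme=False, path=True, use_netloc=True):
    """Strip `url` to (scheme)?+(netloc|hostname)+(path)?."""
    # Only urls with a qualifying scheme are parsed.
    qualifying = {"http", "https"}
    if parse_ws:
        qualifying |= {"ws", "wss"}

    parsed = urlparse(url, scheme=scheme_default)
    if parsed.scheme not in qualifying:
        # Unparsed urls are returned untouched or replaced by "".
        return url if return_unparsed else ""

    # netloc keeps the port; hostname drops it.
    result = parsed.netloc if use_netloc else (parsed.hostname or "")
    if path:
        result += parsed.path
    if scheme:
        result = parsed.scheme + "://" + result
    return result


print(sketch_stem_url("https://my.domain.net/a/path/to/a/file.html#anchor?a=1"))
print(sketch_stem_url("about:blank"))
print(sketch_stem_url("http://example.com:8080/a", use_netloc=False))
```

Query strings and fragments never appear in the result because only the scheme, host and path components are reassembled.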