Usage examples
This notebook demonstrates using domain_utils with pandas and PySpark.
It was created on Databricks and covers installing domain_utils
in a Databricks notebook and working with custom extractors there.
[2]:
dbutils.library.installPyPI('domain_utils')
dbutils.library.restartPython()
[3]:
import domain_utils as du
from tldextract import TLDExtract
from pathlib import Path
import tempfile
[4]:
path = "path to crawl data"
Make a custom extractor
[6]:
tmp_path = Path(tempfile.mkdtemp())
dbutils.fs.mkdirs(tmp_path.as_uri())
local_list_location = tmp_path / "list.txt"
dbutils.fs.ls(tmp_path.as_uri())
Out[3]: []
I could not find a way to make the following cell work with a local Databricks temp file, so I created a custom PSL (public suffix list) as a gist containing just one entry:
googlesyndication.com
This should be fine for most use cases; the need for a custom PSL should be rare anyway.
[8]:
http_loc = 'https://gist.githubusercontent.com/birdsarah/876ecbcaa5510fbcad65639ab7913edd/raw/cce905d186e0623e161af4f6730c2857a181373f/custom_psl_test.txt'
custom_extractor = TLDExtract(
    suffix_list_urls=[http_loc],
    cache_file=local_list_location.as_posix(),
    fallback_to_snapshot=False,
)
custom_extractor('foo.bar.googlesyndication.com')
Out[4]: ExtractResult(subdomain='foo', domain='bar', suffix='googlesyndication.com')
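The result above follows from the public-suffix rule: the suffix is the longest entry in the list that matches the end of the hostname, and the "PS+1" is that suffix plus one more label to its left. A minimal, stdlib-only sketch of the rule follows. `SUFFIXES` and `ps_plus_1` are hypothetical names used for illustration only; the logic ignores wildcard and exception rules and is not how tldextract or domain_utils is implemented.

```python
# Hypothetical one-entry suffix list, mirroring the custom PSL used above.
SUFFIXES = {"googlesyndication.com"}


def ps_plus_1(hostname, suffixes=SUFFIXES):
    """Return the longest matching suffix plus one label to its left.

    Simplified illustration: no wildcard or exception rules. If no suffix
    matches, the last label is returned, which mirrors the custom_ps1
    values shown in the tables later in this notebook.
    """
    labels = hostname.lower().strip(".").split(".")
    # Candidates get shorter as i grows, so the first hit is the longest suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # Keep one extra label if there is one to the left of the suffix.
            start = max(i - 1, 0)
            return ".".join(labels[start:])
    return labels[-1]


print(ps_plus_1("foo.bar.googlesyndication.com"))  # bar.googlesyndication.com
print(ps_plus_1("pagead2.googlesyndication.com"))  # pagead2.googlesyndication.com
print(ps_plus_1("moonliteco.in"))                  # in
```

With only one entry in the list, every hostname outside googlesyndication.com falls through to its last label, which is why a one-line PSL is only useful for targeted tests.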
Pandas
[10]:
# Make a pandas dataframe to apply methods on
df = spark.read.parquet('%s/visits/%s' % (path, 'javascript')).select('script_url', 'document_url')
df_p = df.drop_duplicates().limit(100_000).toPandas()
[11]:
df_p['script_url_ps1'] = df_p.script_url.apply(du.get_ps_plus_1)
df_p['script_url_stemmed'] = df_p.script_url.apply(du.get_stripped_url)
df_p['script_w_scheme'] = df_p.script_url.apply(du.get_stripped_url, scheme=True)
df_p['doc_w_scheme'] = df_p.document_url.apply(du.get_stripped_url, scheme=True)
df_p.head(10)
| | script_url | document_url | script_url_ps1 | script_url_stemmed | script_w_scheme | doc_w_scheme |
---|---|---|---|---|---|---|
0 | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | moonliteco.in | moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ |
1 | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | google-analytics.com | www.google-analytics.com/analytics.js | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ |
2 | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | hotelscombined.com | www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... |
3 | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | google.com | apis.google.com/_/scs/apps-static/_/js/k=oz.ga... | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html |
4 | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads... | googlesyndication.com | pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads |
5 | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | donews.com | www.donews.com/static/js/sdk/lib/JSSDK-home_1.... | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ |
6 | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | themeforest.net | themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup |
7 | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | gearbest.com | www.gearbest.com/promotion-Life-Essentials-Gad... | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... |
8 | https://www.googletagmanager.com/gtm.js?id=GTM... | https://fivethirtyeight.abcnews.go.com/video/e... | googletagmanager.com | www.googletagmanager.com/gtm.js | https://www.googletagmanager.com/gtm.js | https://fivethirtyeight.abcnews.go.com/video/e... |
9 | https://connect.facebook.net/ja_JP/sdk.js#xfbm... | https://gigazine.net/news/20190729-acecook-sup... | facebook.net | connect.facebook.net/ja_JP/sdk.js | https://connect.facebook.net/ja_JP/sdk.js | https://gigazine.net/news/20190729-acecook-sup... |
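Judging from the output above, `get_stripped_url` drops the query string and fragment, and drops the scheme unless `scheme=True` is passed. The stdlib sketch below approximates that behavior for the kinds of rows shown. `strip_url` is a hypothetical helper, not the domain_utils implementation, and it does not reproduce the library's other options or edge cases; the example.com URL is made up for the demonstration.

```python
from urllib.parse import urlsplit


def strip_url(url, scheme=False):
    """Approximate the stripped URLs in the table: no query, no fragment."""
    parts = urlsplit(url)
    stripped = parts.netloc + parts.path
    # Re-attach the scheme only when asked to and when the URL had one.
    if scheme and parts.scheme:
        stripped = f"{parts.scheme}://{stripped}"
    return stripped


print(strip_url("https://moonliteco.in/js/ion.sound.min.js"))
# moonliteco.in/js/ion.sound.min.js
print(strip_url("https://example.com/a/b.js?x=1#frag", scheme=True))
# https://example.com/a/b.js
```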
[12]:
df_p['custom_ps1'] = df_p.script_url.apply(du.get_ps_plus_1, extractor=custom_extractor)
df_p.head(10)
| | script_url | document_url | script_url_ps1 | script_url_stemmed | script_w_scheme | doc_w_scheme | custom_ps1 |
---|---|---|---|---|---|---|---|
0 | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | moonliteco.in | moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | in |
1 | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | google-analytics.com | www.google-analytics.com/analytics.js | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | com |
2 | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | hotelscombined.com | www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | com |
3 | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | google.com | apis.google.com/_/scs/apps-static/_/js/k=oz.ga... | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | com |
4 | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads... | googlesyndication.com | pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads | pagead2.googlesyndication.com |
5 | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | donews.com | www.donews.com/static/js/sdk/lib/JSSDK-home_1.... | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | com |
6 | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | themeforest.net | themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | net |
7 | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | gearbest.com | www.gearbest.com/promotion-Life-Essentials-Gad... | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | com |
8 | https://www.googletagmanager.com/gtm.js?id=GTM... | https://fivethirtyeight.abcnews.go.com/video/e... | googletagmanager.com | www.googletagmanager.com/gtm.js | https://www.googletagmanager.com/gtm.js | https://fivethirtyeight.abcnews.go.com/video/e... | com |
9 | https://connect.facebook.net/ja_JP/sdk.js#xfbm... | https://gigazine.net/news/20190729-acecook-sup... | facebook.net | connect.facebook.net/ja_JP/sdk.js | https://connect.facebook.net/ja_JP/sdk.js | https://gigazine.net/news/20190729-acecook-sup... | net |
Spark
[14]:
from pyspark.sql import functions as F, types as T

# This is the convoluted way I found to pass kwargs to a UDF
def get_stripped_url_udf(**function_kwargs):
    return F.udf(f=lambda x: du.get_stripped_url(x, **function_kwargs), returnType=T.StringType())

def get_ps_plus_1_udf(**function_kwargs):
    return F.udf(f=lambda x: du.get_ps_plus_1(x, **function_kwargs), returnType=T.StringType())
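The factory functions above bind keyword arguments by closing over them in a lambda. `functools.partial` is another common way to do the same binding. The sketch below shows the pattern on a plain Python function so that it runs without Spark. `strip_demo` is a hypothetical stand-in for `du.get_stripped_url`, and `bind_kwargs` is an illustrative name; whether the resulting partial can be passed straight to `F.udf` on a given Spark version is an assumption to verify, not something tested here.

```python
from functools import partial


def strip_demo(url, scheme=False):
    """Hypothetical stand-in for du.get_stripped_url (demo only)."""
    rest = url.split("://", 1)[-1]
    return url if scheme else rest


def bind_kwargs(func, **function_kwargs):
    """Return a one-argument callable with the keyword arguments pre-bound."""
    return partial(func, **function_kwargs)


with_scheme = bind_kwargs(strip_demo, scheme=True)
without_scheme = bind_kwargs(strip_demo)

print(with_scheme("https://example.com/a.js"))     # https://example.com/a.js
print(without_scheme("https://example.com/a.js"))  # example.com/a.js
```

Either way, the point is the same: the UDF receives a callable that takes only the column value, with every other option fixed at construction time.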
[15]:
df = spark.read.parquet('%s/visits/%s' % (path, 'javascript')).select('script_url', 'document_url').dropDuplicates()
[16]:
# These two calls are equivalent; they demonstrate passing the column with and without F.col
df = df.withColumn('script_url_stripped', get_stripped_url_udf()(F.col('script_url')))
df = df.withColumn('script_url_stripped_2', get_stripped_url_udf()('script_url'))
df.limit(5).toPandas()
| | script_url | document_url | script_url_stripped | script_url_stripped_2 |
---|---|---|---|---|
0 | https://stats.wp.com/w.js?60 | https://heavy.com/entertainment/2019/07/could-... | stats.wp.com/w.js | stats.wp.com/w.js |
1 | http://platform-api.sharethis.com/js/sharethis... | http://www.chasethetrend.com/category/stories/ | platform-api.sharethis.com/js/sharethis.js | platform-api.sharethis.com/js/sharethis.js |
2 | https://cdn.tinypass.com/api/tinypass.min.js | https://www.thedailybeast.com/category/us-news | cdn.tinypass.com/api/tinypass.min.js | cdn.tinypass.com/api/tinypass.min.js |
3 | https://vidstat.taboola.com/vpaid/units/23_7_1... | https://www.gazetaexpress.com/arbenita-ismajli... | vidstat.taboola.com/vpaid/units/23_7_1/infra/c... | vidstat.taboola.com/vpaid/units/23_7_1/infra/c... |
4 | https://pixel.yabidos.com/fltiu.js?qid=5373031... | https://www.gridoto.com/read/221801860/pengend... | pixel.yabidos.com/fltiu.js | pixel.yabidos.com/fltiu.js |
[17]:
custom_extractor('foo.googlesyndication.com')
Out[33]: ExtractResult(subdomain='', domain='foo', suffix='googlesyndication.com')
Because Spark's output here is non-deterministic (limit without an ordering returns arbitrary rows), we don't always get a googlesyndication row to exercise the custom entry, but the call above shows that the extractor is working.
[19]:
df = (
df
.withColumn('document_url_ps1', get_ps_plus_1_udf()(F.col('document_url')))
.withColumn('script_url_stripped_w_scheme', get_stripped_url_udf(scheme=True)(F.col('script_url')))
.withColumn('custom_ps1', get_ps_plus_1_udf(extractor=custom_extractor)(F.col('document_url')))
)
df.limit(10).toPandas()
| | script_url | document_url | script_url_stripped | script_url_stripped_2 | document_url_ps1 | script_url_stripped_w_scheme | custom_ps1 |
---|---|---|---|---|---|---|---|
0 | https://cdn.krxd.net/ctjs/controltag.js.05f9d0... | https://as.com/autor/diario_as/a/ | cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... | cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... | as.com | https://cdn.krxd.net/ctjs/controltag.js.05f9d0... | com |
1 | https://www.googletagservices.com/tag/js/gpt.js | https://www.storm.mg/reading-inspiration | www.googletagservices.com/tag/js/gpt.js | www.googletagservices.com/tag/js/gpt.js | storm.mg | https://www.googletagservices.com/tag/js/gpt.js | mg |
2 | https://bttrack.com/engagement/js?goalId=14072... | https://www.schwab.com/ | bttrack.com/engagement/js | bttrack.com/engagement/js | schwab.com | https://bttrack.com/engagement/js | com |
3 | https://connect.facebook.net/tr_TR/sdk.js#xfbm... | https://www.kizlarsoruyor.com/kisilik-karakter | connect.facebook.net/tr_TR/sdk.js | connect.facebook.net/tr_TR/sdk.js | kizlarsoruyor.com | https://connect.facebook.net/tr_TR/sdk.js | com |
4 | https://www.drtuber.com/signup | https://www.drtuber.com/signup | www.drtuber.com/signup | www.drtuber.com/signup | drtuber.com | https://www.drtuber.com/signup | com |
5 | https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... | https://www.salesforce.com/company/legal/sfdc-... | c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... | c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... | salesforce.com | https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... | com |
6 | https://d31qbv1cthcecs.cloudfront.net/atrk.js | https://www.brilio.net/gadget/last-seen-whatsa... | d31qbv1cthcecs.cloudfront.net/atrk.js | d31qbv1cthcecs.cloudfront.net/atrk.js | brilio.net | https://d31qbv1cthcecs.cloudfront.net/atrk.js | net |
7 | https://www.googleadservices.com/pagead/conver... | https://ejje.weblio.jp/category/academic/itery | www.googleadservices.com/pagead/conversion_asy... | www.googleadservices.com/pagead/conversion_asy... | weblio.jp | https://www.googleadservices.com/pagead/conver... | jp |
8 | https://g.alicdn.com/alilog/mlog/aplus_v2.js | https://food.tmall.com/ | g.alicdn.com/alilog/mlog/aplus_v2.js | g.alicdn.com/alilog/mlog/aplus_v2.js | tmall.com | https://g.alicdn.com/alilog/mlog/aplus_v2.js | com |
9 | https://assets.alicdn.com/g/security/umscript/... | https://g.alicdn.com/alilog/oneplus/blk.html#c... | assets.alicdn.com/g/security/umscript/2.1.4/um.js | assets.alicdn.com/g/security/umscript/2.1.4/um.js | alicdn.com | https://assets.alicdn.com/g/security/umscript/... | com |