Usage examples

This notebook demonstrates using domain_utils with pandas and PySpark.

It was created on Databricks and covers installing domain_utils in a Databricks notebook and working with custom extractors there.

[2]:
dbutils.library.installPyPI('domain_utils')

dbutils.library.restartPython()
[3]:
import domain_utils as du
from tldextract import TLDExtract
from pathlib import Path
import tempfile
[4]:
path = "path to crawl data"

Make a custom extractor

[6]:
tmp_path = Path(tempfile.mkdtemp())
dbutils.fs.mkdirs(tmp_path.as_uri())
local_list_location = tmp_path / "list.txt"
dbutils.fs.ls(tmp_path.as_uri())
Out[3]: []
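
The cell above passes the temp directory to dbutils.fs as a URI, while the extractor cell further down passes the cache file as a plain path string. A minimal standard-library sketch of the two spellings follows; it assumes nothing about Databricks itself, and the printed values vary per run because mkdtemp picks a random directory name.

```python
import tempfile
from pathlib import Path

# Same construction as in the notebook cell above.
tmp_path = Path(tempfile.mkdtemp())
local_list_location = tmp_path / "list.txt"

# URI form: scheme-qualified ("file://..."), the form handed to dbutils.fs above.
uri_form = tmp_path.as_uri()
# POSIX form: a plain path string, the form handed to cache_file further down.
posix_form = local_list_location.as_posix()

print(uri_form)
print(posix_form)
```

Note that creating local_list_location as a Path does not create the file; only the directory exists at this point.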

I could not find a way to make the following cell read the suffix list from a local Databricks temp file, so I published a custom PSL (Public Suffix List) as a gist instead. It contains a single entry:

googlesyndication.com

This should be fine for most use cases, since the need for a custom PSL should be pretty rare anyway.
[8]:
http_loc = 'https://gist.githubusercontent.com/birdsarah/876ecbcaa5510fbcad65639ab7913edd/raw/cce905d186e0623e161af4f6730c2857a181373f/custom_psl_test.txt'
custom_extractor = TLDExtract(
    suffix_list_urls=[http_loc, ],
    cache_file=local_list_location.as_posix(),
    fallback_to_snapshot=False
)
custom_extractor('foo.bar.googlesyndication.com')
Out[4]: ExtractResult(subdomain='foo', domain='bar', suffix='googlesyndication.com')
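
To see what a one-entry suffix list implies for the results later in this notebook, here is a toy longest-suffix matcher. It is a sketch for illustration only and is not how tldextract or domain_utils is implemented; the names `CUSTOM_SUFFIXES` and `toy_ps_plus_1` are hypothetical. Its fallback to the last label when nothing matches is an assumption chosen to agree with the custom_ps1 values shown in the tables below.

```python
# Toy longest-suffix matcher; NOT tldextract's implementation.
CUSTOM_SUFFIXES = {"googlesyndication.com"}  # the single entry in the custom list


def toy_ps_plus_1(hostname, suffixes=CUSTOM_SUFFIXES):
    """Return the matched suffix plus the one label to its left.

    If no suffix in the list matches, fall back to the last label alone
    (an assumption that agrees with the 'in' / 'com' values in custom_ps1).
    """
    labels = hostname.lower().split(".")
    # Try the longest candidate suffix first, always leaving one label for the domain.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return labels[i - 1] + "." + candidate
    return labels[-1]


print(toy_ps_plus_1("foo.bar.googlesyndication.com"))  # bar.googlesyndication.com
print(toy_ps_plus_1("pagead2.googlesyndication.com"))  # pagead2.googlesyndication.com
print(toy_ps_plus_1("moonliteco.in"))                  # in
```

With only one suffix in the list, every host outside googlesyndication.com collapses to its final label, which is why a custom list like this is only useful for testing.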

Pandas

[10]:
# Make a pandas dataframe to apply methods on
df = spark.read.parquet('%s/visits/%s' % (path, 'javascript')).select('script_url', 'document_url')
df_p = df.drop_duplicates().limit(100_000).toPandas()
[11]:
df_p['script_url_ps1'] = df_p.script_url.apply(du.get_ps_plus_1)
df_p['script_url_stemmed'] = df_p.script_url.apply(du.get_stripped_url)
df_p['script_w_scheme'] = df_p.script_url.apply(du.get_stripped_url, scheme=True)
df_p['doc_w_scheme'] = df_p.document_url.apply(du.get_stripped_url, scheme=True)
df_p.head(10)
script_url document_url script_url_ps1 script_url_stemmed script_w_scheme doc_w_scheme
0 https://moonliteco.in/js/ion.sound.min.js https://moonliteco.in/ moonliteco.in moonliteco.in/js/ion.sound.min.js https://moonliteco.in/js/ion.sound.min.js https://moonliteco.in/
1 https://www.google-analytics.com/analytics.js https://moonliteco.in/ google-analytics.com www.google-analytics.com/analytics.js https://www.google-analytics.com/analytics.js https://moonliteco.in/
2 https://www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/TrafficInspecti... hotelscombined.com www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/TrafficInspecti...
3 https://apis.google.com/_/scs/apps-static/_/js... https://www.i-gamer.net/mobile/site/3293.html google.com apis.google.com/_/scs/apps-static/_/js/k=oz.ga... https://apis.google.com/_/scs/apps-static/_/js... https://www.i-gamer.net/mobile/site/3293.html
4 https://pagead2.googlesyndication.com/bg/o1Put... https://googleads.g.doubleclick.net/pagead/ads... googlesyndication.com pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... https://pagead2.googlesyndication.com/bg/o1Put... https://googleads.g.doubleclick.net/pagead/ads
5 http://www.donews.com/static/js/sdk/lib/JSSDK-... http://www.donews.com/ donews.com www.donews.com/static/js/sdk/lib/JSSDK-home_1.... http://www.donews.com/static/js/sdk/lib/JSSDK-... http://www.donews.com/
6 https://themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup themeforest.net themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup
7 https://www.gearbest.com/promotion-Life-Essent... https://www.gearbest.com/promotion-Life-Essent... gearbest.com www.gearbest.com/promotion-Life-Essentials-Gad... https://www.gearbest.com/promotion-Life-Essent... https://www.gearbest.com/promotion-Life-Essent...
8 https://www.googletagmanager.com/gtm.js?id=GTM... https://fivethirtyeight.abcnews.go.com/video/e... googletagmanager.com www.googletagmanager.com/gtm.js https://www.googletagmanager.com/gtm.js https://fivethirtyeight.abcnews.go.com/video/e...
9 https://connect.facebook.net/ja_JP/sdk.js#xfbm... https://gigazine.net/news/20190729-acecook-sup... facebook.net connect.facebook.net/ja_JP/sdk.js https://connect.facebook.net/ja_JP/sdk.js https://gigazine.net/news/20190729-acecook-sup...
[12]:
df_p['custom_ps1'] = df_p.script_url.apply(du.get_ps_plus_1, extractor=custom_extractor)
df_p.head(10)
script_url document_url script_url_ps1 script_url_stemmed script_w_scheme doc_w_scheme custom_ps1
0 https://moonliteco.in/js/ion.sound.min.js https://moonliteco.in/ moonliteco.in moonliteco.in/js/ion.sound.min.js https://moonliteco.in/js/ion.sound.min.js https://moonliteco.in/ in
1 https://www.google-analytics.com/analytics.js https://moonliteco.in/ google-analytics.com www.google-analytics.com/analytics.js https://www.google-analytics.com/analytics.js https://moonliteco.in/ com
2 https://www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/TrafficInspecti... hotelscombined.com www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/QUkd4lO9/init.js https://www.hotelscombined.com/TrafficInspecti... com
3 https://apis.google.com/_/scs/apps-static/_/js... https://www.i-gamer.net/mobile/site/3293.html google.com apis.google.com/_/scs/apps-static/_/js/k=oz.ga... https://apis.google.com/_/scs/apps-static/_/js... https://www.i-gamer.net/mobile/site/3293.html com
4 https://pagead2.googlesyndication.com/bg/o1Put... https://googleads.g.doubleclick.net/pagead/ads... googlesyndication.com pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... https://pagead2.googlesyndication.com/bg/o1Put... https://googleads.g.doubleclick.net/pagead/ads pagead2.googlesyndication.com
5 http://www.donews.com/static/js/sdk/lib/JSSDK-... http://www.donews.com/ donews.com www.donews.com/static/js/sdk/lib/JSSDK-home_1.... http://www.donews.com/static/js/sdk/lib/JSSDK-... http://www.donews.com/ com
6 https://themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup themeforest.net themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup https://themeforest.net/user/muffingroup net
7 https://www.gearbest.com/promotion-Life-Essent... https://www.gearbest.com/promotion-Life-Essent... gearbest.com www.gearbest.com/promotion-Life-Essentials-Gad... https://www.gearbest.com/promotion-Life-Essent... https://www.gearbest.com/promotion-Life-Essent... com
8 https://www.googletagmanager.com/gtm.js?id=GTM... https://fivethirtyeight.abcnews.go.com/video/e... googletagmanager.com www.googletagmanager.com/gtm.js https://www.googletagmanager.com/gtm.js https://fivethirtyeight.abcnews.go.com/video/e... com
9 https://connect.facebook.net/ja_JP/sdk.js#xfbm... https://gigazine.net/news/20190729-acecook-sup... facebook.net connect.facebook.net/ja_JP/sdk.js https://connect.facebook.net/ja_JP/sdk.js https://gigazine.net/news/20190729-acecook-sup... net

Spark

[14]:
from pyspark.sql import functions as F, types as T

# A somewhat convoluted way to pass kwargs to a UDF: each factory closes over the kwargs and returns a one-argument UDF

def get_stripped_url_udf(**function_kwargs):
  return F.udf(f=lambda x: du.get_stripped_url(x, **function_kwargs), returnType=T.StringType())

def get_ps_plus_1_udf(**function_kwargs):
  return F.udf(f=lambda x: du.get_ps_plus_1(x, **function_kwargs), returnType=T.StringType())
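
The factories above bind the keyword arguments up front, so Spark only ever sees a function of one column. The same binding can be sketched without Spark. In this sketch `strip_url` is a hypothetical stand-in written for illustration (it is not the domain_utils implementation), `bind_kwargs` is a hypothetical helper name, and `functools.partial` is used in place of the lambda.

```python
from functools import partial
from urllib.parse import urlsplit


def strip_url(url, scheme=False):
    """Hypothetical stand-in for du.get_stripped_url: drop query and fragment."""
    parts = urlsplit(url)
    stripped = parts.netloc + parts.path
    return parts.scheme + "://" + stripped if scheme else stripped


def bind_kwargs(func, **function_kwargs):
    """Return a one-argument callable with the keyword arguments pre-bound."""
    return partial(func, **function_kwargs)


with_scheme = bind_kwargs(strip_url, scheme=True)
without_scheme = bind_kwargs(strip_url)

print(with_scheme("https://stats.wp.com/w.js?60"))     # https://stats.wp.com/w.js
print(without_scheme("https://stats.wp.com/w.js?60"))  # stats.wp.com/w.js
```

In the notebook, the bound callable is what gets wrapped by F.udf with a StringType return type. Whether F.udf accepts a partial object directly may depend on the PySpark version, so the lambda form used above is the more conservative choice.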
[15]:
df = spark.read.parquet('%s/visits/%s' % (path, 'javascript')).select('script_url', 'document_url').dropDuplicates()
[16]:
# These two calls are equivalent; they demonstrate calling the UDF with and without F.col
df = df.withColumn('script_url_stripped', get_stripped_url_udf()(F.col('script_url')))
df = df.withColumn('script_url_stripped_2', get_stripped_url_udf()('script_url'))
df.limit(5).toPandas()
script_url document_url script_url_stripped script_url_stripped_2
0 https://stats.wp.com/w.js?60 https://heavy.com/entertainment/2019/07/could-... stats.wp.com/w.js stats.wp.com/w.js
1 http://platform-api.sharethis.com/js/sharethis... http://www.chasethetrend.com/category/stories/ platform-api.sharethis.com/js/sharethis.js platform-api.sharethis.com/js/sharethis.js
2 https://cdn.tinypass.com/api/tinypass.min.js https://www.thedailybeast.com/category/us-news cdn.tinypass.com/api/tinypass.min.js cdn.tinypass.com/api/tinypass.min.js
3 https://vidstat.taboola.com/vpaid/units/23_7_1... https://www.gazetaexpress.com/arbenita-ismajli... vidstat.taboola.com/vpaid/units/23_7_1/infra/c... vidstat.taboola.com/vpaid/units/23_7_1/infra/c...
4 https://pixel.yabidos.com/fltiu.js?qid=5373031... https://www.gridoto.com/read/221801860/pengend... pixel.yabidos.com/fltiu.js pixel.yabidos.com/fltiu.js
[17]:
custom_extractor('foo.googlesyndication.com')
Out[33]: ExtractResult(subdomain='', domain='foo', suffix='googlesyndication.com')

Because Spark is non-deterministic about which rows limit returns, we don’t always get back a row that exercises the googlesyndication entry, but the custom_ps1 column below shows that the custom extractor is being applied.

[19]:
df = (
  df
  .withColumn('document_url_ps1', get_ps_plus_1_udf()(F.col('document_url')))
  .withColumn('script_url_stripped_w_scheme', get_stripped_url_udf(scheme=True)(F.col('script_url')))
  .withColumn('custom_ps1', get_ps_plus_1_udf(extractor=custom_extractor)(F.col('document_url')))
)
df.limit(10).toPandas()
script_url document_url script_url_stripped script_url_stripped_2 document_url_ps1 script_url_stripped_w_scheme custom_ps1
0 https://cdn.krxd.net/ctjs/controltag.js.05f9d0... https://as.com/autor/diario_as/a/ cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... as.com https://cdn.krxd.net/ctjs/controltag.js.05f9d0... com
1 https://www.googletagservices.com/tag/js/gpt.js https://www.storm.mg/reading-inspiration www.googletagservices.com/tag/js/gpt.js www.googletagservices.com/tag/js/gpt.js storm.mg https://www.googletagservices.com/tag/js/gpt.js mg
2 https://bttrack.com/engagement/js?goalId=14072... https://www.schwab.com/ bttrack.com/engagement/js bttrack.com/engagement/js schwab.com https://bttrack.com/engagement/js com
3 https://connect.facebook.net/tr_TR/sdk.js#xfbm... https://www.kizlarsoruyor.com/kisilik-karakter connect.facebook.net/tr_TR/sdk.js connect.facebook.net/tr_TR/sdk.js kizlarsoruyor.com https://connect.facebook.net/tr_TR/sdk.js com
4 https://www.drtuber.com/signup https://www.drtuber.com/signup www.drtuber.com/signup www.drtuber.com/signup drtuber.com https://www.drtuber.com/signup com
5 https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... https://www.salesforce.com/company/legal/sfdc-... c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... salesforce.com https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... com
6 https://d31qbv1cthcecs.cloudfront.net/atrk.js https://www.brilio.net/gadget/last-seen-whatsa... d31qbv1cthcecs.cloudfront.net/atrk.js d31qbv1cthcecs.cloudfront.net/atrk.js brilio.net https://d31qbv1cthcecs.cloudfront.net/atrk.js net
7 https://www.googleadservices.com/pagead/conver... https://ejje.weblio.jp/category/academic/itery www.googleadservices.com/pagead/conversion_asy... www.googleadservices.com/pagead/conversion_asy... weblio.jp https://www.googleadservices.com/pagead/conver... jp
8 https://g.alicdn.com/alilog/mlog/aplus_v2.js https://food.tmall.com/ g.alicdn.com/alilog/mlog/aplus_v2.js g.alicdn.com/alilog/mlog/aplus_v2.js tmall.com https://g.alicdn.com/alilog/mlog/aplus_v2.js com
9 https://assets.alicdn.com/g/security/umscript/... https://g.alicdn.com/alilog/oneplus/blk.html#c... assets.alicdn.com/g/security/umscript/2.1.4/um.js assets.alicdn.com/g/security/umscript/2.1.4/um.js alicdn.com https://assets.alicdn.com/g/security/umscript/... com