{"cells":[{"cell_type":"markdown","source":["# Usage examples\n\nThis notebook demonstrates running ``domain_utils`` with [pandas](https://pandas.pydata.org) and [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html).\n\nIt was created on Databricks and covers installing ``domain_utils`` into a Databricks notebook and working with custom extractors there."],"metadata":{}},{"cell_type":"code","source":["dbutils.library.installPyPI('domain_utils')\n\ndbutils.library.restartPython()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":2},{"cell_type":"code","source":["import domain_utils as du\nfrom tldextract import TLDExtract\nfrom pathlib import Path\nimport tempfile"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["\n"]}}],"execution_count":3},{"cell_type":"code","source":["path = \"path to crawl data\""],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["\n"]}}],"execution_count":4},{"cell_type":"markdown","source":["## Make a custom extractor"],"metadata":{}},{"cell_type":"code","source":["tmp_path = Path(tempfile.mkdtemp())\ndbutils.fs.mkdirs(tmp_path.as_uri())\nlocal_list_location = tmp_path / \"list.txt\"\ndbutils.fs.ls(tmp_path.as_uri())"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["\n\n | script_url | \ndocument_url | \nscript_url_ps1 | \nscript_url_stemmed | \nscript_w_scheme | \ndoc_w_scheme | \n
|---|---|---|---|---|---|---|
| 0 | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | moonliteco.in | moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ |
| 1 | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | google-analytics.com | www.google-analytics.com/analytics.js | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ |
| 2 | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | hotelscombined.com | www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... |
| 3 | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | google.com | apis.google.com/_/scs/apps-static/_/js/k=oz.ga... | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html |
| 4 | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads... | googlesyndication.com | pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads |
| 5 | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | donews.com | www.donews.com/static/js/sdk/lib/JSSDK-home_1.... | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ |
| 6 | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | themeforest.net | themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup |
| 7 | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | gearbest.com | www.gearbest.com/promotion-Life-Essentials-Gad... | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... |
| 8 | https://www.googletagmanager.com/gtm.js?id=GTM... | https://fivethirtyeight.abcnews.go.com/video/e... | googletagmanager.com | www.googletagmanager.com/gtm.js | https://www.googletagmanager.com/gtm.js | https://fivethirtyeight.abcnews.go.com/video/e... |
| 9 | https://connect.facebook.net/ja_JP/sdk.js#xfbm... | https://gigazine.net/news/20190729-acecook-sup... | facebook.net | connect.facebook.net/ja_JP/sdk.js | https://connect.facebook.net/ja_JP/sdk.js | https://gigazine.net/news/20190729-acecook-sup... |

| | script_url | document_url | script_url_ps1 | script_url_stemmed | script_w_scheme | doc_w_scheme | custom_ps1 |
|---|---|---|---|---|---|---|---|
| 0 | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | moonliteco.in | moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/js/ion.sound.min.js | https://moonliteco.in/ | in |
| 1 | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | google-analytics.com | www.google-analytics.com/analytics.js | https://www.google-analytics.com/analytics.js | https://moonliteco.in/ | com |
| 2 | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | hotelscombined.com | www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/QUkd4lO9/init.js | https://www.hotelscombined.com/TrafficInspecti... | com |
| 3 | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | google.com | apis.google.com/_/scs/apps-static/_/js/k=oz.ga... | https://apis.google.com/_/scs/apps-static/_/js... | https://www.i-gamer.net/mobile/site/3293.html | com |
| 4 | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads... | googlesyndication.com | pagead2.googlesyndication.com/bg/o1Putv1UN_aI0... | https://pagead2.googlesyndication.com/bg/o1Put... | https://googleads.g.doubleclick.net/pagead/ads | pagead2.googlesyndication.com |
| 5 | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | donews.com | www.donews.com/static/js/sdk/lib/JSSDK-home_1.... | http://www.donews.com/static/js/sdk/lib/JSSDK-... | http://www.donews.com/ | com |
| 6 | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | themeforest.net | themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | https://themeforest.net/user/muffingroup | net |
| 7 | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | gearbest.com | www.gearbest.com/promotion-Life-Essentials-Gad... | https://www.gearbest.com/promotion-Life-Essent... | https://www.gearbest.com/promotion-Life-Essent... | com |
| 8 | https://www.googletagmanager.com/gtm.js?id=GTM... | https://fivethirtyeight.abcnews.go.com/video/e... | googletagmanager.com | www.googletagmanager.com/gtm.js | https://www.googletagmanager.com/gtm.js | https://fivethirtyeight.abcnews.go.com/video/e... | com |
| 9 | https://connect.facebook.net/ja_JP/sdk.js#xfbm... | https://gigazine.net/news/20190729-acecook-sup... | facebook.net | connect.facebook.net/ja_JP/sdk.js | https://connect.facebook.net/ja_JP/sdk.js | https://gigazine.net/news/20190729-acecook-sup... | net |

| | script_url | document_url | script_url_stripped | script_url_stripped_2 |
|---|---|---|---|---|
| 0 | https://stats.wp.com/w.js?60 | https://heavy.com/entertainment/2019/07/could-... | stats.wp.com/w.js | stats.wp.com/w.js |
| 1 | http://platform-api.sharethis.com/js/sharethis... | http://www.chasethetrend.com/category/stories/ | platform-api.sharethis.com/js/sharethis.js | platform-api.sharethis.com/js/sharethis.js |
| 2 | https://cdn.tinypass.com/api/tinypass.min.js | https://www.thedailybeast.com/category/us-news | cdn.tinypass.com/api/tinypass.min.js | cdn.tinypass.com/api/tinypass.min.js |
| 3 | https://vidstat.taboola.com/vpaid/units/23_7_1... | https://www.gazetaexpress.com/arbenita-ismajli... | vidstat.taboola.com/vpaid/units/23_7_1/infra/c... | vidstat.taboola.com/vpaid/units/23_7_1/infra/c... |
| 4 | https://pixel.yabidos.com/fltiu.js?qid=5373031... | https://www.gridoto.com/read/221801860/pengend... | pixel.yabidos.com/fltiu.js | pixel.yabidos.com/fltiu.js |

| | script_url | document_url | script_url_stripped | script_url_stripped_2 | document_url_ps1 | script_url_stripped_w_scheme | custom_ps1 |
|---|---|---|---|---|---|---|---|
| 0 | https://cdn.krxd.net/ctjs/controltag.js.05f9d0... | https://as.com/autor/diario_as/a/ | cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... | cdn.krxd.net/ctjs/controltag.js.05f9d0dad02f8a... | as.com | https://cdn.krxd.net/ctjs/controltag.js.05f9d0... | com |
| 1 | https://www.googletagservices.com/tag/js/gpt.js | https://www.storm.mg/reading-inspiration | www.googletagservices.com/tag/js/gpt.js | www.googletagservices.com/tag/js/gpt.js | storm.mg | https://www.googletagservices.com/tag/js/gpt.js | mg |
| 2 | https://bttrack.com/engagement/js?goalId=14072... | https://www.schwab.com/ | bttrack.com/engagement/js | bttrack.com/engagement/js | schwab.com | https://bttrack.com/engagement/js | com |
| 3 | https://connect.facebook.net/tr_TR/sdk.js#xfbm... | https://www.kizlarsoruyor.com/kisilik-karakter | connect.facebook.net/tr_TR/sdk.js | connect.facebook.net/tr_TR/sdk.js | kizlarsoruyor.com | https://connect.facebook.net/tr_TR/sdk.js | com |
| 4 | https://www.drtuber.com/signup | https://www.drtuber.com/signup | www.drtuber.com/signup | www.drtuber.com/signup | drtuber.com | https://www.drtuber.com/signup | com |
| 5 | https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... | https://www.salesforce.com/company/legal/sfdc-... | c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... | c1.sfdcstatic.com/etc/clientlibs/sfdc-aem-mast... | salesforce.com | https://c1.sfdcstatic.com/etc/clientlibs/sfdc-... | com |
| 6 | https://d31qbv1cthcecs.cloudfront.net/atrk.js | https://www.brilio.net/gadget/last-seen-whatsa... | d31qbv1cthcecs.cloudfront.net/atrk.js | d31qbv1cthcecs.cloudfront.net/atrk.js | brilio.net | https://d31qbv1cthcecs.cloudfront.net/atrk.js | net |
| 7 | https://www.googleadservices.com/pagead/conver... | https://ejje.weblio.jp/category/academic/itery | www.googleadservices.com/pagead/conversion_asy... | www.googleadservices.com/pagead/conversion_asy... | weblio.jp | https://www.googleadservices.com/pagead/conver... | jp |
| 8 | https://g.alicdn.com/alilog/mlog/aplus_v2.js | https://food.tmall.com/ | g.alicdn.com/alilog/mlog/aplus_v2.js | g.alicdn.com/alilog/mlog/aplus_v2.js | tmall.com | https://g.alicdn.com/alilog/mlog/aplus_v2.js | com |
| 9 | https://assets.alicdn.com/g/security/umscript/... | https://g.alicdn.com/alilog/oneplus/blk.html#c... | assets.alicdn.com/g/security/umscript/2.1.4/um.js | assets.alicdn.com/g/security/umscript/2.1.4/um.js | alicdn.com | https://assets.alicdn.com/g/security/umscript/... | com |
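
The `script_url_stripped` and `script_url_stripped_w_scheme` columns above appear to hold each URL with its query string and fragment removed, without and with the scheme. The sketch below reproduces that transformation using only the standard library so the columns are easier to read. It is not the `domain_utils` implementation: `strip_url` and its `keep_scheme` parameter are hypothetical names chosen for this illustration, and edge cases such as ports or credentials in the URL may be handled differently by the library.

```python
from urllib.parse import urlsplit


def strip_url(url: str, keep_scheme: bool = False) -> str:
    """Hypothetical stand-in: drop the query string and fragment from a URL.

    The scheme is dropped as well unless keep_scheme is True.
    """
    parts = urlsplit(url)
    # netloc is the host (plus any port or credentials); path excludes ?query and #fragment
    stripped = parts.netloc + parts.path
    if keep_scheme and parts.scheme:
        stripped = f"{parts.scheme}://{stripped}"
    return stripped


# URL taken from the script_url column of the tables above
print(strip_url("https://stats.wp.com/w.js?60"))
print(strip_url("https://stats.wp.com/w.js?60", keep_scheme=True))
```

With pandas, a helper like this could be applied per row with `Series.apply`. With PySpark it would need to be wrapped in a UDF first.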