UserAgent:
Code (spiders excluded):
# cat top_10_useragent.py
#!/usr/bin/env python
# coding=utf-8

from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser
import heapq


class UserAgent(MRJob):
    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        # Emit (user agent, 1) for every line that parses successfully.
        self.nginx_line_parser.parse(line)
        field_item = self.nginx_line_parser.http_user_agent
        if field_item is not None:
            yield field_item, 1

    def reducer_sum(self, key, values):
        # Send every (count, key) pair to a single reducer under the key None.
        yield None, (sum(values), key)

    def reducer_top100(self, _, values):
        # Keep only the 10 pairs with the largest counts.
        for count, path in heapq.nlargest(10, values):
            yield count, path
        # for count, path in sorted(values, reverse=True)[:]:
        #     yield count, path

    def steps(self):
        return (
            MRStep(mapper=self.mapper, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top100),
        )


def main():
    UserAgent.run()


if __name__ == '__main__':
    main()
Results:
# python3 top_10_useragent.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_useragent.root.20161228.090725.308144
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_useragent.root.20161228.090725.308144/output...
85262	"IE"
79611	"Chrome"
48560	"Other"
10662	"Firefox"
7927	"Mobile Safari UI/WKWebView"
7182	"Sogou Explorer"
6681	"QQ Browser"
1988	"Mobile Safari"
1781	"Maxthon"
1404	"Edge"
Removing temp directory /tmp/top_10_useragent.root.20161228.090725.308144...
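The two-step job above (count per key, then keep the largest counts) can be checked locally without Hadoop or mrjob. The following is only a minimal sketch of the same idea in plain Python; the function name top_n and the sample list are made up for illustration and are not part of the original scripts.

```python
import heapq
from collections import Counter


def top_n(items, n=10):
    """Return the n most frequent items as (count, item) pairs, largest first."""
    # Step 1 (what mapper + reducer_sum do together): count each key.
    counts = Counter(items)
    # Step 2 (what reducer_top100 does): keep the n largest (count, key) pairs.
    # Tuples compare by count first, so nlargest orders them by frequency.
    return heapq.nlargest(n, ((count, key) for key, count in counts.items()))


# Hypothetical sample data standing in for parsed user agents.
sample = ["IE", "Chrome", "IE", "Firefox", "Chrome", "IE"]
print(top_n(sample, n=2))  # [(3, 'IE'), (2, 'Chrome')]
```

Putting the count first in each pair is what lets heapq.nlargest rank by frequency without a key function, which is also why reducer_sum yields (sum(values), key) rather than (key, sum(values)).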
Spider:
#!/usr/bin/env python
# coding=utf-8

from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser
import heapq


class Spider(MRJob):
    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        # Emit (user agent type, 1) for every line that parses successfully.
        self.nginx_line_parser.parse(line)
        field_item = self.nginx_line_parser.user_agent_type
        if field_item is not None:
            yield field_item, 1

    def reducer_sum(self, key, values):
        # Send every (count, key) pair to a single reducer under the key None.
        yield None, (sum(values), key)

    def reducer_top100(self, _, values):
        # Keep only the 10 pairs with the largest counts.
        for count, path in heapq.nlargest(10, values):
            yield count, path
        # for count, path in sorted(values, reverse=True)[:]:
        #     yield count, path

    def steps(self):
        return (
            MRStep(mapper=self.mapper, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top100),
        )


def main():
    Spider.run()


if __name__ == '__main__':
    main()
Execution Result:
# python3 top_10_spider.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_spider.root.20161228.091326.295972
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_spider.root.20161228.091326.295972/output...
33542	"Magpie-crawler"
25880	"Other"
16578	"Sogou web Spider"
6383	"Bingbot"
3688	"Baiduspider"
1487	"Yahoo! slurp"
1096	"Jikespider"
731	"Yisouspider"
648	"Baiduspider-image"
470	"Googlebot"
Removing temp directory /tmp/top_10_spider.root.20161228.091326.295972...
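Both jobs rely on the nginx_accesslog_parser module, which is not listed here. The following is a rough sketch of how a User-Agent string could be pulled out of a log line and flagged as a spider. It assumes the standard Nginx "combined" log format. The regular expression, the keyword list, the function names and the sample line are all hypothetical and are not the actual NginxLineParser implementation.

```python
import re

# Assumption: logs use Nginx's "combined" format, where the last two
# quoted fields are the referer and the user agent.
COMBINED_RE = re.compile(
    r'^(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

# Hypothetical keyword list; a real classifier would need a fuller one.
SPIDER_HINTS = ("bot", "spider", "crawler", "slurp")


def extract_user_agent(line):
    """Return the User-Agent field of a combined-format line, or None."""
    match = COMBINED_RE.match(line)
    return match.group("http_user_agent") if match else None


def is_spider(user_agent):
    """Guess whether a User-Agent string belongs to a crawler."""
    ua = user_agent.lower()
    return any(hint in ua for hint in SPIDER_HINTS)


# Made-up sample line for illustration only.
line = (
    '203.0.113.7 - - [27/Dec/2016:10:00:00 +0800] "GET / HTTP/1.1" 200 612 '
    '"-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)
user_agent = extract_user_agent(line)
print(user_agent, is_spider(user_agent))
```

Keyword matching like this is crude. It says nothing about how the counts above were produced, and it would miss crawlers that do not identify themselves in the User-Agent header.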
V. Analysis of the Nginx Access Log Based on Hadoop: UserAgent and Spider