Code:
# pv_day.py
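#!/usr/bin/env python
# coding=utf-8

from mrjob.job import MRJob
from nginx_accesslog_parser import NginxLineParser


class PvDay(MRJob):
    # One parser instance, reused for every log line
    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        # The input key is ignored; each log line arrives as the value
        self.nginx_line_parser.parse(line)
        # str(time_local) is "<date> <time>"; keep only the date part
        day, _ = str(self.nginx_line_parser.time_local).split()
        yield day, 1  # emit a count of 1 for this day

    def reducer(self, key, values):
        # Sum the counts for each day to get that day's PV
        yield key, sum(values)


def main():
    PvDay.run()


if __name__ == '__main__':
    main()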
Code Explanation:
The code defines a job class that inherits from the MRJob class; the class contains the methods that define the steps of the job.
A "step" consists of a mapper, a combiner, and a reducer. Each of them is optional, but a step must have at least one.
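The job above has no combiner, so its role may be easier to see in a small plain-Python sketch. This does not use mrjob: the two input splits and the function signatures are invented for illustration (in mrjob itself a combiner is a method that takes a key and its values, like the reducer). The idea is that a combiner pre-aggregates the mapper output of each split, so fewer pairs have to be passed on to the reducer.

```python
from collections import Counter


def mapper(day):
    # One (day, 1) pair per request
    yield day, 1


def combiner(pairs):
    # Sum the counts locally, within a single input split
    local = Counter()
    for day, count in pairs:
        local[day] += count
    return sorted(local.items())


def reducer(pairs):
    # Sum the partial counts coming from all splits
    total = Counter()
    for day, count in pairs:
        total[day] += count
    return dict(total)


# Two hypothetical input splits, each a list of request days
splits = [
    ["2016-12-26", "2016-12-26", "2016-12-27"],
    ["2016-12-27", "2016-12-27"],
]

mapped = [[pair for day in split for pair in mapper(day)] for split in splits]
combined = [combiner(pairs) for pairs in mapped]

# Number of pairs that would reach the reducer in each case
pairs_without_combiner = sum(len(pairs) for pairs in mapped)
pairs_with_combiner = sum(len(pairs) for pairs in combined)

result = reducer(pair for pairs in combined for pair in pairs)
```

The final totals are the same with or without the combiner; only the number of intermediate pairs changes.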
The mapper() method takes two parameters, a key and a value (in this case the key is ignored and each log line is the value), and yields key-value pairs.
The reducer() method takes a key and an iterator of values, and can also yield any number of key-value pairs (in this case it sums the values for each key, which gives the PV for each day).
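The flow from mapper to reducer can be imitated in plain Python, without mrjob or Hadoop. The sketch below is only an illustration: it assumes that time_local is a datetime.datetime (so that str() gives "<date> <time>"), and the sample timestamps and the run_local helper are made up for the example.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def mapper(time_local):
    # str(datetime) looks like "2016-12-27 08:30:00"; split() gives [date, time]
    day, _ = str(time_local).split()
    yield day, 1


def reducer(key, values):
    # values iterates over every 1 emitted for this day
    yield key, sum(values)


def run_local(timestamps):
    # Map phase: collect every (day, 1) pair
    pairs = [pair for ts in timestamps for pair in mapper(ts)]
    # Shuffle phase: sort so that equal keys are adjacent, then group them
    pairs.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        # Reduce phase: one reducer call per distinct key
        for out_key, total in reducer(key, (v for _, v in group)):
            results[out_key] = total
    return results


# Hypothetical request times, standing in for parsed log lines
sample = [
    datetime(2016, 12, 26, 23, 59, 1),
    datetime(2016, 12, 27, 0, 0, 5),
    datetime(2016, 12, 27, 8, 30, 0),
]
daily_pv = run_local(sample)
```

Sorting before grouping matters here because itertools.groupby only merges adjacent items with the same key.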
The job can be run in several ways:
Basic way:
# python3 pv_day.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/pv_day.root.20161228.022837.113256
Running step 1 of 1...
Streaming final output from /tmp/pv_day.root.20161228.022837.113256/output...
"2016-12-27"	47783
"2016-12-26"	299427
Removing temp directory /tmp/pv_day.root.20161228.022837.113256...
Standard input (stdin) mode; this way accepts only the first file:
# python3 pv_day.py < access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/pv_day.root.20161228.024431.884434
Running step 1 of 1...
reading from STDIN
Streaming final output from /tmp/pv_day.root.20161228.024431.884434/output...
"2016-12-27"	47783
"2016-12-26"	299427
Removing temp directory /tmp/pv_day.root.20161228.024431.884434...
Mixed mode (files and stdin together):
# python3 pv_day.py input1.txt input2.txt - < input3.txt
Distributed:
By default, mrjob runs the job in a single Python process. This is convenient for debugging, but it is not real distributed computing!
For distributed computing, select a different runner with the -r/--runner option. The available values are -r inline (the default), -r local, -r hadoop, and -r emr.
# python pv_day.py -r hadoop hdfs://my_home/input.txt
Part 2: Analysis of Nginx access logs based on Hadoop: calculating daily PV