Part 2: Analyzing the Nginx Access Log with Hadoop to Calculate Daily PV


Code:
# pv_day.py
#!/usr/bin/env python
# coding=utf-8
from mrjob.job import MRJob
from nginx_accesslog_parser import NginxLineParser


class PvDay(MRJob):

    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        self.nginx_line_parser.parse(line)
        # time_local stringifies to "YYYY-MM-DD HH:MM:SS"; keep only the day part.
        day, _ = str(self.nginx_line_parser.time_local).split()
        yield day, 1  # one page view counted for that day

    def reducer(self, key, values):
        yield key, sum(values)


def main():
    PvDay.run()


if __name__ == '__main__':
    main()
Code Explanation:

The script defines a job class that inherits from MRJob and contains a series of well-defined steps.

A "step" consists of a mapper, a combiner, and a reducer. Each of them is optional, but at least one must be defined.
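For illustration, here is a minimal sketch (not part of the original job) showing how the same day-count step could also declare a combiner to pre-aggregate counts on each mapper before they are shuffled to the reducers; the class name PvDayWithCombiner and the simplified mapper are assumptions made for this example:

# pv_day_with_combiner.py -- illustrative sketch only, not the original code
from mrjob.job import MRJob


class PvDayWithCombiner(MRJob):

    def mapper(self, _, line):
        # For the sake of the sketch, assume each input line already starts
        # with the day, e.g. "2016-12-27 ..."; the real job extracts the day
        # with NginxLineParser instead.
        day = line.split()[0]
        yield day, 1

    def combiner(self, day, counts):
        # Runs on the mapper side with the same signature as the reducer;
        # pre-summing here reduces the amount of data shuffled across the cluster.
        yield day, sum(counts)

    def reducer(self, day, counts):
        yield day, sum(counts)


if __name__ == '__main__':
    PvDayWithCombiner.run()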

The mapper() method takes two parameters, a key and a value (here the key is ignored and each log line is passed in as the value), and yields key-value pairs.

The reducer() method takes a key and an iterator over values, and yields key-value pairs (here it sums the values for each key, which gives the PV count for each day).
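The nginx_accesslog_parser module imported by the job is not shown in the original post. Below is a minimal sketch of what such a parser might look like, assuming the default Nginx "combined" log format; the class name and the time_local attribute are taken from the job code, while the regular expression and date handling are illustrative assumptions:

# nginx_accesslog_parser.py -- illustrative sketch, not the original module
import re
from datetime import datetime

# Matches the default Nginx "combined" log format; only the fields
# the pv_day.py job needs are captured.
_LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+)'
)


class NginxLineParser(object):

    def parse(self, line):
        match = _LINE_RE.match(line)
        if match is None:
            raise ValueError('unparseable log line: %r' % line)
        # time_local looks like "27/Dec/2016:10:23:45 +0800"; drop the timezone
        # and parse the rest (assumes English month abbreviations) so that
        # str(self.time_local) becomes "2016-12-27 10:23:45", which the job
        # then splits on whitespace to get the day.
        raw_time = match.group('time_local').split(' ')[0]
        self.time_local = datetime.strptime(raw_time, '%d/%b/%Y:%H:%M:%S')
        self.remote_addr = match.group('remote_addr')
        self.status = int(match.group('status'))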

The job can be run in several ways:

Basic mode (log files passed as arguments):

# python3 pv_day.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/pv_day.root.20161228.022837.113256
Running step 1 of 1...
Streaming final output from /tmp/pv_day.root.20161228.022837.113256/output...
"2016-12-27"    47783
"2016-12-26"    299427
Removing temp directory /tmp/pv_day.root.20161228.022837.113256
...

Standard input (stdin) mode; this way only a single input can be piped in:

# python3 pv_day.py < access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/pv_day.root.20161228.024431.884434
Running step 1 of 1...
Reading from stdin
Streaming final output from /tmp/pv_day.root.20161228.024431.884434/output...
"2016-12-27"    47783
"2016-12-26"    299427
Removing temp directory /tmp/pv_day.root.20161228.024431.884434
...

Mixed mode (files and standard input together):

python3 pv_day.py input1.txt input2.txt - < input3.txt

Distributed:

By default, mrjob runs the job in a single Python process. This is convenient for debugging, but it is not real distributed computing.

To run the job in a distributed fashion, use the -r/--runner option: -r inline (the default), -r local, -r hadoop, or -r emr.

# python pv_day.py -r hadoop hdfs://my_home/input.txt
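As a further illustrative example, the Hadoop runner can also write its results back to HDFS instead of streaming them to the console; the output path below is a placeholder, while --output-dir and --no-output are standard mrjob options:

# python pv_day.py -r hadoop hdfs://my_home/input.txt --output-dir hdfs://my_home/pv_day_output --no-output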

