Build a Hadoop-based MapReduce log analysis platform using Python








If logs with heavy traffic are written into Hadoop directly, the load on the NameNode becomes too high. So the logs from each node should first be merged into a single file and written into HDFS before any database import; do the merge on a regular schedule as needed and write the result into HDFS.
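A minimal cron-style sketch of that merge-and-upload step (the paths and the hourly naming scheme are just assumptions for illustration):

#!/bin/bash
# merge the current hour's log from every node into one big file, then
# upload that single file to HDFS so the NameNode tracks one large file
# instead of thousands of small ones
HOUR=$(date +%Y%m%d%H)
MERGED=/data/merged/dns-${HOUR}.log

cat /data/logs/node*/dns-${HOUR}.log > "${MERGED}"

# upload the merged file into HDFS (run from a machine with the hadoop client)
hadoop fs -put "${MERGED}" /logs/dns/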

Let's take a look at the log size: the DNS log files compress down to 18 GB. Awk or perl could certainly handle them, but the processing speed is nowhere near that of a distributed approach.



Principles of Hadoop Streaming

The mapper and the reducer read user data from standard input, process it line by line, and send the results to standard output. The Streaming tool creates a MapReduce job, submits it to the cluster's tasktrackers, and monitors the execution of the whole job.


MapReduce can therefore be implemented in any language, as long as it can read standard input and write standard output.
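For the word-count demos below, for example, the data flowing across those pipes looks roughly like this (the values are purely illustrative):

# mapper output (and, after the framework's sort/shuffle, the reducer's input): one tab-separated key/count per line
bar     1
foo     1
foo     1

# reducer output: one aggregated tab-separated key/count per line
bar     1
foo     2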

Before doing that, let's test how fast a shell-based simulation of MapReduce runs.



Checking the results: the smaller, MB-sized file takes about 35 seconds.



A 2 GB log file took 3 minutes. Of course, part of the problem is the script I wrote: it simulates MapReduce in plain shell instead of simply calling awk or gawk.



awk is fast! I really like awk for processing logs, but it is a little harder to learn and not as flexible and simple as the other shell tools.



Here are the two official demos:

map.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()


reduce.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()


Let's make it simple:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)


#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)


Now let's generate some simple test data and run a quick local test.
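A minimal local run, assuming the simpler versions above were saved as map.py and reduce.py (the sample data is made up; sort stands in for the framework's shuffle phase):

printf 'foo foo bar\nbar baz foo\n' > sample.txt
cat sample.txt | python map.py | sort -k1,1 | python reduce.py
# expected output:
# bar     2
# baz     1
# foo     3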





That's about it. In the Hadoop cluster environment, run Hadoop's streaming jar component, pass in the mapreduce scripts, and specify the input and output. In the example below I use shell commands as the mapper and reducer.


[root@101 cron]# $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer wc


The detailed parameters are listed below. For performance, we can raise the number of map and reduce tasks based on the actual situation and test; this increases how much work runs in parallel. A sample invocation using several of these options follows the list.

1) -input: input file path

2) -output: output file path

3) -mapper: the mapper program written by the user; it can be an executable file or a script

4) -reducer: the reducer program written by the user; it can be an executable file or a script

5) -file: package a file and ship it with the submitted job; this can be a file the mapper or reducer needs, such as a configuration file or a dictionary

6) -partitioner: user-defined partitioner program

7) -combiner: user-defined combiner program (must be implemented in Java)

8) -D: attributes of the job (previously set with -jobconf), specifically:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: separator of the map task's input/output data; the default is \t
4) stream.num.map.output.key.fields: number of fields in the map task's output records that form the key
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: separator of the reduce task's input/output data; the default is \t
6) stream.num.reduce.output.key.fields: number of fields in the reduce task's output records that form the key
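For example, a sketch of submitting the Python mapper and reducer above with a few of these options (the HDFS paths and task counts are just illustrative assumptions):

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=2 \
    -input  /logs/dns \
    -output /output/dns-wordcount \
    -mapper  map.py \
    -reducer reduce.py \
    -file map.py \
    -file reduce.py
# map.py and reduce.py must be executable (chmod +x) and start with
# #!/usr/bin/env python so the tasktrackers can run them directly.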


I also counted how many lines the DNS log files contain.


When the mapper or reducer is passed inline on the command line, it cannot be a shell expression that is too complex.

It can instead be written as a shell script:

#!/bin/bash
while read LINE; do
#  for word in $LINE
#  do
#    echo "$word 1"
#  done
    echo "$LINE" | awk '{print $5}'
done


#!/bin/bash
count=0
started=0
word=""
while read LINE; do
  goodk=`echo $LINE | cut -d ' ' -f 1`
  if [ "x" == x"$goodk" ]; then
     continue
  fi
  if [ "$word" != "$goodk" ]; then
    [ $started -ne 0 ] && echo -e "$word\t$count"
    word=$goodk
    count=1
    started=1
  else
    count=$(( $count + 1 ))
  fi
done
# don't forget to emit the last key after the loop ends
[ $started -ne 0 ] && echo -e "$word\t$count"


Sometimes the following error occurs. Take a careful look at your own mapreduce program:

13/12/14 13:26:52 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030
13/12/14 13:26:53 INFO streaming.StreamJob:  map 0%  reduce 0%
13/12/14 13:27:16 INFO streaming.StreamJob:  map 100%  reduce 100%
13/12/14 13:27:16 INFO streaming.StreamJob: To kill this job, run:
13/12/14 13:27:16 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201312131904_0030
13/12/14 13:27:16 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030
13/12/14 13:27:16 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201312131904_0030_m_000000
13/12/14 13:27:16 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
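When a streaming job dies like this, the cause is usually in the mapper/reducer scripts rather than in Hadoop itself. A hedged checklist of the usual suspects:

# make sure the scripts are executable and start with a valid shebang line
chmod +x map.py reduce.py
head -1 map.py          # should print: #!/usr/bin/env python

# ship the scripts to the tasktrackers with -file, otherwise the nodes
# cannot find them:   -file map.py -file reduce.py

# and re-test the pipeline locally before resubmitting the job
cat sample.txt | ./map.py | sort -k1,1 | ./reduce.py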


After the Python MapReduce job runs successfully, the results and logs are stored in the output directory you specified; the results themselves are in the part-00000 file.
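For example, using the output path from the sample command above:

# list the job's output directory and peek at the result
hadoop fs -ls /output/dns-wordcount
hadoop fs -cat /output/dns-wordcount/part-00000 | head

# or merge all the part-* files down to the local filesystem in one go
hadoop fs -getmerge /output/dns-wordcount ./dns-wordcount.txt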




Next time, let's talk about how to import the results into the database and process them in the background.



This article is from the "Fengyun, it's her" blog. Please do not reprint without permission.
