Building a Hadoop-Based MapReduce Log Analysis Platform with Python

Last updated: 2013-12-29. Source: Internet

[Image: kv.jpg (http://www.bkjia.com/uploads/allimg/131229/121I551N-0.jpg)]

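Writing high-volume logs straight into Hadoop puts too much load on the NameNode, because every small file adds metadata that it has to track. Merge the logs before loading them: combine the logs from each node into a single file and write that file to HDFS. Run the merge periodically, at whatever interval suits your traffic.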
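A minimal sketch of the merge step, assuming the per-node logs have already been collected into one local directory. The function name `merge_logs`, the `*.log` pattern, and the file names are hypothetical and not part of the original setup. It concatenates many small files into one larger file, which could then be uploaded with `hadoop fs -put`.

```python
import glob
import os
import shutil
import tempfile


def merge_logs(src_dir, merged_path, pattern="*.log"):
    """Concatenate every file in src_dir matching pattern into merged_path.

    Returns the number of source files that were merged. Files are copied
    byte for byte, so each one is expected to end with a newline.
    """
    sources = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(merged_path, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(sources)


if __name__ == "__main__":
    # Demo with throwaway files; real input would be the per-node logs.
    work = tempfile.mkdtemp()
    for i in range(3):
        with open(os.path.join(work, "node%d.log" % i), "w") as f:
            f.write("line from node %d\n" % i)
    merged = os.path.join(work, "merged.out")
    print(merge_logs(work, merged))
```

Writing the merged file under a name that does not match the pattern keeps it from being picked up by the next merge run.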

Let's look at the size of the logs: 200 GB of DNS log files, which I compressed down to 18 GB. awk or Perl could certainly process them, but a single machine will not match the throughput of distributed processing.

[Image: 2013-12-12_225703.jpg (http://www.bkjia.com/uploads/allimg/131229/121I5K42-1.jpg)]

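How Hadoop Streaming works

The mapper and the reducer read the user's data from standard input, process it line by line, and write the results to standard output. The Streaming tool creates a MapReduce job, submits it to the TaskTrackers, and monitors the job while it runs.

Any language that can read standard input and write standard output can be used for MapReduce.

Before going further, let's run a quick test of how fast a shell script that simulates MapReduce is.

[Image: rsdfsdfr.jpg (http://www.bkjia.com/uploads/allimg/131229/121I510Y-2.jpg)]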

Here is the result: a 350 MB file took about 35 seconds.

[Image: 333.jpg (http://www.bkjia.com/uploads/allimg/131229/121I55Q1-3.jpg)]

This one is a 2 GB log file, and it took a full 3 minutes. Part of the blame lies with my script: it simulates the MapReduce approach in the shell instead of calling fast tools such as awk or gawk.

[Image: 2013-12-12_234808.jpg (http://www.bkjia.com/uploads/allimg/131229/121I54K8-4.jpg)]

Look at the speed of awk! It really is impressive. I like using awk for log processing too; it is just harder to learn and less straightforward than the other shell tools.

[Image: 2013-12-13_000942.jpg (http://www.bkjia.com/uploads/allimg/131229/121I51624-5.jpg)]

These are the two officially provided demos:

map.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()

reduce.py, rewritten in the same style

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys


def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass


if __name__ == "__main__":
    main()

Let's make it simpler still:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Let's simulate some data and run a quick test.

[Image: good.jpg (http://www.bkjia.com/uploads/allimg/131229/121I5B47-6.jpg)]
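The map, sort, reduce flow can also be rehearsed inside a single Python process before anything is submitted to the cluster. The sketch below is only an illustration: the function names and the sample lines are made up, and `sorted` stands in for the shuffle and sort that Hadoop performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer: sum the counts of consecutive pairs that share a key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate the job: map, sort by key (the shuffle), then reduce."""
    shuffled = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(shuffled))


if __name__ == "__main__":
    sample = ["a.com b.com a.com", "c.com a.com"]  # made-up input
    for word, total in run_local(sample):
        print("%s\t%d" % (word, total))
```

The reducer relies on its input being sorted by key, which is why the sort step cannot be skipped in a local rehearsal.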

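There is not much left to do. On the Hadoop cluster, run Hadoop's streaming jar, pass in the mapper and reducer scripts, and specify the output directory. The example below uses shell commands as the mapper and reducer.

[root@101 cron]# $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer wc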
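A job like this can also be assembled from Python, which helps when it has to be launched from a scheduled script. The sketch below only builds the argument list and does not run anything. The function name `build_streaming_cmd` and all paths are hypothetical, and the streaming jar path has to be given explicitly because a `*` wildcard is not expanded without a shell.

```python
import os


def build_streaming_cmd(hadoop_home, streaming_jar, input_dir, output_dir,
                        mapper, reducer, reduce_tasks=None):
    """Return the argument list for a Hadoop Streaming job."""
    cmd = [os.path.join(hadoop_home, "bin", "hadoop"), "jar", streaming_jar]
    if reduce_tasks is not None:
        # Generic options such as -D go before the streaming options.
        cmd += ["-D", "mapred.reduce.tasks=%d" % reduce_tasks]
    cmd += ["-input", input_dir, "-output", output_dir]
    for script in (mapper, reducer):
        # -file ships the local script to the task nodes with the job.
        cmd += ["-file", script]
    # On the task nodes the shipped scripts are referred to by file name.
    cmd += ["-mapper", os.path.basename(mapper),
            "-reducer", os.path.basename(reducer)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(build_streaming_cmd(
        "/usr/local/hadoop", "/path/to/hadoop-streaming.jar",
        "myInputDirs", "myOutputDir", "./map.py", "./reduce.py",
        reduce_tasks=4)))
```

The resulting list can be handed to `subprocess.check_call`. The scripts need a shebang line and execute permission to be usable as `-mapper` and `-reducer`.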

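As for the detailed parameters: to improve performance you can raise the number of tasks. Test to find what suits your workload, and do not set it too high, or the extra tasks just add overhead.

1) -input: input file path

2) -output: output file path

3) -mapper: the user's own mapper program; it can be an executable or a script

4) -reducer: the user's own reducer program; it can be an executable or a script

5) -file: packages a file into the submitted job; this can be an input file that the mapper or reducer needs, such as a configuration file or a dictionary

6) -partitioner: user-defined partitioner program

7) -combiner: user-defined combiner program (must be implemented in Java)

8) -D: job properties (formerly set with -jobconf), specifically:
    1) mapred.map.tasks: number of map tasks
    2) mapred.reduce.tasks: number of reduce tasks
    3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for the map task's input/output data; both default to \t
    4) stream.num.map.output.key.fields: number of fields that make up the key in a map task's output record
    5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: field separator for the reduce task's input/output data; both default to \t
    6) stream.num.reduce.output.key.fields: number of fields that make up the key in a reduce task's output record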
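To make the separator and key-field settings concrete, here is a small sketch of how a single output line is divided into key and value. It follows the behaviour described in the Hadoop Streaming documentation as I understand it: the prefix up to the N-th separator is the key, and a line with too few separators becomes a key with an empty value. The function `split_key_value` is a hypothetical helper, not part of Hadoop.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value).

    separator mirrors stream.map.output.field.separator and
    num_key_fields mirrors stream.num.map.output.key.fields.
    """
    parts = line.rstrip("\n").split(separator)
    if len(parts) <= num_key_fields:
        # Too few separators: the whole line is the key, the value is empty.
        return separator.join(parts), ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


if __name__ == "__main__":
    print(split_key_value("www.example.com\t1"))
    print(split_key_value("a.b.c.d.e", separator=".", num_key_fields=4))
    print(split_key_value("a.b", separator=".", num_key_fields=4))
```

With the defaults, everything before the first tab is the key, which is why the mappers in this article print `word<TAB>1`.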

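Here we count how many lines the DNS log file has:

[Image: 2013-12-14_114409.jpg (http://www.bkjia.com/uploads/allimg/131229/121I56395-7.jpg)]

When a shell command is passed directly as the -mapper or -reducer argument, keep it simple; Streaming does not understand long or complex shell syntax.

Instead, put the logic into shell script files: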

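#!/bin/bash
# Mapper: print the 5th whitespace-separated field of every input line.
# awk reads standard input by itself, so it needs no surrounding
# "while read" loop (such a loop would consume the first line).
awk '{print $5}'

#!/bin/bash
# Reducer: count runs of identical keys (the input arrives sorted by key).
count=0
started=0
word=""
while read LINE; do
  goodk=`echo $LINE | cut -d ' ' -f 1`
  # skip empty lines
  if [ "x" == x"$goodk" ]; then
    continue
  fi
  if [ "$word" != "$goodk" ]; then
    # key changed: emit the finished key, then start counting the new one
    [ $started -ne 0 ] && echo -e "$word\t$count"
    word=$goodk
    count=1
    started=1
  else
    count=$(( $count + 1 ))
  fi
done
# emit the last key, which the loop itself never flushes
if [ $started -ne 0 ]; then
  echo -e "$word\t$count"
fi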
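For comparison, the same two steps can be sketched in Python. This assumes, as the shell mapper above does, that the value of interest is the 5th whitespace-separated field. The function names and the sample lines are made up, since the article does not show the real DNS log layout.

```python
from itertools import groupby


def map_field(lines, index=4):
    """Mapper: yield the field at position index (0-based) of each line.

    Lines with too few fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) > index:
            yield fields[index]


def reduce_runs(keys):
    """Reducer: count runs of identical consecutive keys (sorted input)."""
    for key, run in groupby(k.strip() for k in keys):
        if key:
            yield key, sum(1 for _ in run)


if __name__ == "__main__":
    # Made-up sample lines; only the position of the 5th field matters.
    sample = [
        "f1 f2 f3 f4 www.example.com f6",
        "f1 f2 f3 f4 mail.example.com",
        "too short",
        "f1 f2 f3 f4 www.example.com",
    ]
    for key, count in reduce_runs(sorted(map_field(sample))):
        print("%s\t%d" % (key, count))
```

Unlike the shell loop, `groupby` emits the final key on its own, so no extra flush is needed after the last line.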

Sometimes you will hit an error like the one below. When that happens, take a careful look at your own mapper and reducer programs.

13/12/14 13:26:52 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030

13/12/14 13:26:53 INFO streaming.StreamJob: map 0% reduce 0%

13/12/14 13:27:16 INFO streaming.StreamJob: map 100% reduce 100%

13/12/14 13:27:16 INFO streaming.StreamJob: To kill this job, run:

13/12/14 13:27:16 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201312131904_0030

13/12/14 13:27:16 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030

13/12/14 13:27:16 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201312131904_0030_m_000000

13/12/14 13:27:16 INFO streaming.StreamJob: killJob...

Streaming Command Failed!
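One possible cause of a failed map task is a script that crashes on an unexpected line; by default Streaming treats a non-zero exit status as a task failure. The following is a sketch of a more defensive mapper under that assumption, not a diagnosis of the failure logged above. The function name, the field index, and the counter names are hypothetical. The `reporter:counter:` line written to stderr follows the counter format described in the Hadoop Streaming documentation.

```python
import sys


def safe_map(lines, index=4, err=sys.stderr):
    """Yield "key<TAB>1" records, skipping lines that cannot be parsed."""
    skipped = 0
    for line in lines:
        try:
            key = line.split()[index]
        except IndexError:
            # Malformed line: count it and carry on instead of crashing.
            skipped += 1
            continue
        yield "%s\t1" % key
    if skipped:
        # Streaming picks up counter updates of this form from stderr.
        err.write("reporter:counter:mapper,skipped_lines,%d\n" % skipped)


if __name__ == "__main__":
    # Made-up input; a real mapper would pass sys.stdin instead.
    for record in safe_map(["f1 f2 f3 f4 www.example.com", "bad line"]):
        print(record)
```

Running the script by hand on a slice of the real log, piped through `sort` into the reducer, is usually the quickest way to reproduce such a failure outside the cluster.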

After a Python MapReduce job finishes successfully, the results and logs are normally placed in the output directory you specified; the results themselves are in the part-00000 file.

[Image: 2013-12-14_151217.jpg (http://www.bkjia.com/uploads/allimg/131229/121I56010-8.jpg)]
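Once the output is available, the tab-separated records are easy to post-process. The sketch below picks the most frequent keys from lines in the `key<TAB>count` form that the reducers in this article write. The function name `top_keys` and the sample data are made up for illustration.

```python
def top_keys(lines, n=10):
    """Return the n (key, count) pairs with the largest counts.

    Each line is expected to look like "key<TAB>count"; anything else
    is ignored.
    """
    pairs = []
    for line in lines:
        key, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.strip().isdigit():
            pairs.append((key, int(count)))
    # Largest count first; ties are ordered by key for a predictable result.
    pairs.sort(key=lambda kv: (-kv[1], kv[0]))
    return pairs[:n]


if __name__ == "__main__":
    sample = ["a.com\t3\n", "b.com\t10\n", "not a record\n", "c.com\t3\n"]
    for key, count in top_keys(sample, n=2):
        print("%s\t%d" % (key, count))
```

The lines could come from a local copy of part-00000 opened with `open()`, or from the output of `hadoop fs -cat` piped into the script.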

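Next, we will talk about how to load the results into a database and run the job in the background.

This article comes from the blog "峰雲，就她了。"; please do not repost it.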

