If high-traffic logs are written into Hadoop directly as many small files, the load on the Namenode becomes too large. So the logs from each node should first be merged and written into HDFS as a single file, before any later import into the database. Merge them on a regular schedule as needed and write the result into HDFS.
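A minimal sketch of such a merge step is shown below. The directory layout, file names, and HDFS path here are assumptions for illustration, not the original setup; only the standard hadoop fs -put command is relied on.

#!/bin/bash
# Assumed layout: each node has already copied its hourly dns log into /data/incoming/.
# Merge them into one large file so the Namenode sees a single file instead of many small
# ones, then upload it with the standard "hadoop fs -put" command.
# (The /logs/dns directory is assumed to already exist in HDFS.)
HOUR=$(date +%Y%m%d%H)
cat /data/incoming/*.log > /data/merged/dns-${HOUR}.log
$HADOOP_HOME/bin/hadoop fs -put /data/merged/dns-${HOUR}.log /logs/dns/

Run from cron, this keeps the number of files on the Namenode small while still getting the data into HDFS regularly.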
Let's look at the log size first: the compressed DNS log file is about 18 GB. I could certainly process it with awk or perl, but the speed would be nowhere near that of a distributed job.
Principles of Hadoop Streaming
The mapper and reducer read data from standard input, process it line by line, and emit the results to standard output. The Streaming tool creates a MapReduce job, submits it to the tasktrackers, and monitors the progress of the whole job.
This means MapReduce can be implemented in any language, as long as it can read from standard input and write to standard output.
Before doing that, let's test how fast a pure-shell simulation of MapReduce runs.
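A rough sketch of what such a local test can look like follows; sample.log is a placeholder file name, and mapper.sh / reducer.sh stand for the shell scripts shown further down in this post.

# map | sort | reduce, simulated locally in the shell, and timed
time (cat sample.log | ./mapper.sh | sort | ./reducer.sh > /dev/null)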
Looking at the results: an MB-scale file takes about 35 seconds.
A 2 GB log file took about 3 minutes. Of course, the script I wrote has its problems: it simulates MapReduce in plain shell instead of calling awk or gawk.
awk is fast! I really like awk for processing logs, although it is a little harder to learn and not as flexible and simple as the other shell tools.
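For instance, a one-liner like the following counts how often each value of the fifth field appears, assuming (as the shell mapper later in this post does) that the queried domain is field 5 of the log; the file name dns.log is just an example.

# count occurrences of field 5 and show the ten most frequent values
awk '{count[$5]++} END {for (d in count) print d, count[d]}' dns.log | sort -k2 -rn | head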
Here are the two official demos:
map.py
#!/usr/bin/env python"""A more advanced Mapper, using Python iterators and generators."""import sysdef read_input(file): for line in file: # split the line into words yield line.split()def main(separator='\t'): # input comes from STDIN (standard input) data = read_input(sys.stdin) for words in data: # write the results to STDOUT (standard output); # what we output here will be the input for the # Reduce step, i.e. the input for reducer.py # # tab-delimited; the trivial word count is 1 for word in words: print '%s%s%d' % (word, separator, 1)if __name__ == "__main__": main()
reduce.py
#!/usr/bin/env python"""A more advanced Reducer, using Python iterators and generators."""from itertools import groupbyfrom operator import itemgetterimport sysdef read_mapper_output(file, separator='\t'): for line in file: yield line.rstrip().split(separator, 1)def main(separator='\t'): # input comes from STDIN (standard input) data = read_mapper_output(sys.stdin, separator=separator) # groupby groups multiple word-count pairs by word, # and creates an iterator that returns consecutive keys and their group: # current_word - string containing a word (the key) # group - iterator yielding all ["<current_word>", "<count>"] items for current_word, group in groupby(data, itemgetter(0)): try: total_count = sum(int(count) for current_word, count in group) print "%s%s%d" % (current_word, separator, total_count) except ValueError: # count was not a number, so silently discard this item passif __name__ == "__main__": main()
Let's make it simple:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Let's simulate some data and run a quick local test.
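A simple way to do this is to pipe sample data through the two scripts, inserting a sort between them to mimic the shuffle phase; the script file names here are assumptions.

# feed a few words through mapper | sort | reducer and check the counts
echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
# expected output:
#   bar     1
#   foo     3
#   labs    1
#   quux    2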
Without further ado: in the Hadoop cluster environment, run Hadoop's streaming jar, pass in the mapreduce scripts, and specify the input and output. In the following example I simply use shell commands as the mapper and reducer.
[root@101 cron]# $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer wc
The detailed parameters are listed below. For performance, we can increase the number of map and reduce tasks according to the actual situation, which raises the parallelism (a full example follows after this list).
1) -input: input file path
2) -output: output file path
3) -mapper: the mapper program written by the user. It can be an executable file or a script.
4) -reducer: the reducer program written by the user. It can be an executable file or a script.
5) -file: package a file and ship it with the submitted job. It can be an input file used by the mapper or reducer, such as a configuration file or a dictionary.
6) -partitioner: user-defined partitioner program
7) -combiner: user-defined combiner program (must be implemented in Java)
8) -D: job attributes (previously -jobconf), for example:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: separator of the map task's input/output data. The default is \t.
4) stream.num.map.output.key.fields: the number of fields in the map task's output record that make up the key
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: separator of the reduce task's input/output data. The default is \t.
6) stream.num.reduce.output.key.fields: the number of fields in the reduce task's output record that make up the key
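Putting these parameters together, a streaming job that runs the Python mapper and reducer from above might look like the sketch below; the input/output paths and task counts are only example values.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=2 \
    -input /logs/dns \
    -output /logs/dns-wordcount \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py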
Next I counted how many lines the DNS log file has.
When the mapper and reducer are passed directly as command-line parameters, overly complex shell one-liners do not work well.
Instead, they can be written as shell script files:
#!/bin/bash
# mapper: print the 5th field of each line
# (the original word-count loop is left commented out)
while read LINE; do
#    for word in $LINE
#    do
#        echo "$word 1"
#    done
    echo "$LINE" | awk '{print $5}'
done
#!/bin/bash
# reducer: the input arrives sorted by key, so identical keys are adjacent
count=0
started=0
word=""
while read LINE; do
    goodk=`echo $LINE | cut -d ' ' -f 1`
    if [ "x" == x"$goodk" ]; then
        continue
    fi
    if [ "$word" != "$goodk" ]; then
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$goodk
        count=1
        started=1
    else
        count=$(( $count + 1 ))
    fi
done
# emit the last key as well
[ $started -ne 0 ] && echo -e "$word\t$count"
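With the two scripts saved as mapper.sh and reducer.sh (names assumed here), they can be shipped to the cluster with -file and used just like the Python versions:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.sh \
    -reducer reducer.sh \
    -file mapper.sh \
    -file reducer.sh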
Sometimes the following error occurs; when it does, take a careful look at your own mapreduce program:
13/12/14 13:26:52 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030
13/12/14 13:26:53 INFO streaming.StreamJob:  map 0%  reduce 0%
13/12/14 13:27:16 INFO streaming.StreamJob:  map 100%  reduce 100%
13/12/14 13:27:16 INFO streaming.StreamJob: To kill this job, run:
13/12/14 13:27:16 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201312131904_0030
13/12/14 13:27:16 INFO streaming.StreamJob: Tracking URL: http://101.rui.com:50030/jobdetails.jsp?jobid=job_201312131904_0030
13/12/14 13:27:16 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201312131904_0030_m_000000
13/12/14 13:27:16 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
After the Python mapreduce job runs successfully, the results and logs are stored in the output directory you specified; the results are in the part-00000 file.
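For example (assuming the output directory name used above), the result can be inspected or pulled back to the local filesystem like this:

# look at the aggregated result stored in part-00000
$HADOOP_HOME/bin/hadoop fs -ls myOutputDir
$HADOOP_HOME/bin/hadoop fs -cat myOutputDir/part-00000 | head
# or copy it out for the later database import
$HADOOP_HOME/bin/hadoop fs -get myOutputDir/part-00000 ./dns_result.txt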
Next time, let's talk about how to import the results into the database and process them in the background.
This article is from the "Fengyun, it's her" blog. Please do not reprint without permission.