Original article. Please credit the source when reposting.
MapReduce is the core of Hadoop (the elephant), and the MapReduce programming model is the core of data processing in Hadoop. A map/reduce job typically splits the input dataset into independent chunks, which the map tasks process in a completely parallel manner. The framework first sorts the map output and then feeds the results to the reduce tasks. The job's input and output are usually stored in the file system. Our programming therefore centers on the mapper stage and the reducer stage.
Below, we develop a MapReduce program from scratch and run it on a Hadoop cluster.
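The map, sort, and reduce stages described above can be imitated in a single process. The following is only a minimal sketch for intuition, not how Hadoop is implemented: the function names `map_phase`, `reduce_phase`, and `run_job` are made up for illustration, and Python's `sorted` stands in for the framework's sort step.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(sorted_pairs):
    """Sum the counts of each run of identical keys in key-sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def run_job(lines):
    """Chain the stages: map, sort by key, reduce."""
    # sorted() plays the role of the framework's sort between map and reduce.
    pairs = sorted(map_phase(lines), key=itemgetter(0))
    return list(reduce_phase(pairs))


if __name__ == "__main__":
    for word, total in run_job(["ni hao", "hello ni hao"]):
        print("%s\t%d" % (word, total))
```

On a real cluster the map calls run on different nodes and the sort happens inside the framework, so only the two phase functions correspond to code you write yourself.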
Mapper Code map.py:
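import sys

# Read lines from standard input and emit "word<TAB>1" for every word.
for line in sys.stdin:
    word_list = line.strip().split(' ')
    for word in word_list:
        print('\t'.join([word.strip(), '1']))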
Reducer Code reduce.py:
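import sys

cur_word = None  # word currently being accumulated
total = 0        # running count for cur_word

# The input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) < 2:
        continue  # skip malformed lines
    word = ss[0].strip()
    count = ss[1].strip()
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # A new word starts: output the finished one and reset the count.
        print('\t'.join([cur_word, str(total)]))
        cur_word = word
        total = 0
    total += int(count)

# Output the last word, unless there was no input at all.
if cur_word is not None:
    print('\t'.join([cur_word, str(total)]))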
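The reducer only compares each word with the previous one, so it depends on its input being sorted by key. The sketch below restates the same logic as a generator (the name `reduce_lines` is made up for this illustration) and feeds it the same records unsorted and sorted to show the difference.

```python
def reduce_lines(lines):
    """Same accumulation logic as the streaming reducer, over a list of lines."""
    cur_word = None
    total = 0
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) < 2:
            continue  # skip malformed lines
        word, count = parts[0].strip(), parts[1].strip()
        if cur_word is not None and cur_word != word:
            yield '\t'.join([cur_word, str(total)])
            total = 0
        cur_word = word
        total += int(count)
    if cur_word is not None:
        yield '\t'.join([cur_word, str(total)])


if __name__ == "__main__":
    records = ["ni\t1", "hao\t1", "ni\t1"]
    # Unsorted: 'ni' is split into two separate output lines.
    print(list(reduce_lines(records)))
    # Sorted: each word appears once with its full count.
    print(list(reduce_lines(sorted(records))))
```

This is why the local test pipes the mapper output through `sort` before the reducer, and why the framework sorts between the two phases.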
Input file src.txt (for testing; when running on the cluster, remember to upload it to HDFS):
Hello ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao Ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao n I hao ni hao ni hao ni hao dad would get out he mandolin and play for the family Dad loved to play the mandolin F Or his family he knew we enjoyed singing I had to mature into a man and has children of my own before I realized how Much he had sacrificed I had to,mature into a man and,have children of my own before. I realized how much he had sacrificed
First, debug locally to check that the results are correct. Enter the following command:
cat src.txt | python map.py | sort -k 1 | python reduce.py
Results from the command line output:
A 2 and 2 and,have 1 ao 1 before 1 before. I 1 Children 2 Dad 2 enjoyed 1 family 2 for 2 get 1 had 4 hao haoao 1 haoni 3 have 1 He 3 Hello 1 his 2 what 2 I 3 into 2 knew 1 loved 1 man 2 mandolin 2 mature 1 much 2 my 2 ni 2 out 1 own 2 play 2 realized 2 sacrificed 2 singing 1 the 2 to 2 to,mature 1 We 1 would 1
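The local result can also be cross-checked without the mapper and reducer at all. The sketch below counts words directly with `collections.Counter`; the helper names `count_words` and `format_counts` are made up for illustration, and the ordering may differ from `sort -k 1`, whose collation depends on the locale.

```python
from collections import Counter


def count_words(text):
    """Count space-separated words, using the same split as the mapper."""
    counts = Counter()
    for line in text.splitlines():
        for word in line.strip().split(' '):
            if word:  # ignore empty strings produced by repeated spaces
                counts[word] += 1
    return counts


def format_counts(counts):
    """Render the counts as 'word<TAB>count' lines, sorted by word."""
    return ["%s\t%d" % (word, counts[word]) for word in sorted(counts)]


if __name__ == "__main__":
    sample = "ni hao ni hao\nhello ni hao"
    counts = count_words(sample)
    for line in format_counts(counts):
        print(line)
    # The number of distinct words equals the number of lines the reducer prints.
    print(len(counts))
```

To check the real input, read src.txt into a string and pass it to `count_words` instead of the small sample used here.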
Local debugging shows that the code is OK. Next, run it on the cluster. For convenience, I wrote a script, run.sh, to save manual work.
HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_FILE_PATH="/home/input/src.txt"
OUTPUT_PATH="/home/output"

$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file ./map.py \
    -file ./reduce.py
The script is explained below:
HADOOP_CMD: path to the hadoop executable in bin
STREAM_JAR_PATH: path to the streaming JAR package
INPUT_FILE_PATH: input path of the resource file on the Hadoop cluster
OUTPUT_PATH: output path for the results on the Hadoop cluster. (Note: this directory must not exist when the job starts, so the script deletes it first. Note: on the first run the directory does not exist yet, so the delete command reports an error. You can create the output directory manually beforehand.)

$HADOOP_CMD fs -rmr $OUTPUT_PATH
$HADOOP_CMD jar $STREAM_JAR_PATH -input $INPUT_FILE_PATH -output $OUTPUT_PATH -mapper "python map.py" -reducer "python reduce.py" -file ./map.py -file ./reduce.py
# This is a fixed format: specify the input and output paths, specify the mapper and reducer commands, and use -file to distribute the mapper and reducer code files we wrote, because the other nodes of the cluster do not have these files.
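The same two commands can also be assembled from Python, which keeps each argument explicit. This is only a sketch of an alternative to run.sh: the function names are made up for illustration, the paths and flags are the ones from the script above, and `run_job` is defined but not called here because it needs a working Hadoop installation.

```python
import subprocess

# Paths taken from run.sh; adjust them for your own cluster.
HADOOP_CMD = "/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH = "/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_FILE_PATH = "/home/input/src.txt"
OUTPUT_PATH = "/home/output"


def build_delete_command(hadoop_cmd, output_path):
    """Command that removes the old output directory."""
    return [hadoop_cmd, "fs", "-rmr", output_path]


def build_streaming_command(hadoop_cmd, stream_jar, input_path, output_path):
    """Command that submits the streaming job with our mapper and reducer."""
    return [
        hadoop_cmd, "jar", stream_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python map.py",
        "-reducer", "python reduce.py",
        "-file", "./map.py",
        "-file", "./reduce.py",
    ]


def run_job():
    """Delete the old output, then submit the job; returns the job's exit code."""
    # The delete fails harmlessly when the directory does not exist yet.
    subprocess.call(build_delete_command(HADOOP_CMD, OUTPUT_PATH))
    return subprocess.call(build_streaming_command(
        HADOOP_CMD, STREAM_JAR_PATH, INPUT_FILE_PATH, OUTPUT_PATH))


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Hadoop.
    print(" ".join(build_streaming_command(
        HADOOP_CMD, STREAM_JAR_PATH, INPUT_FILE_PATH, OUTPUT_PATH)))
```

Passing the arguments as a list avoids shell quoting problems with values such as "python map.py", which must reach Hadoop as a single argument.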
Enter the following command to count the records output by the reduce phase:
cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l
Command line output: 43
Open master:50030 in a browser to view the details of the job.
Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     100.00%     2          0        0        2         0       0/0
reduce  100.00%     1          0        0        1         0       0/0
The Map-Reduce Framework counters show the following:
Counter                Map  Reduce  Total
Reduce output records  0    0       43
This matches the 43 records counted locally and shows that the whole process succeeded. The development of the first Hadoop program is complete.
Python Development MapReduce Series (i) WordCount Demo