Hadoop Streaming is a toolkit for MapReduce programming that lets you write the Mapper and Reducer as executable commands or scripts in any language, so you can use the Hadoop parallel computing framework to process big data.
All right, I admit the paragraph above is copied. The real content follows.
First you need a deployed Hadoop environment; for that, refer to http://www.powerxing.com/install-hadoop-in-centos/
All right, the original writing starts from the next line.
After deploying Hadoop, you need to download the hadoop-streaming package. You can download hadoop-streaming-0.23.6.jar.zip from http://www.java2s.com/Code/JarDownload/hadoop-streaming/, or browse http://www.java2s.com/Code/JarDownload/hadoop-streaming/ to pick the latest version. Do not pick the source package (or bear the consequences yourself); take the compiled jar and place it in the /usr/local/hadoop directory for later use.
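For reference, a minimal sketch of fetching and staging the jar, assuming wget and unzip are available and the link above still serves the file:

wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-0.23.6.jar.zip
unzip hadoop-streaming-0.23.6.jar.zip                # unpack the compiled jar
mv hadoop-streaming-0.23.6.jar /usr/local/hadoop/    # keep it beside the Hadoop install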
Next comes a big-data statistics example. I downloaded the mother-and-baby purchase dataset from Ali's Tianchi Big Data Contest website; it records the purchasing user name, date of birth and gender of 900+ babies. Tianchi's address: https://tianchi.shuju.aliyun.com/datalab/index.htm
The data is a CSV file with the following structure:
user name, date of birth, gender (0 = female, 1 = male, 2 = not willing to disclose)
For example: 415971,20121111,0 (the data has been anonymized)
Let's try to count the number of male and female babies per year.
Next we start writing the mapper program, mapper.py. Because hadoop-streaming is based on Unix pipes, the data arrives on standard input, so the mapper simply reads from sys.stdin.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()
    data = line.split(',')
    if len(data) < 3:
        continue
    user_id = data[0]
    birthyear = data[1][0:4]
    gender = data[2]
    print >> sys.stdout, "%s\t%s" % (birthyear, gender)
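Since the mapper is just a pipe filter, you can sanity-check it locally before going anywhere near Hadoop; a quick sketch, assuming sample.csv sits in the current directory:

cat sample.csv | python mapper.py | head -5
# should print lines of the form: birthyear<TAB>gender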
A very simple program; if you can't follow it, go and level up your Python on your own.
Here is the reduce program. One thing to be aware of: between map and reduce, Hadoop automatically sorts the map output by key, so the reducer sees a stream of sorted key-value pairs, which simplifies our programming.
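In other words, by the time our reducer runs, its standard input looks something like this (made-up values, purely to illustrate that identical keys arrive one after another):

2011	0
2011	1
2011	1
2012	0
2013	1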
Here is my reducer.py, brimming with raw Force, not like those flashy cheap goods out there.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

gender_totle = {'0': 0, '1': 0, '2': 0}
prev_key = False

for line in sys.stdin:
    # the keys arrive already sorted from the map phase
    line = line.strip()
    data = line.split('\t')
    birthyear = data[0]
    curr_key = birthyear
    gender = data[1]
    # look for a key boundary, then output the result
    if prev_key and curr_key != prev_key:
        # not the first key, and we found a boundary:
        # output the totals of the previous key first
        print >> sys.stdout, "%s year has male %s and female %s" % \
            (prev_key, gender_totle['1'], gender_totle['0'])
        prev_key = curr_key
        gender_totle['0'] = 0
        gender_totle['1'] = 0
        gender_totle['2'] = 0  # reset the counters
        gender_totle[gender] += 1  # start counting the new key
    else:
        prev_key = curr_key
        gender_totle[gender] += 1

# don't forget the last key, or the final year is never printed
if prev_key:
    print >> sys.stdout, "%s year has male %s and female %s" % \
        (prev_key, gender_totle['1'], gender_totle['0'])
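Because the whole flow is just Unix pipes, you can also simulate the entire job locally, with sort standing in for Hadoop's shuffle phase; a minimal sketch, assuming mapper.py, reducer.py and sample.csv are all in the current directory:

cat sample.csv | python mapper.py | sort | python reducer.py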
The next step is to upload the sample and the mapper/reducer into HDFS and execute the job; this is where I stepped into one pit after another.
The first thing to do is create the appropriate directory in HDFS. For convenience, I aliased some of the Hadoop commands:
alias stop-dfs='/usr/local/hadoop/sbin/stop-dfs.sh'
alias start-dfs='/usr/local/hadoop/sbin/start-dfs.sh'
alias dfs='/usr/local/hadoop/bin/hdfs dfs'
Once Hadoop is started, create a user directory first
dfs -mkdir -p /user/root
Upload a sample to this directory
dfs -put ./sample.csv /user/root
Of course this can be handled in a more standard way; I'll explain the difference between the two later.
dfs -mkdir -p /user/root/input
dfs -put ./sample.csv /user/root/input
Next, upload mapper.py and reducer.py to the server. Switch to the directory where you uploaded the two files, create an input subdirectory there, and copy both files into it. I don't know why it has to be done this way; in short, doing it works, and once I figure out why I'll come back and update this, so don't dwell on this detail.
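One more pit worth flagging: hadoop-streaming launches the -mapper and -reducer as ordinary commands, so both scripts should keep their #!/usr/bin/python shebang line and be marked executable; a small precaution:

chmod +x mapper.py reducer.py   # lets Hadoop execute the scripts directly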
Then you can execute the job; the command is as follows:
hadoop jar /usr/local/hadoop/hadoop-streaming-0.23.6.jar \
    -input sample.csv \
    -output output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf mapred.reduce.tasks=1 \
    -file mapper.py \
    -file reducer.py
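For what it's worth, the two -file options at the end ship the local mapper.py and reducer.py along with the job, so every task node gets its own copy of the scripts; the other flags name the HDFS input, the output directory, and which command to run for each phase.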
If you put sample.csv into input, the command is written as below; I haven't tried it though, so if anything goes wrong it's none of my business.
hadoop jar /usr/local/hadoop/hadoop-streaming-0.23.6.jar \
    -input input/sample.csv \
    -output output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf mapred.reduce.tasks=1 \
    -file mapper.py \
    -file reducer.py
Next comes the exciting moment: kneel down devoutly and press Enter.
If you get an error saying output-streaming already exists, just run dfs -rm -r /user/root/output-streaming, then jump up and press Enter again.
Don't be surprised when the screen starts flooding with log output; yes, that's how it's supposed to look.
If you see something like the following, the job succeeded:
16/08/18 18:35:20 INFO mapreduce.Job:  map 100% reduce 100%
16/08/18 18:35:20 INFO mapreduce.Job: Job job_local926114196_0001 completed successfully
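The statistics themselves end up in the job's output directory in HDFS; to read them back, a minimal sketch assuming the default part-file naming:

dfs -cat /user/root/output-streaming/part-*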