Hadoop Streaming is a toolkit for MapReduce programming that lets you write the Mapper and Reducer as executable commands or scripts in any language, so you can use the Hadoop parallel computing framework to process big data.
All right, I admit the paragraph above is copied. The real content follows.
First you need a deployed Hadoop environment; for that, refer to http://www.powerxing.com/install-hadoop-in-centos/
All right, the original writing starts from the next line.
After deploying Hadoop, you need to download the hadoop-streaming package. You can download hadoop-streaming-0.23.6.jar.zip from http://www.java2s.com/Code/JarDownload/hadoop-streaming/, or browse http://www.java2s.com/Code/JarDownload/hadoop-streaming/ to pick the latest version. Do not pick the source package (or bear the consequences yourself); take the compiled jar and place it in the /usr/local/hadoop directory for later use.
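For reference, a minimal sketch of fetching and staging the jar, assuming wget and unzip are available and the link above still serves the file:

wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-0.23.6.jar.zip
unzip hadoop-streaming-0.23.6.jar.zip                # unpack the compiled jar
mv hadoop-streaming-0.23.6.jar /usr/local/hadoop/    # keep it beside the Hadoop install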
Next comes a big-data statistics example. I downloaded the mother-and-baby purchase dataset from Ali's Tianchi Big Data Contest website; it records the purchasing user name, date of birth and gender of 900+ babies. Tianchi's address: https://tianchi.shuju.aliyun.com/datalab/index.htm
The data is a CSV file with the following structure:
user name, date of birth, gender (0 = female, 1 = male, 2 = not willing to disclose)
For example: 415971,20121111,0 (the data has been anonymized)
Let's try to count the number of male and female babies per year.
Next we start writing the mapper program, mapper.py. Because hadoop-streaming is based on Unix pipes, the data arrives on standard input, so the mapper simply reads from sys.stdin.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()
    data = line.split(',')
    if len(data) < 3:
        continue
    user_id = data[0]
    birthyear = data[1][0:4]
    gender = data[2]
    print >> sys.stdout, "%s\t%s" % (birthyear, gender)
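Since the mapper is just a pipe filter, you can sanity-check it locally before going anywhere near Hadoop; a quick sketch, assuming sample.csv sits in the current directory:

cat sample.csv | python mapper.py | head -5
# should print lines of the form: birthyear<TAB>gender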
A very simple program; if you can't follow it, go and level up your Python on your own.
Here is the reduce program. One thing to be aware of: between map and reduce, Hadoop automatically sorts the map output by key, so the reducer sees a stream of sorted key-value pairs, which simplifies our programming.
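In other words, by the time our reducer runs, its standard input looks something like this (made-up values, purely to illustrate that identical keys arrive one after another):

2011	0
2011	1
2011	1
2012	0
2013	1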
Here is my reducer.py, brimming with raw Force, not like those flashy cheap goods out there.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

gender_totle = {'0': 0, '1': 0, '2': 0}
prev_key = False

for line in sys.stdin:
    # the keys arrive already sorted from the map phase
    line = line.strip()
    data = line.split('\t')
    birthyear = data[0]
    curr_key = birthyear
    gender = data[1]
    # look for a key boundary, then output the result
    if prev_key and curr_key != prev_key:
        # not the first key, and we found a boundary:
        # output the totals of the previous key first
        print >> sys.stdout, "%s year has male %s and female %s" % \
            (prev_key, gender_totle['1'], gender_totle['0'])
        prev_key = curr_key
        gender_totle['0'] = 0
        gender_totle['1'] = 0
        gender_totle['2'] = 0  # reset the counters
        gender_totle[gender] += 1  # start counting the new key
    else:
        prev_key = curr_key
        gender_totle[gender] += 1

# don't forget the last key, or the final year is never printed
if prev_key:
    print >> sys.stdout, "%s year has male %s and female %s" % \
        (prev_key, gender_totle['1'], gender_totle['0'])
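Because the whole flow is just Unix pipes, you can also simulate the entire job locally, with sort standing in for Hadoop's shuffle phase; a minimal sketch, assuming mapper.py, reducer.py and sample.csv are all in the current directory:

cat sample.csv | python mapper.py | sort | python reducer.py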
The next step is to upload the sample and the mapper/reducer into HDFS and execute the job; this is where I stepped into one pit after another.
The first thing to do is create the appropriate directory in HDFS. For convenience, I aliased some of the Hadoop commands:
alias stop-dfs='/usr/local/hadoop/sbin/stop-dfs.sh'
alias start-dfs='/usr/local/hadoop/sbin/start-dfs.sh'
alias dfs='/usr/local/hadoop/bin/hdfs dfs'
Once Hadoop is started, create a user directory first
dfs -mkdir -p /user/root
Upload a sample to this directory
dfs -put ./sample.csv /user/root
Of course this can be handled in a more standard way; I'll explain the difference between the two later.
dfs -mkdir -p /user/root/input
dfs -put ./sample.csv /user/root/input
Next, upload mapper.py and reducer.py to the server. Switch to the directory where you uploaded the two files, create an input subdirectory there, and copy both files into it. I don't know why it has to be done this way; in short, doing it works, and once I figure out why I'll come back and update this, so don't dwell on this detail.
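One more pit worth flagging: hadoop-streaming launches the -mapper and -reducer as ordinary commands, so both scripts should keep their #!/usr/bin/python shebang line and be marked executable; a small precaution:

chmod +x mapper.py reducer.py   # lets Hadoop execute the scripts directly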
Then you can execute the job; the command is as follows:
hadoop jar /usr/local/hadoop/hadoop-streaming-0.23.6.jar \
    -input sample.csv \
    -output output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf mapred.reduce.tasks=1 \
    -file mapper.py \
    -file reducer.py
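For what it's worth, the two -file options at the end ship the local mapper.py and reducer.py along with the job, so every task node gets its own copy of the scripts; the other flags name the HDFS input, the output directory, and which command to run for each phase.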
If you put sample.csv into input, the command is written as below; I haven't tried it though, so if anything goes wrong it's none of my business.
hadoop jar /usr/local/hadoop/hadoop-streaming-0.23.6.jar \
    -input input/sample.csv \
    -output output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf mapred.reduce.tasks=1 \
    -file mapper.py \
    -file reducer.py
Next comes the exciting moment: kneel down devoutly and press Enter.
If you get an error saying output-streaming already exists, just run dfs -rm -r /user/root/output-streaming, then jump up and press Enter again.
Don't be surprised when the screen starts flooding with log output; yes, that's how it's supposed to look.
If you see something like the following, the job succeeded:
16/08/18 18:35:20 INFO mapreduce.Job:  map 100% reduce 100%
16/08/18 18:35:20 INFO mapreduce.Job: Job job_local926114196_0001 completed successfully
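The statistics themselves end up in the job's output directory in HDFS; to read them back, a minimal sketch assuming the default part-file naming:

dfs -cat /user/root/output-streaming/part-*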