The Python that ships with Linux does not have NumPy, and even with Anaconda installed, Hadoop Streaming would not invoke the Anaconda Python.
It later turned out that the streaming parameters were simply not set correctly ...
Cutting to the chase:
Environment:
4 servers: master, slave1, slave2, slave3.
All have Anaconda2 and Anaconda3 installed; the default environment is Python 2. For running Anaconda2 and Anaconda3 side by side, see: Ubuntu 16.04 Linux installation of Anaconda2 and Anaconda3.
Installation directory: /home/orient/anaconda2
Hadoop version: 2.4.0
Data Preparation:
inputFile.txt contains 100 numbers in total, one per line. A sample of the data:
0.970413
0.901817
0.828698
0.197744
0.466887
0.962147
0.187294
0.388509
0.243889
0.115732
0.616292
0.713436
0.761446
0.944123
0.200903
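If you need to regenerate a test file of this shape yourself, a minimal sketch is below. The filename, the seed, and the value range are assumptions for illustration; the mapper only requires one float per line.

```python
import random

# Write 100 random values in [0, 1), one per line, as the mapper expects.
random.seed(42)  # fixed seed so the sample is reproducible
with open("inputFile.txt", "w") as f:
    for _ in range(100):
        f.write("%f\n" % random.random())
```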
Writing mrmeanmapper.py
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)            # creates a generator of input lines
input = [float(line) for line in input]  # overwrite with floats
numInputs = len(input)
input = mat(input)
sqInput = power(input, 2)

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (numInputs, mean(input), mean(sqInput))
print >> sys.stderr, "report: still alive"
Writing mrmeanreducer.py
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)  # creates a generator of input lines

# split input lines into separate items and store in list of lists
mapperOut = [line.split('\t') for line in input]

# accumulate total number of samples, overall sum and overall sum sq
cumVal = 0.0
cumSumSq = 0.0
cumN = 0.0
for instance in mapperOut:
    nj = float(instance[0])
    cumN += nj
    cumVal += nj * float(instance[1])
    cumSumSq += nj * float(instance[2])

# calculate means
mean = cumVal / cumN
meanSq = cumSumSq / cumN

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (cumN, mean, meanSq)
print >> sys.stderr, "report: still alive"
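The reducer's count-weighted combination can be sanity-checked in plain Python without Hadoop or NumPy. This sketch (function names are illustrative) produces a (count, mean, mean-of-squares) triple per partition the way the mapper does, combines the triples the way the reducer does, and checks the result against the directly computed global mean:

```python
def map_part(values):
    # emulate mrmeanmapper.py: size, mean, mean of squares
    n = len(values)
    return n, sum(values) / n, sum(v * v for v in values) / n

def reduce_parts(parts):
    # emulate mrmeanreducer.py: count-weighted combination
    cum_n = cum_val = cum_sq = 0.0
    for n, m, msq in parts:
        cum_n += n
        cum_val += n * m
        cum_sq += n * msq
    return cum_n, cum_val / cum_n, cum_sq / cum_n

data = [0.970413, 0.901817, 0.828698, 0.197744, 0.466887, 0.962147]
parts = [map_part(data[:4]), map_part(data[4:])]
n, mean_, mean_sq = reduce_parts(parts)
# combined mean over partitions equals the global mean
assert abs(mean_ - sum(data) / len(data)) < 1e-12
```

This is why the mapper must emit the count as well as the means: without the per-partition counts the reducer could not weight the partial means correctly.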
Local test of mrmeanmapper.py and mrmeanreducer.py:

cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py
I put inputFile.txt, mrmeanmapper.py, and mrmeanreducer.py all in the same directory, ~/zhangle/ch15/hh/hh.
All of the following operations are run from this directory!!!
Uploading inputFile.txt to HDFS
zhangle/mrmean-i is the target directory on HDFS:

hadoop fs -put inputFile.txt zhangle/mrmean-i
Running Hadoop Streaming

hadoop jar /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -input zhangle/mrmean-i \
    -output zhangle/output12222 \
    -file mrmeanmapper.py \
    -file mrmeanreducer.py \
    -mapper "/home/orient/anaconda2/bin/python mrmeanmapper.py" \
    -reducer "/home/orient/anaconda2/bin/python mrmeanreducer.py"
Parameter explanation:
First line: /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar is the path to my Hadoop Streaming jar.
Second line: zhangle/mrmean-i is the HDFS directory inputFile.txt was just uploaded to.
Third line: zhangle/output12222 is the output directory for the results, also on HDFS.
Fourth line: mrmeanmapper.py is the mapper program in the current directory.
Fifth line: mrmeanreducer.py is the reducer program in the current directory.
Sixth line: /home/orient/anaconda2/bin/python is the Python under the Anaconda2 directory. If it is omitted, the system's built-in Python is called instead, and that Python does not have NumPy installed!!
Seventh line: same as the sixth line.
Viewing the results

Run:

hadoop fs -cat zhangle/output12222/part-00000
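The part-00000 line has the same tab-separated layout the reducer emits (count, mean, mean of square values), so it is easy to parse, and the two means also give the population variance via E[x²] − (E[x])². A small sketch (the sample line below is illustrative, not actual job output):

```python
# Illustrative line in the reducer's output format: count \t mean \t mean(square values)
line = "100\t0.505916\t0.340232"

n_str, mean_str, mean_sq_str = line.split("\t")
n = int(n_str)
mean_ = float(mean_str)
mean_sq = float(mean_sq_str)

# population variance: E[x^2] - (E[x])^2
variance = mean_sq - mean_ ** 2
print("n=%d mean=%f variance=%f" % (n, mean_, variance))
```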
Problem Solving
1. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1"
Workaround: before running MapReduce on Hadoop, always run your Python programs locally first to check them.
Enter the folder containing the two .py scripts (map and reduce) and the data file inputFile.txt, then run the following command and check that it succeeds:

cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py
2. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2". This means either the jar file cannot be found, or the output folder already exists on HDFS.
3. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127". This is a problem with the script's environment: add the full path to the Python interpreter in the sixth and seventh lines (-mapper and -reducer) of the streaming command.
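A quick way to see which interpreter a streaming task actually launched, and whether that interpreter can see NumPy, is a tiny diagnostic script (the name and usage as a -mapper are illustrative). It writes to stderr, so the output shows up in the streaming task logs just like the "report: still alive" lines above:

```python
#!/usr/bin/env python
import sys

# Report which interpreter is running and whether NumPy is importable.
sys.stderr.write("interpreter: %s\n" % sys.executable)
try:
    import numpy
    sys.stderr.write("numpy: %s\n" % numpy.__version__)
except ImportError:
    sys.stderr.write("numpy: NOT FOUND\n")
```

If the reported interpreter is the system Python rather than /home/orient/anaconda2/bin/python, the -mapper/-reducer paths are the problem.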
Reference:
http://www.cnblogs.com/lzllovesyl/p/5286793.html
http://www.zhaizhouwei.cn/hadoop/190.html
http://blog.csdn.net/wangzhiqing3/article/details/8633208