Hadoop Streaming Anaconda Python calculates the average

Source: Internet
Author: User
Tags: stdin, python script, hadoop fs

The stock Linux Python did not have NumPy installed, and even after installing Anaconda, Hadoop Streaming would not invoke the Anaconda Python.

It later turned out that the parameters were simply not set up correctly ...

Cutting to the chase:

Environment:

4 servers: master, slave1, slave2, slave3.

All have both Anaconda2 and Anaconda3 installed, with Python 2 as the main environment. For Anaconda2/Anaconda3 coexistence, see: Ubuntu 16.04 Linux installation of Anaconda2 and Anaconda3

Installation directory: /home/orient/anaconda2

Hadoop version 2.4.0

Data preparation:

inputFile.txt contains 100 numbers in total, one per line. The first 15 values (the full file is in the data download):

0.970413
0.901817
0.828698
0.197744
0.466887
0.962147
0.187294
0.388509
0.243889
0.115732
0.616292
0.713436
0.761446
0.944123
0.200903
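If you do not have the book's original data file, a stand-in with the same one-number-per-line format can be generated with a short script (the values will differ from the original 100 numbers; the seed is arbitrary):

```python
import random

# Generate a stand-in inputFile.txt: 100 random floats in [0, 1),
# one per line, matching the format the mapper expects.
random.seed(42)  # reproducible sample
with open("inputFile.txt", "w") as f:
    for _ in range(100):
        f.write("%f\n" % random.random())
```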

Writing mrmeanmapper.py:

```python
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)            # creates a list of input lines
input = [float(line) for line in input]  # overwrite with floats
numInputs = len(input)
input = mat(input)
sqInput = power(input, 2)

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (numInputs, mean(input), mean(sqInput))  # calc mean of columns
print >> sys.stderr, "report: still alive"
```

Writing mrmeanreducer.py:

```python
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)  # creates a list of input lines

# split input lines into separate items and store in list of lists
mapperOut = [line.split('\t') for line in input]

# accumulate total number of samples, overall sum and overall sum sq
cumVal = 0.0
cumSumSq = 0.0
cumN = 0.0
for instance in mapperOut:
    nj = float(instance[0])
    cumN += nj
    cumVal += nj * float(instance[1])
    cumSumSq += nj * float(instance[2])

# calculate means
mean = cumVal / cumN
meanSq = cumSumSq / cumN

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (cumN, mean, meanSq)
print >> sys.stderr, "report: still alive"
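The reducer merges per-chunk statistics by weighting each chunk's mean by its sample count. This can be sanity-checked locally without Hadoop; the sketch below (plain Python, no NumPy, with illustrative names and data) computes each chunk's (n, mean, mean-of-squares) triple the way the mapper does, merges the triples the way the reducer does, and the result matches the single-pass statistics over the whole dataset:

```python
def map_chunk(values):
    # per-chunk statistics, as mrmeanmapper.py computes them
    n = len(values)
    m = sum(values) / n
    msq = sum(v * v for v in values) / n
    return n, m, msq

def reduce_stats(stats):
    # weighted merge, as mrmeanreducer.py computes it
    cum_n = cum_val = cum_sumsq = 0.0
    for n, m, msq in stats:
        cum_n += n
        cum_val += n * m
        cum_sumsq += n * msq
    return cum_n, cum_val / cum_n, cum_sumsq / cum_n

data = [i / 100.0 for i in range(100)]               # illustrative data
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # 4 "mapper" chunks
n, m, msq = reduce_stats(map_chunk(c) for c in chunks)
# (n, m, msq) matches the single-pass statistics over all 100 values
```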

Local test of mrmeanmapper.py and mrmeanreducer.py:

cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py

I put inputFile.txt, mrmeanmapper.py and mrmeanreducer.py all in the same directory: ~/zhangle/ch15/hh/hh

All of the following operations are also run from this directory!

Uploading inputFile.txt to HDFS

zhangle/mrmean-i is the directory on HDFS:

hadoop fs -put inputFile.txt zhangle/mrmean-i
Running Hadoop Streaming:

```shell
hadoop jar /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -input zhangle/mrmean-i \
    -output zhangle/output12222 \
    -file mrmeanmapper.py \
    -file mrmeanreducer.py \
    -mapper "/home/orient/anaconda2/bin/python mrmeanmapper.py" \
    -reducer "/home/orient/anaconda2/bin/python mrmeanreducer.py"
```

Parameter explanation:

First line: /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar is the path to my Hadoop Streaming jar.

Second line: zhangle/mrmean-i is the HDFS directory inputFile.txt was just uploaded to.

Third line: zhangle/output12222 is the result output directory, also on HDFS.

Fourth line: mrmeanmapper.py is the mapper program in the current directory.

Fifth line: mrmeanreducer.py is the reducer program in the current directory.

Sixth line: /home/orient/anaconda2/bin/python is the Python under the Anaconda2 directory. If it is omitted, the system's built-in Python is called instead, and that Python does not have NumPy installed!

Seventh line: same as the sixth line.

To view the results of the run:

hadoop fs -cat zhangle/output12222/part-00000
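part-00000 holds a single tab-separated line: count, mean, and mean of squares. A small helper (hypothetical, with made-up example values) shows how to parse that line and also recover the variance via E[x²] − (E[x])²:

```python
def parse_result(line):
    # part-00000 holds one line: "<n>\t<mean>\t<meanSq>"
    n, m, msq = line.rstrip("\n").split("\t")
    n, m, msq = float(n), float(m), float(msq)
    variance = msq - m * m  # E[x^2] - (E[x])^2
    return n, m, variance

# hypothetical output line, for illustration only
n, m, var = parse_result("100\t0.500000\t0.330000\n")
```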

Problem Solving

1. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1"

Workaround:

Before running MapReduce on Hadoop, be sure to run your Python programs locally to check them:

    • First enter the folder containing the two py scripts (map and reduce) and the data file inputFile.txt, then run the following command to see whether the pipeline passes:

    • cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py

2. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2", or the jar file cannot be found, or the output folder already exists.

  • Add #!/usr/bin/env python as the first line of mapper.py and reducer.py.
  • In the Hadoop Streaming command, be sure to use the following format:

```shell
hadoop jar /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -input zhangle/mrmean-i \
    -output zhangle/output12222 \
    -file mrmeanmapper.py \
    -file mrmeanreducer.py \
    -mapper "/home/orient/anaconda2/bin/python mrmeanmapper.py" \
    -reducer "/home/orient/anaconda2/bin/python mrmeanreducer.py"
```

  • Make sure the path to the jar file is correct. In Hadoop 2.4 it is at $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar; other Hadoop versions may differ slightly.
  • The output folder on HDFS (in the referenced example, /user/hadoop/mr-ouput13) must be a new, previously nonexistent folder. Even if a previous Hadoop Streaming command did not succeed, the output folder is still created according to your command, so reusing the same output folder makes the next run fail with an "output folder already exists" error.
  • The -file parameter is followed by the mapper and reducer scripts; the path can be a full absolute path or the current path, and the current path must contain the mapper and reducer scripts. After the -mapper and -reducer parameters, however, you need to specify the Python interpreter for the script and enclose the whole command in quotation marks.
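Since the output folder must not already exist, one way to avoid the "already exists" error on repeated runs is to generate a fresh, timestamped output path for each run. A minimal sketch (the zhangle/output prefix is just this article's example):

```python
import time

def fresh_output_dir(prefix="zhangle/output"):
    # Append a timestamp so each streaming run writes to a new HDFS path.
    return "%s-%s" % (prefix, time.strftime("%Y%m%d-%H%M%S"))

out_dir = fresh_output_dir()
```

Alternatively, delete the stale directory first with `hadoop fs -rm -r zhangle/output12222` before rerunning.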

3. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127".

This is a problem with the script's environment: add the Python environment directory in the sixth and seventh lines, as described above.
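Code 127 means the shell could not find or execute the interpreter given in -mapper/-reducer. A quick sanity check is to run a tiny script with exactly that interpreter path and confirm which Python actually executes and whether NumPy is importable (a diagnostic sketch, not part of the job):

```python
import sys

# Print the interpreter that is actually executing this script;
# it should be the Anaconda2 path given in the -mapper/-reducer strings.
print(sys.executable)

try:
    import numpy
    print("numpy is available: %s" % numpy.__version__)
except ImportError:
    print("numpy is NOT available in this interpreter")
```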

Reference:

http://www.cnblogs.com/lzllovesyl/p/5286793.html

http://www.zhaizhouwei.cn/hadoop/190.html

http://blog.csdn.net/wangzhiqing3/article/details/8633208
