The Python that ships with Linux does not have NumPy, and even with Anaconda installed, Hadoop Streaming would not invoke the Anaconda Python.
It later turned out that the streaming parameters were simply not set correctly ...
Cutting to the chase:
Environment:
4 servers: master, slave1, slave2, slave3.
All have Anaconda2 and Anaconda3 installed; the default environment is Python 2. For running Anaconda2 and Anaconda3 side by side, see: Ubuntu 16.04 Linux installation of Anaconda2 and Anaconda3.
Installation directory: /home/orient/anaconda2
Hadoop version: 2.4.0
Data Preparation:
inputFile.txt contains 100 numbers in total, one per line. A sample of the data:
0.970413
0.901817
0.828698
0.197744
0.466887
0.962147
0.187294
0.388509
0.243889
0.115732
0.616292
0.713436
0.761446
0.944123
0.200903
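If you need to regenerate a test file of this shape yourself, a minimal sketch is below. The filename, the seed, and the value range are assumptions for illustration; the mapper only requires one float per line.

```python
import random

# Write 100 random values in [0, 1), one per line, as the mapper expects.
random.seed(42)  # fixed seed so the sample is reproducible
with open("inputFile.txt", "w") as f:
    for _ in range(100):
        f.write("%f\n" % random.random())
```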
Writing mrmeanmapper.py
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)            # creates a generator of input lines
input = [float(line) for line in input]  # overwrite with floats
numInputs = len(input)
input = mat(input)
sqInput = power(input, 2)

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (numInputs, mean(input), mean(sqInput))
print >> sys.stderr, "report: still alive"
Writing mrmeanreducer.py
#!/usr/bin/env python
import sys
from numpy import mat, mean, power

def read_input(file):
    for line in file:
        yield line.rstrip()

input = read_input(sys.stdin)  # creates a generator of input lines

# split input lines into separate items and store in list of lists
mapperOut = [line.split('\t') for line in input]

# accumulate total number of samples, overall sum and overall sum sq
cumVal = 0.0
cumSumSq = 0.0
cumN = 0.0
for instance in mapperOut:
    nj = float(instance[0])
    cumN += nj
    cumVal += nj * float(instance[1])
    cumSumSq += nj * float(instance[2])

# calculate means
mean = cumVal / cumN
meanSq = cumSumSq / cumN

# output size, mean, mean(square values)
print "%d\t%f\t%f" % (cumN, mean, meanSq)
print >> sys.stderr, "report: still alive"
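The reducer's count-weighted combination can be sanity-checked in plain Python without Hadoop or NumPy. This sketch (function names are illustrative) produces a (count, mean, mean-of-squares) triple per partition the way the mapper does, combines the triples the way the reducer does, and checks the result against the directly computed global mean:

```python
def map_part(values):
    # emulate mrmeanmapper.py: size, mean, mean of squares
    n = len(values)
    return n, sum(values) / n, sum(v * v for v in values) / n

def reduce_parts(parts):
    # emulate mrmeanreducer.py: count-weighted combination
    cum_n = cum_val = cum_sq = 0.0
    for n, m, msq in parts:
        cum_n += n
        cum_val += n * m
        cum_sq += n * msq
    return cum_n, cum_val / cum_n, cum_sq / cum_n

data = [0.970413, 0.901817, 0.828698, 0.197744, 0.466887, 0.962147]
parts = [map_part(data[:4]), map_part(data[4:])]
n, mean_, mean_sq = reduce_parts(parts)
# combined mean over partitions equals the global mean
assert abs(mean_ - sum(data) / len(data)) < 1e-12
```

This is why the mapper must emit the count as well as the means: without the per-partition counts the reducer could not weight the partial means correctly.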
Local test of mrmeanmapper.py and mrmeanreducer.py:

cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py
I put inputFile.txt, mrmeanmapper.py, and mrmeanreducer.py all in the same directory, ~/zhangle/ch15/hh/hh.
All of the following operations are run from this directory!!!
Uploading inputFile.txt to HDFS
zhangle/mrmean-i is the target directory on HDFS:

hadoop fs -put inputFile.txt zhangle/mrmean-i
Running Hadoop Streaming

hadoop jar /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -input zhangle/mrmean-i \
    -output zhangle/output12222 \
    -file mrmeanmapper.py \
    -file mrmeanreducer.py \
    -mapper "/home/orient/anaconda2/bin/python mrmeanmapper.py" \
    -reducer "/home/orient/anaconda2/bin/python mrmeanreducer.py"
Parameter explanation:
First line: /usr/programs/hadoop-2.4.0/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar is the path to my Hadoop Streaming jar.
Second line: zhangle/mrmean-i is the HDFS directory inputFile.txt was just uploaded to.
Third line: zhangle/output12222 is the output directory for the results, also on HDFS.
Fourth line: mrmeanmapper.py is the mapper program in the current directory.
Fifth line: mrmeanreducer.py is the reducer program in the current directory.
Sixth line: /home/orient/anaconda2/bin/python is the Python under the Anaconda2 directory. If it is omitted, the system's built-in Python is called instead, and that Python does not have NumPy installed!!
Seventh line: same as the sixth line.
Viewing the results

Run:

hadoop fs -cat zhangle/output12222/part-00000
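The part-00000 line has the same tab-separated layout the reducer emits (count, mean, mean of square values), so it is easy to parse, and the two means also give the population variance via E[x²] − (E[x])². A small sketch (the sample line below is illustrative, not actual job output):

```python
# Illustrative line in the reducer's output format: count \t mean \t mean(square values)
line = "100\t0.505916\t0.340232"

n_str, mean_str, mean_sq_str = line.split("\t")
n = int(n_str)
mean_ = float(mean_str)
mean_sq = float(mean_sq_str)

# population variance: E[x^2] - (E[x])^2
variance = mean_sq - mean_ ** 2
print("n=%d mean=%f variance=%f" % (n, mean_, variance))
```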
Problem Solving
1. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1"
Workaround: before running MapReduce on Hadoop, always run your Python programs locally first to check them.
Enter the folder containing the two .py scripts (map and reduce) and the data file inputFile.txt, then run the following command and check that it succeeds:

cat inputFile.txt | python mrmeanmapper.py | python mrmeanreducer.py
2. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2". This means either the jar file cannot be found, or the output folder already exists on HDFS.
3. Error: "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127". This is a problem with the script's environment: add the full path to the Python interpreter in the sixth and seventh lines (-mapper and -reducer) of the streaming command.
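A quick way to see which interpreter a streaming task actually launched, and whether that interpreter can see NumPy, is a tiny diagnostic script (the name and usage as a -mapper are illustrative). It writes to stderr, so the output shows up in the streaming task logs just like the "report: still alive" lines above:

```python
#!/usr/bin/env python
import sys

# Report which interpreter is running and whether NumPy is importable.
sys.stderr.write("interpreter: %s\n" % sys.executable)
try:
    import numpy
    sys.stderr.write("numpy: %s\n" % numpy.__version__)
except ImportError:
    sys.stderr.write("numpy: NOT FOUND\n")
```

If the reported interpreter is the system Python rather than /home/orient/anaconda2/bin/python, the -mapper/-reducer paths are the problem.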
Reference:
http://www.cnblogs.com/lzllovesyl/p/5286793.html
http://www.zhaizhouwei.cn/hadoop/190.html
http://blog.csdn.net/wangzhiqing3/article/details/8633208