Hadoop Streaming supports using shell scripts as the map and reduce programs. The following example counts the total number of lines across all input files in a distributed manner.
1. Put the data to be processed into HDFS:
$ hadoop fs -put localfile /user/hadoop/hadoopfile
2. Write the map and reduce scripts. Remember to make the scripts executable (chmod +x).
mapper.sh:
#!/bin/sh
wc -l

reducer.sh:
#!/bin/sh
sum=0
while read i
do
  let sum+=$i
done
echo $sum
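Before submitting the job, the two scripts can be sanity-checked locally by piping sample input through them by hand. This is a minimal sketch, not part of the original tutorial; the `map` and `reduce` functions mirror mapper.sh and reducer.sh above, except that POSIX `$((...))` arithmetic is used in place of bash's `let` so it also runs under strict /bin/sh implementations such as dash:

```shell
#!/bin/sh
# Local sanity check of the streaming pipeline, no cluster needed.
# map() mirrors mapper.sh: emit the line count of its input.
map() {
  wc -l
}
# reduce() mirrors reducer.sh: sum the counts read from stdin.
reduce() {
  sum=0
  while read i
  do
    sum=$((sum + i))
  done
  echo "$sum"
}
# Three sample lines -> map emits 3 -> reduce sums to 3.
printf 'a\nb\nc\n' | map | reduce
```

If this prints the expected count, the same scripts should behave identically when Streaming runs them on the cluster, since Streaming simply feeds each task's input to the script on stdin and collects its stdout.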
3. Run:
$ hadoop streaming -input /user/hadoop/hadoopfile -output /user/hadoop/result -mapper ./mapper.sh -reducer ./reducer.sh -file mapper.sh -file reducer.sh -jobconf mapred.reduce.tasks=1 -jobconf mapred.job.name="sum_test"
Note:
-input /user/hadoop/hadoopfile: directory of the files to be processed
-output /user/hadoop/result: directory where the results are stored
-file: ships the local mapper.sh and reducer.sh to the compute nodes along with the job
-jobconf mapred.reduce.tasks=1: number of reduce tasks
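The reason the reducer sums rather than counts directly: each map task handles one input split and emits only that split's line count, and mapred.reduce.tasks=1 funnels all the partial counts into a single reducer that adds them up. A hedged local sketch of that flow, with two simulated splits (the function bodies mirror the scripts from step 2, using POSIX arithmetic instead of `let`):

```shell
#!/bin/sh
# Emulate two map tasks (one per input split) feeding one reducer.
map() { wc -l; }
reduce() {
  sum=0
  while read i
  do
    sum=$((sum + i))
  done
  echo "$sum"
}
# Split 1 has 2 lines, split 2 has 3 lines:
# the maps emit the partial counts 2 and 3, the reducer totals them to 5.
{
  printf 'a\nb\n' | map
  printf 'c\nd\ne\n' | map
} | reduce
```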
4. View the result:
$ hadoop fs -cat /user/hadoop/result/part-00000