We know that the Hadoop Streaming framework uses '\t' (tab) as the separator by default: the part of each line before the first '\t' is taken as the key and the remaining content as the value; if no '\t' exists in a line, the entire line is used as the key and the value is empty. The resulting key/value pairs then serve as the reduce input. Hadoop provides configuration options for changing these separators (a combined usage sketch follows the list):
-D stream.map.output.field.separator: sets the separator between the key and the value in the map output.
-D stream.num.map.output.key.fields: sets the position of the separator in the map output; the part before this position is used as the key, and the part after it as the value.
-D map.output.key.field.separator: sets the separator used inside the key of the map output.
-D num.key.fields.for.partition: specifies how many columns of the key are used for partitioning (used together with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner).
-D stream.reduce.output.field.separator: sets the separator between the key and the value in the reduce output.
-D stream.num.reduce.output.key.fields: sets the position of the separator in the reduce output.
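For the map-side options, a minimal sketch of how they combine on comma-separated data like the example below (the output path and the field counts here are illustrative assumptions, not taken from the example): the first three comma-separated fields form the map output key, and only the first two of those are hashed for partitioning, so lines sharing the same first two fields are guaranteed to reach the same reducer.

$ hadoop streaming \
      -D stream.map.output.field.separator=, \
      -D stream.num.map.output.key.fields=3 \
      -D map.output.key.field.separator=, \
      -D num.key.fields.for.partition=2 \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      -input /app/test/test.txt \
      -output /app/test/test_result_kfb \
      -mapper ./mapper.sh -reducer ./reducer.sh \
      -file mapper.sh -file reducer.sh \
      -jobconf mapred.reduce.tasks=2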
Example:
1. Write the map program mapper.sh, the reduce program reducer.sh, and the test data test.txt
mapper.sh:
#!/bin/sh
cat

reducer.sh:
#!/bin/sh
sort

test.txt content (nine comma-separated lines):
1,2,1,1,1
1,2,2,1,1
1,2,3,1,1
1,3,1,1,1
1,3,1,1,1
1,3,2,1,1
1,3,2,1,1
1,3,3,1,1
1,3,3,1,1
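A quick way to check the two scripts before submitting the job is to simulate the streaming pipeline locally, with the shuffle replaced by a plain sort (no Hadoop involved):

$ cat test.txt | sh mapper.sh | sort | sh reducer.sh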
2. Put test.txt into HDFS and run the job in two ways.
1) Run without any separator settings:
$ hadoop fs -put test.txt /app/test/
$ hadoop streaming -input /app/test/test.txt \
      -output /app/test/test_result \
      -mapper ./mapper.sh -reducer ./reducer.sh \
      -file mapper.sh -file reducer.sh \
      -jobconf mapred.reduce.tasks=2 \
      -jobconf mapred.job.name="sep_test"
$ hadoop fs -cat /app/test/test_result/part-00000
1,2,2,1,1
1,3,1,1,1
1,3,1,1,1
1,3,3,1,1
1,3,3,1,1
$ hadoop fs -cat /app/test/test_result/part-00001
1,2,1,1,1
1,2,3,1,1
1,3,2,1,1
1,3,2,1,1
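This grouping follows from the default rule stated at the top: since test.txt contains no '\t', each whole line becomes the key (with an empty value), so the partitioner distributes the distinct lines between the two reducers, which then sort them. The split can be seen locally with plain awk (an illustration only, no Hadoop required):

$ echo "1,2,2,1,1" | awk -F'\t' '{ printf "key=[%s] value=[%s]\n", $1, $2 }'
key=[1,2,2,1,1] value=[]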
2) Run with the separators set:
$ hadoop streaming -D stream.reduce.output.field.separator=, \
      -D stream.num.reduce.output.key.fields=2 \
      -input /app/test/test.txt \
      -output /app/test/test_result_1 \
      -mapper ./mapper.sh -reducer ./reducer.sh \
      -file mapper.sh -file reducer.sh \
      -jobconf mapred.reduce.tasks=2 \
      -jobconf mapred.job.name="sep_test"
$ hadoop fs -cat /app/test/test_result_1/part-00000
1,2	1,1,1
1,2	2,1,1
1,2	3,1,1
$ hadoop fs -cat /app/test/test_result_1/part-00001
1,3	1,1,1
1,3	1,1,1
1,3	2,1,1
1,3	2,1,1
1,3	3,1,1
1,3	3,1,1
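Here the first two comma-separated fields of each reduce output line become the key and the rest becomes the value; the default text output format then writes key and value separated by a tab, which is why '1,2' and '1,1,1' appear as two columns above. A local sketch of the same split (plain awk, with the five-column layout of this data hard-coded):

$ echo "1,2,1,1,1" | awk -F',' '{ print $1","$2 "\t" $3","$4","$5 }'
1,2	1,1,1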