Hadoop streaming practice: splitting output files


We know that the Hadoop Streaming framework uses '\t' (tab) as its separator by default: the part of each line before the first '\t' is taken as the key, and the remaining content as the value; if no '\t' is present, the entire line is used as the key and the value is empty. These key\tvalue pairs then serve as the reduce input. Hadoop provides configuration options for changing the separators.
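To make the default behavior concrete, here is a small local sketch (not from the original article; bash is assumed) that splits lines the same way streaming does:

    # emulate streaming's default split at the first tab
    printf 'a\tb\tc\nnotab\n' | while IFS= read -r line; do
        key=${line%%$'\t'*}                       # text before the first tab
        if [ "$key" = "$line" ]; then value=''    # no tab: whole line is the key
        else value=${line#*$'\t'}; fi             # rest of the line is the value
        printf 'key=[%s] value=[%s]\n' "$key" "$value"
    done
    # first line: key=[a], value=[b<TAB>c]; second line: key=[notab], value=[]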

-D stream.map.output.field.separator: sets the separator between the key and the value in the map output.

-D stream.num.map.output.key.fields: sets how many separator-delimited fields of the map output form the key; the part before that position becomes the key, and the part after it becomes the value.

-D map.output.key.field.separator: sets the separator used inside the key of the map output.

-D num.key.fields.for.partition: sets how many key fields are used for partitioning (used together with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner; a combined sketch follows this list).

-D stream.reduce.output.field.separator: sets the separator between the key and the value in the reduce output.

-D stream.num.reduce.output.key.fields: sets how many fields of the reduce output form the key.
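As a hedged illustration of the map-side options (the input path, field layout, and record format here are hypothetical, not part of the article's test below), the four settings can be combined so that, for dot-separated records such as IP addresses, the first four fields form the sort key but only the first two are used for partitioning:

    $ hadoop streaming -D stream.map.output.field.separator=. \
          -D stream.num.map.output.key.fields=4 \
          -D map.output.key.field.separator=. \
          -D num.key.fields.for.partition=2 \
          -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
          -input /app/test/ip_list.txt \
          -output /app/test/partition_result \
          -mapper cat -reducer cat \
          -jobconf mapred.reduce.tasks=2

All records sharing the first two fields then go to the same reducer, while sorting within each reducer still uses the full four-field key.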

Example:
1. Write the map program mapper.sh, the reduce program reducer.sh, and the test data test.txt.

mapper.sh:

    #!/bin/sh
    cat

reducer.sh:

    #!/bin/sh
    sort

test.txt content (order not significant, since the job sorts):

    1,2,1,1,1
    1,2,2,1,1
    1,2,3,1,1
    1,3,1,1,1
    1,3,1,1,1
    1,3,2,1,1
    1,3,2,1,1
    1,3,3,1,1
    1,3,3,1,1
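Because mapper.sh simply passes lines through (cat) and reducer.sh sorts them, the whole job can be sanity-checked locally before submission (a quick sketch; the local sort stands in for the shuffle):

    $ chmod +x mapper.sh reducer.sh
    $ ./mapper.sh < test.txt | sort | ./reducer.sh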

2. Put test.txt into HDFS and run the job in two ways.
1) Run without any separator settings:

$ hadoop fs -put test.txt /app/test/
$ hadoop streaming -input /app/test/test.txt \
      -output /app/test/test_result \
      -mapper ./mapper.sh -reducer ./reducer.sh \
      -file mapper.sh -file reducer.sh \
      -jobconf mapred.reduce.tasks=2 \
      -jobconf mapred.job.name="sep_test"
$ hadoop fs -cat /app/test/test_result/part-00000
    1,2,2,1,1
    1,3,1,1,1
    1,3,1,1,1
    1,3,3,1,1
    1,3,3,1,1
$ hadoop fs -cat /app/test/test_result/part-00001
    1,2,1,1,1
    1,2,3,1,1
    1,3,2,1,1
    1,3,2,1,1
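Because no separator options were set and the input lines contain no tabs, the entire line is the key: the default hash partitioner assigns each distinct line to one of the two reducers, and each part file is sorted only within itself. A globally sorted view can be obtained by merging the parts (a sketch):

    $ hadoop fs -cat /app/test/test_result/part-00000 /app/test/test_result/part-00001 | sort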

2) Run with the separator options set:

$ hadoop streaming -D stream.reduce.output.field.separator=, \
      -D stream.num.reduce.output.key.fields=2 \
      -input /app/test/test.txt \
      -output /app/test/test_result_1 \
      -mapper ./mapper.sh -reducer ./reducer.sh \
      -file mapper.sh -file reducer.sh \
      -jobconf mapred.reduce.tasks=2 \
      -jobconf mapred.job.name="sep_test"
$ hadoop fs -cat /app/test/test_result_1/part-00000
    1,2     1,1,1
    1,2     2,1,1
    1,2     3,1,1
$ hadoop fs -cat /app/test/test_result_1/part-00001
    1,3     1,1,1
    1,3     1,1,1
    1,3     2,1,1
    1,3     2,1,1
    1,3     3,1,1
    1,3     3,1,1
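With these two options, each reduce output line is split at the second comma: the first two fields (e.g. "1,2") become the key and the remaining fields become the value, and the output format writes them separated by a tab, which is exactly what the part files above show. The same split can be reproduced locally (a sketch using cut, not from the original article):

    $ line='1,2,1,1,1'
    $ printf '%s\t%s\n' "$(echo "$line" | cut -d, -f1-2)" "$(echo "$line" | cut -d, -f3-)"
    1,2     1,1,1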
