Hadoop Streaming Parameter Configuration

Streaming Introduction

Hadoop Streaming is a programming tool provided by Hadoop. The Streaming framework allows any executable or script file to be used as the mapper and reducer of a Hadoop MapReduce job, which makes it easy to migrate existing programs to the Hadoop platform and gives Hadoop considerable extensibility.

The principle of Streaming: the mapper and reducer read data line by line from standard input and write their results to standard output. The Streaming tool creates the MapReduce job, submits it to the TaskTrackers, and monitors the progress of the whole job.

If a file (an executable or a script) is used as the mapper, each mapper task starts that file as a separate process. While the task runs, it splits its input into lines and feeds each line to the standard input of the executable process; at the same time, the mapper collects the standard output of the process and converts every line it receives into a key/value pair, which becomes the output of the mapper. By default, the part of a line before the first tab is the key and the part after it is the value; if a line contains no tab, the whole line is the key and the value is null.
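Because the mapper and reducer only read standard input and write standard output, they can be tested locally with an ordinary shell pipeline before a job is submitted. A minimal sketch, assuming user-written map.py and reduce.py scripts and a hypothetical sample_input.txt, with sort standing in for the framework's shuffle:

cat sample_input.txt | python map.py | sort | python reduce.py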

For detailed parameter tuning, see http://www.uml.org.cn/zjjs/201205303.asp

Basic usage

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar [options]
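For instance, a minimal job that uses standard Unix tools as the mapper and reducer might look like the following sketch (the input and output paths are hypothetical):

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc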

Options

-input: input file path
-output: output file path
-mapper: the user-written mapper program; can be an executable file or a script
-reducer: the user-written reducer program; can be an executable file or a script
-file: files packaged and submitted with the job for the mapper or reducer to use, such as configuration files, dictionaries, etc.
-partitioner: a user-defined partitioner class
-combiner: a user-defined combiner program
-D: job properties (formerly specified with -jobconf)
Example 1:
${bin_hadoop} fs -rm -r "${output}"
${bin_hadoop} jar "${jarpack}" \
    -input "${input}/part*" \
    -output "${output}" \
    -mapper "${python_bin} map.py" \
    -reducer "${python_bin} reduce.py" \
    -jobconf mapred.job.name="***" \
    -jobconf mapred.job.priority=VERY_HIGH \
    -jobconf mapred.job.map.capacity=200 \
    -jobconf mapred.job.reduce.capacity=200 \
    -jobconf mapred.reduce.tasks=200 \
    -jobconf mapred.min.split.size=20480000000 \
    -jobconf mapred.map.memory.limit=2000 \
    -jobconf mapred.map.over.capacity.allowed=true \
    -cacheArchive "${python_pack}#taskenv" \
    -file map.py \
    -file init.conf

The example above packages the Python environment for the job (that is, it tars bin, include, lib and share from the local Python installation), uploads the archive to HDFS (${python_pack}), and uses the -cacheArchive option to distribute it to the compute nodes, where it is extracted into the taskenv directory; hence python_bin="taskenv/python27/bin/python". Note that when building an archive locally you should cd into the directory (e.g. app) and tar its contents rather than tarring it from the parent directory; otherwise a file such as mapper.pl could only be reached as app/app/mapper.pl.
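A sketch of how such an archive might be built and uploaded (the local Python path and the HDFS destination are assumptions):

# the archive contains a top-level python27/ directory, which matches
# python_bin="taskenv/python27/bin/python" after extraction under the taskenv symlink
cd /usr/local && tar -czf /tmp/python27.tgz python27
hadoop fs -put /tmp/python27.tgz /user/me/tools/python27.tgz   # this HDFS path is then used as ${python_pack}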

Example 2:

${bin_hadoop} fs -rm -r "${output}"
${bin_hadoop} jar "${jarpack}" \
    -D mapreduce.map.memory.mb=2800 \
    -D mapred.map.tasks=400 \
    -D stream.non.zero.exit.is.failure=false \
    -file map.py \
    -input "${input}" \
    -output "${output}" \
    -mapper map.py \
    -jobconf mapred.map.over.capacity.allowed=true \
    -cacheFile "${python_file}#tasklink" \
    -jobconf mapred.job.map.capacity=20 \
    -jobconf mapreduce.job.name="***"

The example above distributes a single file: map.py can access the ${python_file} file directly through "./tasklink".

Limitations of Streaming

Hadoop Streaming can only handle text data; it cannot process binary data directly. In addition, the mapper and reducer can by default only write data to standard output, so producing multiple outputs is not convenient.

Basic configuration parameters

-files, -archives and -libjars are generic options of the hadoop command:
-file packs the local client file into the submitted jar, uploads it to HDFS, and then distributes it to the compute nodes
-cacheFile distributes a file already on HDFS to the compute nodes
-cacheArchive distributes an archive on HDFS to the compute nodes and extracts it; a symbolic link can also be specified
-files: distributes the specified local/HDFS files to the working directory of each task without any further processing, for example: -files hdfs://host:fs_port/user/testfile.txt#testlink
-archives: copies the specified archive to the working path of the current task and unpacks it automatically, for example: -archives hdfs://host:fs_port/user/testfile.jar#testlink3. In this example the symbolic link testlink3 is created in the working path of the current task and points to the unpacked directory
-libjars: specifies jar packages to distribute; after Hadoop ships them to each node they are automatically added to the task's CLASSPATH environment variable
-input: specifies the job input, which can be a file or a directory; the * wildcard is allowed, and multiple files or directories may be given as input
-output: specifies the job output directory, which must not already exist and which the user must have permission to create; only one -output may be given
-mapper: specifies the mapper executable or Java class; required and unique
-reducer: specifies the reducer executable or Java class
-numReduceTasks: specifies the number of reducers; if it is set to 0, or -reducer NONE is given, no reducer runs and the mapper output becomes the output of the whole job
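As a sketch of how a distributed side file is used: a dictionary shipped with -file (or with -cacheFile plus a symlink name) appears in the task's working directory, so the mapper can open it with a relative path. The dictionary name and the paths below are hypothetical:

${bin_hadoop} jar "${jarpack}" \
    -input /user/me/input \
    -output /user/me/output \
    -mapper "python map.py" \
    -file map.py \
    -file dict.txt        # map.py can read it as ./dict.txt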

Other parameter configuration

In general, use -jobconf | -D name=value to specify job parameters for the mapper or reducer tasks. Officially -jobconf is deprecated and -D should be used instead. Note that -D must come first among the options, because these properties have to be fixed before the mapper and reducer start, so they should appear at the front of the command line, e.g. ./bin/hadoop jar hadoop-0.19.2-streaming.jar -D ... -input ...

mapred.job.name="jobname": sets the job name (strongly recommended)
mapred.job.priority=VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW: sets the job priority
mapred.job.map.capacity=M: run at most M map tasks at the same time
mapred.job.reduce.capacity=N: run at most N reduce tasks at the same time
mapred.map.tasks: sets the number of map tasks
mapred.reduce.tasks: sets the number of reduce tasks
mapred.job.groups: the compute node groups on which the job may run
mapred.task.timeout: the maximum time a task may go without responding (no input or output)
mapred.compress.map.output: whether to compress the map output
mapred.map.output.compression.codec: the codec used to compress the map output
mapred.output.compress: whether to compress the reduce (job) output
mapred.output.compression.codec: the codec used to compress the reduce output
stream.map.output.field.separator: specifies the map output separator; by default the Streaming framework takes the part of each map output line before the first '\t' as the key and the rest as the value, and these key/value pairs become the input of reduce
stream.num.map.output.key.fields: sets the position of the separator; if set to 2, the line is split at the 2nd separator, with everything before it as the key and everything after it as the value; if the line contains fewer than 2 separators, the whole line becomes the key and the value is empty
stream.reduce.output.field.separator: specifies the reduce output separator
stream.num.reduce.output.key.fields: specifies the position of the reduce output separator
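Putting several of these together, a sketch of a job that sets the name, priority, reduce count and map output separator with -D (the job name, paths and scripts are hypothetical):

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -D mapred.job.name="demo_job" \
    -D mapred.job.priority=HIGH \
    -D mapred.reduce.tasks=10 \
    -D stream.map.output.field.separator=, \
    -D stream.num.map.output.key.fields=2 \
    -input /user/me/input \
    -output /user/me/output \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file map.py \
    -file reduce.py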

Among these, the sorting and partitioning parameters are used most often:
map.output.key.field.separator: the separator used inside the key of the map output
num.key.fields.for.partition: the number of key columns used for partitioning (bucketing), after the key has been split by the separator above. In plain terms, records whose first N key columns are identical are sent to the same reducer.
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner: the two parameters above must be used together with this -partitioner option.

stream.map.output.field.separator: the separator between key and value in the map output
stream.num.map.output.key.fields: the position of that separator in the map output
stream.reduce.output.field.separator: the separator between key and value in the reduce output
stream.num.reduce.output.key.fields: the position of that separator in the reduce output

In addition, there are compression parameters

Whether the job output is compressed:
mapred.output.compress: whether to compress; default false
mapred.output.compression.type: compression type, one of NONE, RECORD or BLOCK; default RECORD
mapred.output.compression.codec: compression codec; default org.apache.hadoop.io.compress.DefaultCodec

Whether the map task output is compressed:
mapred.compress.map.output: whether to compress; default false
mapred.map.output.compression.codec: compression codec; default org.apache.hadoop.io.compress.DefaultCodec
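For example, compression of both the map output and the final job output could be switched on by adding flags like the following to one of the streaming commands above (a sketch; GzipCodec is one of the codecs bundled with Hadoop):

    -jobconf mapred.compress.map.output=true \
    -jobconf mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec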

In addition, Hadoop itself ships with some handy mappers and reducers:

1. Hadoop Aggregate package: aggregate provides a special reducer class and a special combiner class, together with a series of "aggregators" (for example sum, max, min) that aggregate a sequence of values. Aggregate lets you define a mapper plug-in class that generates an "aggregatable item" for each input key/value pair of the mapper; the combiner/reducer then aggregates these items with the appropriate aggregator. To use aggregate, simply specify "-reducer aggregate" (a sketch follows after this list).
2. Field selection (similar to Unix "cut"): Hadoop's utility class org.apache.hadoop.mapred.lib.FieldSelectionMapReduce helps users process text data efficiently, much like the Unix "cut" tool. The map function in this class treats the input key/value as a list of fields; the user can specify the field separator (tab by default) and select any range of fields (one or more fields from the list) as the key or value of the map output. Similarly, the reduce function treats the input key/value as a list of fields from which the user can select any range as the key or value of the reduce output (also sketched below).
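A minimal sketch of the Aggregate package in use, with hypothetical paths and file names: the mapper emits lines of the form "LongValueSum:<key>\t<value>", and "-reducer aggregate" sums the values per key.

cat > agg_mapper.sh <<'EOF'
#!/bin/sh
# split the input into words and prefix each word with the aggregator name,
# so that the aggregate reducer knows to sum the trailing counts
tr ' ' '\n' | awk 'NF { print "LongValueSum:" $1 "\t" 1 }'
EOF

${bin_hadoop} jar "${jarpack}" \
    -input /user/me/input \
    -output /user/me/agg_output \
    -mapper "sh agg_mapper.sh" \
    -reducer aggregate \
    -file agg_mapper.sh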
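Field selection is driven purely by configuration. A sketch, assuming the mapreduce.fieldsel.* property names of recent Hadoop 2.x releases (older releases use different names) and hypothetical dot-separated input data; a spec such as "6,5,1-3:0-" means the key is built from fields 6, 5 and 1-3 and the value from field 0 onwards:

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -D mapreduce.fieldsel.data.field.separator=. \
    -D mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \
    -D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \
    -input /user/me/input \
    -output /user/me/output \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce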

Secondary sort with the partitioner

A typical MapReduce flow is input -> map -> partition -> reduce -> output. The partition step is responsible for distributing the intermediate results of the map tasks to the different reduce tasks by key. Hadoop provides a very practical partitioner class for this, KeyFieldBasedPartitioner, used as follows.

It is generally used together with:

-D map.output.key.field.separator   ### the separator inside the key
-D num.key.fields.for.partition     ### partition on the first N fields of the key, not the whole key

Note that map.output.key.field.separator does not seem to work when the separator is a tab; this remains to be verified.

Example 3:

Test data:
1,2,1,1,1
1,2,2,1,1
1,3,1,1,1
1,3,2,1,1
1,3,3,1,1
1,2,3,1,1
1,3,1,1,1
1,3,2,1,1
1,3,3,1,1

-D stream.map.output.field.separator=, \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=, \
-D num.key.fields.for.partition=2 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Result data:

$ hadoop fs -cat /app/test/test_result/part-00003
1,2,1,1 1
1,2,2,1 1
1,2,3,1 1

$ hadoop fs -cat /app/test/test_result/part-00004
1,3,1,1 1
1,3,1,1 1
1,3,2,1 1
1,3,2,1 1
1,3,3,1 1
1,3,3,1 1

Example 4:

Test data:
d&10.2.3.40 1
d&11.22.33.33 1
W&www.renren.com 1
W&www.baidu.com 1
d&10.2.3.40 1
Then pass the following command arguments:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner   # enables partitioning on key fields (secondary sort)
-jobconf map.output.key.field.separator='&'                          # without the two single quotes the command failed
-jobconf num.key.fields.for.partition=1                              # split at the first '&' so that partitioning works without error

-combiner: a user-defined combiner can be supplied in the same way as the mapper and reducer.
Comparator class: for sorting on key fields, the companion class org.apache.hadoop.mapred.lib.KeyFieldBasedComparator can be used.
