Hadoop Streaming Parameters in Detail


Original article: Hadoop streaming, by Tivoli_chen

1 Hadoop streaming

Hadoop streaming is a utility that ships with Hadoop. It allows users to create and run MapReduce jobs whose mapper and/or reducer is any executable program or script. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc


2 How Hadoop streaming works

In the example above, both the mapper and the reducer are executable programs that read their input from stdin and write their output to stdout. Hadoop streaming creates the MapReduce job, submits it to the cluster, and monitors its execution until it completes.

When the mapper is an executable, each mapper task launches it as a separate process during task initialization. As the mapper task runs, its input is converted into lines and fed to the process's stdin; the lines the process writes to stdout are collected and converted into key-value pairs, which become the output of the mapper. By default, the portion of a line up to the first tab character is the key and the remainder of the line (excluding the tab) is the value. If a line contains no tab character, the entire line is taken as the key and the value is empty. This behavior can be customized, as described later.

When the reducer is an executable, each reducer task launches it as a separate process during task initialization. As the reducer task runs, its input key-value pairs are converted into lines and fed to the process's stdin; the lines the process writes to stdout are collected and converted back into key-value pairs, which become the output of the reducer. By default, the portion of a line up to the first tab character is the key and the remainder is the value. This behavior can also be customized.

This is the basic communication protocol between the MapReduce framework and the streaming mapper and reducer.
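For illustration only (this example is not from the original text, and the file names wordcount_mapper.py and wordcount_reducer.py are made up), a minimal word-count mapper and reducer written against this protocol could look like the following Python scripts. The mapper emits one "word<tab>1" line per word; because the framework sorts the map output by key before it reaches the reducer, counts for the same word arrive on adjacent lines and can be summed in a single pass.

#!/usr/bin/env python
# wordcount_mapper.py (hypothetical): read lines from stdin, emit "word\t1".
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t" + "1")

#!/usr/bin/env python
# wordcount_reducer.py (hypothetical): input lines arrive sorted by key, so
# occurrences of the same word are adjacent and can be summed in one pass.
import sys

current_word = None
count = 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word = word
        count = 0
    count += int(value or 1)
if current_word is not None:
    print(current_word + "\t" + str(count))

These scripts would be passed as -mapper wordcount_mapper.py and -reducer wordcount_reducer.py, and shipped to the cluster with -file as described in section 3.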

Users can also specify a Java class as the mapper and/or the reducer, for example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

Users can set stream.non.zero.exit.is.failure to true or false to specify whether a streaming task that exits with a non-zero status is considered failed or successful.
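As an aside (an illustrative sketch, not from the original text): "non-zero exit" refers to the exit status of the mapper or reducer process itself. A hypothetical mapper that aborts on malformed input might look like the following; with stream.non.zero.exit.is.failure=true the task would then be marked as failed, while with false the non-zero exit status would be ignored.

#!/usr/bin/env python
# Hypothetical validating mapper: exits with a non-zero status on bad input.
import sys

for line in sys.stdin:
    if "\t" not in line:
        sys.exit(1)        # non-zero exit: task is marked failed if the flag is true
    print(line.rstrip("\n"))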

3 Packaging files with job submissions (the -file option)

Users can specify any executable as the mapper or the reducer. The executables do not need to exist on the cluster machines beforehand; if they do not, the -file option ships them to the cluster as part of the job submission. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mypythonscript.py \
    -reducer /bin/wc \
    -file mypythonscript.py

This example uses a Python script as the mapper. The option -file mypythonscript.py causes the script to be shipped to the cluster machines as part of the job submission. Besides the executables themselves, users can also package other files that the mapper or reducer may need, for example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mypythonscript.py \
    -reducer /bin/wc \
    -file mypythonscript.py \
    -file MyDictionary.txt
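As an illustration (not part of the original text): files shipped with -file are placed in the task's current working directory, so a mapper such as mypythonscript.py can open MyDictionary.txt by its bare file name. A minimal sketch, assuming the dictionary holds one word per line and the mapper should only emit words that appear in it:

#!/usr/bin/env python
# mypythonscript.py (illustrative sketch): the dictionary shipped with -file
# is available in the task's working directory and can be opened by name.
import sys

with open("MyDictionary.txt") as f:
    dictionary = set(word.strip() for word in f if word.strip())

for line in sys.stdin:
    for word in line.split():
        if word in dictionary:
            print(word + "\t" + "1")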

4 Streaming options and usage

4.1 Map-only jobs

Sometimes the user only needs to process the input data with the mapper. To do this, simply set mapred.reduce.tasks=0; the MapReduce framework then creates no reduce tasks, and the output of the mapper becomes the final output of the job. For backward compatibility, Hadoop streaming also supports the -reducer NONE option, which is equivalent to -D mapred.reduce.tasks=0.

4.2 Other options to define jobs

As with a normal MapReduce job, users can specify additional options for a Hadoop streaming job:

-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner JavaClassName

The class supplied for the input format should return key-value pairs of the Text class. If no InputFormat class is specified, TextInputFormat is used by default. Because TextInputFormat returns keys of the LongWritable class, which are not part of the input data, the keys are discarded and only the values are passed to the streaming mapper. The class supplied for the output format should expect key-value pairs of the Text class. If no OutputFormat class is specified, TextOutputFormat is used by default.

4.3 Large files and archives in Hadoop streaming

The -files and -archives options make files and archives available to the tasks. The argument is the URI of a file or archive that has already been uploaded to HDFS; these files and archives are cached across jobs. The host and fs_port values can be obtained from the fs.default.name configuration variable.

Example of the -files option:

-files hdfs://host:fs_port/user/testfile.txt#testlink

The part after the # in the URI becomes the name of a symbolic link created in the current working directory of each task, so tasks can access the file through that local link. Multiple files can be specified as follows:

-files hdfs://host:fs_port/user/testfile1.txt#testlink1 -files hdfs://host:fs_port/user/testfile2.txt#testlink2

The -archives option copies an archive into the working directory of the task and automatically unpacks it. For example,

-archives hdfs://host:fs_port/user/testfile.jar#testlink3

In this example, the symbolic link testlink3 is created in the working directory of the task.

The symbolic link points to the directory into which the jar archive was unpacked. Another example of the -archives option: the file input.txt contains two lines that name two files, testlink/cache.txt and testlink/cache2.txt.

testlink is a symbolic link to the unpacked archive directory, which contains the two files cache.txt and cache2.txt.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input "/user/me/samples/cachefile/input.txt" \
    -mapper "xargs cat" \
    -reducer "cat" \
    -output "/user/me/samples/cachefile/out" \
    -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
    -D mapred.map.tasks=1 \
    -D mapred.reduce.tasks=1 \
    -D mapred.job.name="Experiment"

$ ls test_jar/
cache.txt  cache2.txt

$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt (deflated 3%)
adding: cache2.txt (deflated 5%)

$ hadoop dfs -put cachedir.jar samples/cachefile

$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
testlink/cache.txt
testlink/cache2.txt

$ cat test_jar/cache.txt
This is just the cache string

$ cat test_jar/cache2.txt
This is just the second cache string

$ hadoop dfs -ls /user/me/samples/cachefile/out
Found 1 items
/user/me/samples/cachefile/out/part-00000 <r 3>

$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string
This is just the second cache string


4.4 Define other configuration variables for jobs

Users can use -D <name>=<value> to specify additional configuration variables. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc \
    -D mapred.reduce.tasks=2

The option -D mapred.reduce.tasks=2 sets the number of reduce tasks for the job to 2.


4.5 Other supported options

Streaming supports the common Hadoop command-line (generic) options. The general command form is:

bin/hadoop command [genericOptions] [commandOptions]

Change the local temporary directory:

-D dfs.data.dir=/tmp

Specify additional local temporary directories:

-D mapred.local.dir=/tmp/local
-D mapred.system.dir=/tmp/system
-D mapred.temp.dir=/tmp/temp

Set an environment variable for the streaming command:

-cmdenv EXAMPLE_DIR=/home/example/dictionaries/


5 More examples of usage

5.1 Customizing how lines are split into key-value pairs

When the MapReduce framework reads a line from the mapper's stdout, it splits the line into a key-value pair. By default, everything before the first tab character is the key and the rest of the line (excluding the tab) is the value.

However, users can change this default. A different separator character can be specified (instead of the default tab), and the key can be defined to end at the nth occurrence of the separator rather than the first. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4

The option -D stream.map.output.field.separator=. specifies '.' as the field separator for the map output: everything up to the fourth '.' in a line is the key, and the remainder is the value. If a line contains fewer than four '.' characters, the entire line becomes the key and the value is empty.

Similarly, the options -D stream.reduce.output.field.separator=SEP and -D stream.num.reduce.output.fields=NUM control how the lines of reduce output are split into key-value pairs.
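For illustration (not part of the original text), the splitting described above behaves like the following Python sketch:

# How a map output line is split when the separator is "." and the
# number of key fields is 4.
def split_line(line, sep=".", num_key_fields=4):
    fields = line.split(sep)
    if len(fields) <= num_key_fields:
        return line, ""                  # too few separators: whole line is the key
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    return key, value

print(split_line("11.12.1.2.extra"))     # ('11.12.1.2', 'extra')
print(split_line("11.12.1.2"))           # ('11.12.1.2', '')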

5.2 Useful Partitioner Classes

Hadoop has a class, KeyFieldBasedPartitioner, that is useful for many applications. It allows the MapReduce framework to partition the map output on selected fields of the key rather than on the whole key. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.reduce.tasks=12

Note: -k1,2 specifies that, after the key is split on map.output.key.field.separator, fields 1 and 2 are used as the partition key.

For example, consider the following map output (keys):

11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

This output is partitioned across 3 reducers, using the first 2 fields of the key as the partition key:

11.11.4.1

-----------

11.12.1.2

11.12.1.1

-----------

11.14.2.3

11.14.2.2

Within each partition, the keys are sorted for the reducer (all 4 fields of the key are used for sorting):

11.11.4.1

-----------

11.12.1.1

11.12.1.2

-----------

11.14.2.2

11.14.2.3
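For illustration (not from the original text), the partitioning shown above is conceptually what KeyFieldBasedPartitioner does: only the selected leading key fields are hashed, so keys that share those fields always land in the same reducer. A rough Python sketch (the real partitioner uses its own hash function, so the actual partition numbers may differ):

# Conceptual sketch: partition on the first 2 of the '.'-separated key fields.
def partition(key, num_reducers, sep=".", num_partition_fields=2):
    partition_key = sep.join(key.split(sep)[:num_partition_fields])
    return hash(partition_key) % num_reducers

for k in ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]:
    print(k, partition(k, 3))   # keys sharing the first two fields get the same partition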


5.3 Comparator class

Hadoop has a class, KeyFieldBasedComparator, that is useful in many applications. For example,

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.comparator.options=-k2,2nr \
    -D mapred.reduce.tasks=12

Note: in -k2,2nr, -k2,2 selects the 2nd field of the split key, n requests a numeric sort, and r reverses the sort order.

map output (keys)

11.12.1.2

11.14.2.3

11.11.4.1

11.12.1.1

11.14.2.2

Reducer output (sorted numerically on the second field, in reverse order):

11.14.2.3

11.14.2.2

11.12.1.2

11.12.1.1

11.11.4.1
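For illustration (not from the original text), the -k2,2nr ordering above corresponds to sorting on the second key field, numerically, in reverse, as in this Python sketch:

# Conceptual sketch of -k2,2nr: sort on the 2nd '.'-separated field,
# numerically (n), in reverse order (r).
keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
for k in sorted(keys, key=lambda x: int(x.split(".")[1]), reverse=True):
    print(k)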


5.4 The Hadoop Aggregate package (the -reducer aggregate option)

Hadoop provides a library package, Aggregate, with a special built-in reducer (selected with -reducer aggregate) and, optionally, a matching combiner. The package implements a set of simple aggregators, such as sums, maxima, minima and unique-value counts, computed over the values that share a key. To use it, the mapper emits lines of the form "function:key", followed by a tab and the value; the prefix LongValueSum:, for example, tells the aggregate reducer to sum the values for that key as long integers. The job is submitted with the usual streaming options, for example -reducer aggregate, -file for the mapper script, and -D mapred.reduce.tasks=12.

Python mapper aggregatorforkeycount.py:

#!/usr/bin/python
# Mapper for the Aggregate package: for each input line, emit
# "LongValueSum:<first field>\t1" so the aggregate reducer sums the
# count of occurrences per key.
import sys

def generateLongCountToken(id):
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    for line in sys.stdin:
        line = line.rstrip("\n")
        fields = line.split("\t")
        print(generateLongCountToken(fields[0]))

if __name__ == "__main__":
    main(sys.argv)

5.5 Field Selection

Hadoop has the class org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, which lets users process text data in the way the Unix cut utility does. The map function in this class treats each input key-value pair as a list of fields, using a user-defined field separator. The user selects one list of fields as the key of the map output and another list of fields as the value. The reduce function works in the same way. For example (a Python sketch of the equivalent field selection follows the command below),

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.data.field.separator=. \
    -D map.output.key.value.fields.spec=6,5,1-3:0- \
    -D reduce.output.key.value.fields.spec=0-2:5- \
    -D mapred.reduce.tasks=12
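As an illustration (not from the original text): in a fields spec such as 6,5,1-3:0-, the part before the colon selects the key fields and the part after it selects the value fields; entries can be single indices, closed ranges such as 1-3, or open-ended ranges such as 0- (field 0 and everything after it). A minimal Python sketch of this selection, assuming '.'-separated, 0-based fields:

# Illustrative interpreter for a field-selection spec such as "6,5,1-3".
def select_fields(fields, spec):
    selected = []
    for part in spec.split(","):
        if part.endswith("-"):                  # open-ended range, e.g. "0-"
            selected.extend(fields[int(part[:-1]):])
        elif "-" in part:                       # closed range, e.g. "1-3"
            start, end = part.split("-")
            selected.extend(fields[int(start):int(end) + 1])
        else:                                   # single field index
            selected.append(fields[int(part)])
    return selected

fields = "f0.f1.f2.f3.f4.f5.f6.f7".split(".")
key_spec, value_spec = "6,5,1-3:0-".split(":")
print(".".join(select_fields(fields, key_spec)))    # f6.f5.f1.f2.f3
print(".".join(select_fields(fields, value_spec)))  # f0.f1.f2.f3.f4.f5.f6.f7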

5.6 Controlling task attempt limits and the tolerated map task failure rate

-D mapred.map.max.attempts=3 \
-D mapred.reduce.max.attempts=3 \
-D mapred.max.map.failures.percent=1

The first two options limit the number of attempts per map and reduce task; mapred.max.map.failures.percent sets the percentage of map tasks that may fail before the whole job is declared failed.

5.7 Limiting the maximum line length read by the record reader (prevents a job from stalling on extremely long lines and failing to report heartbeats):

-D mapred.linerecordreader.maxlength=409600


