1. Hadoop Java API
The main programming language for Hadoop is Java, so the Java API is the most basic external programming interface.
2. Hadoop streaming
1. Overview
Hadoop Streaming is a utility shipped with Hadoop that makes it easy for non-Java users to write MapReduce programs: it allows any executable or script file to be used as the mapper and the reducer.
For example, ordinary shell commands can be used as the mapper and reducer (cat as the mapper, wc as the reducer):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myinputdirs \
    -output myoutputdir \
    -mapper cat \
    -reducer wc
2. Hadoop Streaming principle
The mapper and reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming tool creates a MapReduce job, submits it to the TaskTrackers, and monitors the execution of the whole job.
If an executable or script file is specified as the mapper, each mapper task launches that file as a separate process when the task starts. While the mapper task runs, it splits its input into lines and feeds each line to the standard input of that process. At the same time, the mapper collects the standard output of the process and converts each line it receives into a key/value pair, which becomes the output of the mapper. By default, the part of a line up to the first tab is the key and the rest of the line (excluding the tab) is the value; if the line contains no tab, the whole line is the key and the value is null.
The reducer works in the same way.
This is the basic communication protocol between the MapReduce framework and the Streaming mapper/reducer.
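To make this protocol concrete, here is a small word-count mapper sketch. Since Streaming accepts any executable, the language is incidental (C++ here) and the file name is illustrative; it reads lines from standard input and writes one tab-separated key/value pair per word to standard output:
// wc_mapper.cpp: a minimal Streaming mapper sketch (illustrative)
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::string line;
  while (std::getline(std::cin, line)) {      // one input record per line
    std::istringstream tokens(line);
    std::string word;
    while (tokens >> word) {
      // key = word, value = 1, separated by a tab (the default separator)
      std::cout << word << "\t" << 1 << "\n";
    }
  }
  return 0;
}
Compiled to an executable, it could be passed to Streaming with -mapper (and shipped to the cluster with -file) in place of cat in the example above.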
3. Hadoop Streaming usage
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]
Options
(1) -input: input file path
(2) -output: output file path
(3) -mapper: user-written mapper program; can be an executable file or a script
(4) -reducer: user-written reducer program; can be an executable file or a script
(5) -file: files to package with the submitted job, such as input files needed by the mapper or reducer (configuration files, dictionaries, and so on)
(6) -partitioner: user-defined Partitioner program
(7) -combiner: user-defined Combiner program (must be implemented in Java)
(8) -D: job properties (formerly -jobconf), in particular:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for the map task's input/output data; the default is \t
4) stream.num.map.output.key.fields: number of fields in a map task output record that make up the key
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: field separator for the reduce task's input/output data; the default is \t
6) stream.num.reduce.output.key.fields: number of fields in a reduce task output record that make up the key
(A sample command using these -D properties is shown after this list.)
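For example, a Streaming job that runs two reduce tasks and uses a comma as the map output field separator could be submitted as follows, reusing the cat/wc example from above (directories are illustrative):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -D stream.map.output.field.separator=, \
    -D stream.num.map.output.key.fields=1 \
    -input myinputdirs \
    -output myoutputdir \
    -mapper cat \
    -reducer wc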
In addition, Hadoop itself comes with some handy mappers and reducers:
(1) Hadoop Aggregate
Aggregate provides a special reducer class and a special combiner class, plus a series of "aggregators" (such as "sum", "max", "min") that aggregate a sequence of values. Users can define a mapper plug-in class that produces an "aggregatable item" for each key/value pair fed to the mapper; the combiner/reducer then combines these aggregatable items with the appropriate aggregator. To use Aggregate, you only need to specify "-reducer aggregate".
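A minimal invocation might look like the following; myAggMapper stands for a hypothetical user executable whose output keys are prefixed with the name of the desired aggregator (for example, lines of the form LongValueSum:word<TAB>1), which is the convention the aggregate reducer expects in the standard examples:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myinputdirs \
    -output myoutputdir \
    -mapper myAggMapper \
    -reducer aggregate \
    -file myAggMapper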
(2) Field selection (similar to 'cut' in Unix)
Hadoop's utility class org.apache.hadoop.mapred.lib.FieldSelectionMapReduce helps users process text data efficiently, much like the "cut" utility in Unix. The map function in this class treats each input key/value pair as a list of fields. The user can specify the field separator (tab by default) and select any segment of the field list (made up of one or more fields) as the key or the value of the map output. Similarly, the reduce function in this class treats the input key/value pairs as a list of fields, and the user can select any segment as the key or the value of the reduce output.
For advanced Hadoop Streaming programming techniques, refer to: Hadoop Streaming advanced programming, Hadoop programming examples.
3. Hadoop Pipes
It is a toolkit designed to make it easy for C++ users to write MapReduce programs.
Hadoop Pipes allows C++ programmers to write MapReduce programs in which the five components RecordReader, Mapper, Partitioner, Reducer, and RecordWriter can be freely mixed between C++ and Java implementations.
1. What is Hadoop pipes?
Hadoop Pipes lets users write MapReduce programs in C++. Its main approach is to put the C++ code containing the application logic in a separate process and let the Java code communicate with the C++ code through a socket. To a large extent this is similar to Hadoop Streaming; the difference is the communication channel: Streaming uses standard input/output, while Pipes uses a socket.
The class org.apache.hadoop.mapred.pipes.Submitter provides a public static method for submitting a job, which wraps the job in a JobConf object, and a main method that receives the application, an optional configuration file, the input and output directories, and so on. The command-line interface (CLI) of this main method is used as follows:
bin/hadoop pipes \
    [-input inputDir] \                           # input directory
    [-output outputDir] \                         # output directory
    [-jar applicationJarFile] \                   # application jar file
    [-inputformat class] \                        # Java InputFormat class
    [-map class] \                                # Java Mapper class
    [-partitioner class] \                        # Java Partitioner class
    [-reduce class] \                             # Java Reducer class
    [-writer class] \                             # Java RecordWriter class
    [-program program url] \                      # C++ executable program
    [-conf configuration file] \                  # XML configuration file
    [-D property=value] \                         # set a JobConf property
    [-fs local|namenode:port] \                   # specify the namenode
    [-jt local|jobtracker:port] \                 # specify the jobtracker
    [-files comma separated list of files] \      # files already uploaded to HDFS; they can be opened as if local
    [-libjars comma separated list of jars] \     # jar files to add to the classpath
    [-archives comma separated list of archives]  # archives already uploaded to HDFS; they can be used directly in the program
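For example, a Pipes job that keeps the Java RecordReader and RecordWriter and runs a C++ executable already uploaded to HDFS might be submitted roughly as follows (all paths are illustrative):
bin/hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input myinputdirs \
    -output myoutputdir \
    -program /user/hadoop/bin/wordcount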
The remainder of this section introduces the design of Hadoop Pipes, including its architecture and design details.
2. Hadoop Pipes Design Architecture
The user submits the job through bin/hadoop pipes to the Submitter class in org.apache.hadoop.mapred.pipes, which first configures the job parameters (by calling the function setupPipesJob) and then submits the job to the Hadoop cluster through JobClient(conf).submitJob(conf).
Inside setupPipesJob, the Java code creates a server object using ServerSocket and then launches the C++ binary through ProcessBuilder. The C++ binary is in fact a socket client: it receives key/value data from the Java server, processes it (map, partition, reduce, and so on), and returns the results to the Java server, which writes them to HDFS or to disk.
3. Hadoop Pipes Design Details
Hadoop Pipes allows users to write five basic components in C++: Mapper, Reducer, Partitioner, Combiner, and RecordReader. Each of these five components can be written in either Java or C++. The following sections describe how each of them is executed; a short C++ sketch follows the walkthrough.
(1) Mapper
Pipes chooses the InputFormat according to the user's configuration. If the user wants to use a Java InputFormat (hadoop.pipes.java.recordreader=true), Hadoop uses the InputFormat supplied by the user (TextInputFormat by default). If the user instead uses a C++ RecordReader, the Java-side Pipes code reads each InputSplit and calls downlink.runMap(reporter.getInputSplit(), job.getNumReduceTasks(), isJavaInput), passing runMap(string _inputSplit, int _numReduces, bool pipedInput) to the C++ side over the socket.
On the C++ side, the RecordReader parses the whole InputSplit, obtains the data source (mainly the file path) and each key/value pair, and hands them to the map function for processing; the map results for each key/value pair are returned to the Java server through the emit(const string& key, const string& value) function.
(2) Partitioner
The results produced on the C++ side are passed back to the Java server through the emit(const string& key, const string& value) function so that the data can be written to disk. Inside emit, if the user has defined a custom Partitioner, Pipes uses it to determine which reduce task should handle the current key/value and calls partitionedOutput(int reduce, const string& key, const string& value) to pass the key/value to the corresponding reduce task.
(3) Reducer
The execution process of the reducer is basically the same as that of the mapper.
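As a sketch of what the user-written C++ side looks like, here is a minimal word-count example in the style of the classic Pipes samples, defining only a Mapper and a Reducer and relying on the Java RecordReader/RecordWriter; header paths and helper names follow those samples and may vary between Hadoop versions:
// wordcount.cpp: a minimal Hadoop Pipes word count (illustrative sketch)
#include <string>
#include <vector>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class WordCountMapper : public HadoopPipes::Mapper {
public:
  WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // each input value is one line of text; emit (word, "1") for every word
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    // sum the counts for one key and emit (word, total)
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // runTask drives the socket protocol with the Java side described above
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}
Compiled against the Pipes libraries and uploaded to HDFS, such a binary is what the -program option of bin/hadoop pipes points to.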
4. Summary
Hadoop Pipes gives C++ programmers a way to write MapReduce jobs. It uses sockets for communication between Java and C++, a principle similar to Thrift RPC; it might even be easier to implement something like Hadoop Pipes with Thrift.
Hadoop Pipes uses Java code to read and write data in HDFS and encapsulates the processing logic in C++. Data is transferred from Java to C++ through a socket, which adds data-transfer overhead, but performance may improve for computation-intensive jobs.
Reprinted from: http://dongxicheng.org/mapreduce/hadoop-streaming-programming/
http://dongxicheng.org/mapreduce/hadoop-pipes-programming/
http://dongxicheng.org/mapreduce/hadoop-pipes-architecture/