Hadoop Streaming is a multi-language programming tool provided by Hadoop that lets users write mappers and reducers for processing text data in their own programming languages, such as Python, PHP, or C#. Hadoop Streaming also provides configuration parameters that support processing multi-field text data; for an introduction to Hadoop Streaming programming, see my article "Hadoop Streaming Programming Instance". However, as Hadoop applications have become more widespread, users want Hadoop Streaming to go beyond text data and offer more powerful features, including the ability to handle binary data and support for components such as the combiner in multiple languages. With the release of Hadoop 2.x, these features have essentially all been implemented, and this article describes how to use Hadoop Streaming to handle binary-format files, including SequenceFile and HFile.
Note: The programs used in this article can be downloaded from Baidu Cloud: Hadoop-streaming-binary-examples.
Before diving into the details of the program, let me introduce the example used in this article. Suppose there is a SequenceFile that stores phone address-book information, where the key is a friend's name and the value is a structure or object describing that friend. For this purpose, this article uses Google's open-source Protocol Buffers serialization/deserialization framework. The Protocol Buffers structure is defined as follows:
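A minimal sketch of such a definition (the package, message, and field names here are illustrative, chosen to match the name/age/phone/address fields used later in this article):

// person.proto -- illustrative sketch of the address-book record
package example;
option java_outer_classname = "PersonProtos";  // so the Java sketches below compile

message Person {
  required string name    = 1;
  optional int32  age     = 2;
  optional string phone   = 3;
  optional string address = 4;
}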
The value stored in the SequenceFile is the serialized string of a Person object. This is typical binary data: it cannot be parsed by splitting on newline characters the way text data is, because each binary record may contain arbitrary characters, including line breaks.
Once we have this SequenceFile, we will use Hadoop Streaming to write the following MapReduce program: it has only a map task, which parses every friend record in the file and saves it to HDFS in the text format "name \t age,phone,address".
1. Preparing data
First, we need to prepare the SequenceFile data described above. The core code to generate the data is as follows:
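A minimal Java sketch of such a generator (it assumes the Person class generated from the Protocol Buffers definition above; the output path and sample values are illustrative):

// GenerateSequenceFile.java -- illustrative sketch of the data generator.
// Writes <Text name, BytesWritable(serialized Person)> records into a SequenceFile.
import example.PersonProtos.Person;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class GenerateSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/input111/friends.seq");   // illustrative output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class);
    try {
      Person p = Person.newBuilder()
          .setName("dongxicheng").setAge(28)
          .setPhone("12345678").setAddress("beijing").build();   // sample record
      // toByteArray() returns exactly the serialized bytes, so the BytesWritable
      // constructor sets size == capacity here (no trailing padding).
      writer.append(new Text(p.getName()), new BytesWritable(p.toByteArray()));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}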
Note that the value is stored as a BytesWritable, and it is very easy to make mistakes with this type. When you store a chunk of byte[] data in a BytesWritable, the data read back through BytesWritable.getBytes() is not necessarily the original data and may be much longer, because BytesWritable uses an automatic capacity-growth scheme: when you store data of length size, it may keep it in a buffer of length capacity (capacity > size), so the array obtained through BytesWritable.getBytes() carries redundant trailing bytes. If the content is a serialized Protocol Buffers string, it can then no longer be deserialized. You can call BytesWritable.setCapacity(value.getSize()) to strip the extra space at the end.
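For example, when reading such records back in Java before handing them to Protocol Buffers, the buffer can be trimmed first (an illustrative sketch, reusing the Person class assumed above):

// ReadSequenceFile.java -- illustrative sketch of the getBytes() pitfall and its fix.
// Usage: pass the path of the SequenceFile as the first argument.
import example.PersonProtos.Person;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path(args[0])));
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
      // getBytes() may return a buffer longer than the record (capacity > size),
      // which would break deserialization; trim the capacity down first.
      value.setCapacity(value.getLength());   // getLength() == the article's getSize()
      Person p = Person.parseFrom(value.getBytes());
      System.out.println(key + "\t" + p.getAge());
    }
    reader.close();
  }
}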
2. Using Hadoop Streaming to write a C++ program
To illustrate how Hadoop Streaming handles binary-format data, this article uses only C++ as an example; other languages can be handled with a similar design.
Let's start with a little theory. When the input data is in binary format, Hadoop Streaming encodes the input key and value and passes them to your Hadoop Streaming program over standard input. Two encoding formats are currently provided, rawbytes and typedbytes, and you can choose whichever format you want. The two formats are defined as follows (see the article "Hadoop Streaming Advanced Programming" for details):
rawbytes: the key and value are each represented as "4-byte length + raw bytes"
typedbytes: the key and value are each represented as "1-byte type + 4-byte length + raw bytes"
This article uses the first encoding format. Using this encoding means you cannot read one line at a time as you would with text data; instead you read a sequence of key and value records, where each key and each value consists of two parts: the first part is the length (4 bytes) and the second part is the byte content. For example, if your key is "dongxicheng" and your value is "goodman", the input passed to the Hadoop Streaming program is the 4-byte integer 11, then "dongxicheng", then the 4-byte integer 7, then "goodman". To handle this, we write the following mapper program to parse the data:
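A minimal sketch of such a mapper is shown below (part 1 of 2; it assumes the person.pb.h/person.pb.cc files generated from the Protocol Buffers definition above, and is not the original program from the download link):

// ProtoMapper.cpp (part 1 of 2) -- illustrative sketch of a rawbytes mapper.
// Reads <4-byte length, key bytes, 4-byte length, value bytes> records from
// standard input, deserializes each value as a Person, and prints one text
// line per record in the format name \t age,phone,address.
#include <cstdint>
#include <iostream>
#include <string>
#include "person.pb.h"   // generated from the Protocol Buffers definition above

// Helper functions (defined in the next listing): read a big-endian 4-byte
// length, then read that many raw bytes.
bool readLength(std::istream& in, int32_t& len);
bool readBytes(std::istream& in, int32_t len, std::string& buf);

int main() {
  std::string key, value;
  int32_t keyLen = 0, valueLen = 0;

  // Do not rely on !cin.eof() alone: check the result of every read so the
  // last record is not emitted a second time.
  while (readLength(std::cin, keyLen) && readBytes(std::cin, keyLen, key)
      && readLength(std::cin, valueLen) && readBytes(std::cin, valueLen, value)) {
    example::Person person;
    if (!person.ParseFromString(value)) {   // value holds the serialized Person
      continue;                             // skip records that fail to parse
    }
    std::cout << person.name() << "\t" << person.age() << ","
              << person.phone() << "," << person.address() << "\n";
  }
  return 0;
}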
The helper functions used above are implemented as follows:
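A sketch of the helpers (part 2 of 2; these go in the same ProtoMapper.cpp file as the mapper above):

// ProtoMapper.cpp (part 2 of 2) -- helper functions used by the mapper above.
// The 4-byte lengths arrive in big-endian (network) byte order, so the bytes
// are reassembled explicitly rather than read as a native int on x86.
bool readLength(std::istream& in, int32_t& len) {
  unsigned char buf[4];
  in.read(reinterpret_cast<char*>(buf), 4);
  if (in.gcount() != 4) {
    return false;               // end of input (or a truncated record)
  }
  len = (static_cast<int32_t>(buf[0]) << 24) | (static_cast<int32_t>(buf[1]) << 16)
      | (static_cast<int32_t>(buf[2]) << 8)  |  static_cast<int32_t>(buf[3]);
  return true;
}

bool readBytes(std::istream& in, int32_t len, std::string& buf) {
  if (len < 0) {
    return false;
  }
  buf.resize(len);
  if (len == 0) {
    return true;
  }
  in.read(&buf[0], len);
  return in.gcount() == len;    // false on a truncated record
}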
Note the following points about this program:
(1) Pay attention to byte order (endianness): the key and value lengths are written in big-endian order, so on a little-endian machine you need to flip the bytes of the length field when parsing it.
(2) Pay attention to the loop-termination condition: relying on !cin.eof() alone is not enough; doing so will cause the last record to be output one extra time.
(3) This program only runs on Linux; it will not run on Windows, because standard input (cin) on Windows does not directly support reading binary data, and you need to force it to reopen in binary mode before use, as in the sketch below.
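A sketch of the Windows-side fix, for reference (it uses the Microsoft CRT and is not needed on Linux):

// Windows-only sketch: switch stdin to binary mode before reading raw bytes,
// so the C runtime does not translate CR/LF or treat Ctrl-Z as end-of-file.
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#include <stdio.h>

static void setStdinBinary() {
  _setmode(_fileno(stdin), _O_BINARY);
}
#endif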
3. Program testing and operation
After the program is written, the first step is to compile the C++ program. Since the program needs to run on a multi-node Hadoop cluster, to avoid the hassle of deploying or distributing dynamic libraries, we compile it statically; this is a basic rule for writing C++ programs for Hadoop. To statically compile the above MapReduce program, Protocol Buffers must be installed with the following steps (pay particular attention to the first step):
./configure --disable-shared
make -j4
make install
Then use the following command to compile the program and generate the executable file ProtoMapper:
g++ -o ProtoMapper ProtoMapper.cpp person.pb.cc `pkg-config --cflags --static --libs protobuf` -lpthread
Before formally submitting the program to the Hadoop cluster, you need to test it locally. The local test script is as follows:
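A sketch of such a script (the Hadoop installation path, streaming jar location, and input/output directories are placeholders for your own environment):

#!/bin/bash
# run_local_test.sh -- illustrative local-mode test script; paths are placeholders.
HADOOP_HOME=/usr/local/hadoop
STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar

$HADOOP_HOME/bin/hadoop jar $STREAMING_JAR \
  -fs local -jt local \
  -D stream.map.input=rawbytes \
  -D mapred.reduce.tasks=0 \
  -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
  -input /tmp/input111 \
  -output /tmp/output111 \
  -mapper ./ProtoMapper \
  -file ./ProtoMapper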
Note the following points:
(1) Use stream.map.input to specify that the input data should be decoded in the rawbytes format
(2) Use the -jt and -fs parameters to set the program to run in local mode
(3) Use -inputformat to specify SequenceFileInputFormat as the input format
(4) Use mapred.reduce.tasks to set the number of reduce tasks to 0
Check whether the results in the local /tmp/output111 directory are correct. If they are, rewrite the script (remove the -fs and -jt parameters and change the input and output directories to HDFS directories) and run the program directly on Hadoop, as sketched below.
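For reference, a sketch of the cluster version of the script (the HDFS input and output directories are placeholders):

#!/bin/bash
# run_on_cluster.sh -- illustrative cluster version: -fs/-jt removed, HDFS paths used.
HADOOP_HOME=/usr/local/hadoop
STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar

$HADOOP_HOME/bin/hadoop jar $STREAMING_JAR \
  -D stream.map.input=rawbytes \
  -D mapred.reduce.tasks=0 \
  -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
  -input /user/yourname/input111 \
  -output /user/yourname/output111 \
  -mapper ./ProtoMapper \
  -file ./ProtoMapper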