In the first section of this chapter we took a brief look at the concept of MapReduce data serialization and saw that MapReduce is not friendly to the XML and JSON formats. This section, the third installment of the larger topic "Getting Started from Hadoop to Mastery", will show you how to work with two common formats, XML and JSON, in MapReduce and then analyze which data formats are best suited to big data processing in MapReduce.
What is the most appropriate data format for big data processing in MapReduce?
3.2.1 XML
Since its inception in 1998, XML has been used as a data format that is both machine and human readable. It became a common language for data exchange between systems, is used by many standards such as SOAP and RSS, and serves as an open data format for products such as Microsoft Office.
MapReduce and XML
MapReduce bundles InputFormats that work with text, but it has no built-in support for XML, which means native MapReduce is quite unfriendly to it. Processing a single XML file in parallel in MapReduce is tricky because the XML format contains no synchronization token marking the start of a record.
Problem
You want to use a large XML file in MapReduce and be able to split and process it in parallel.
Solution
Mahout's XmlInputFormat can be used to process XML files stored in HDFS with MapReduce. It reads records delimited by specific XML start and end tags; this technique also explains how to emit XML as MapReduce output.
MapReduce has no built-in support for XML, so we turn to another Apache project, Mahout, a machine learning system that provides an XmlInputFormat. To see XmlInputFormat in action, you can write a MapReduce job that uses it to read the property names and values from a Hadoop configuration file stored in HDFS.
The first step is to configure the job:
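A minimal sketch of such a driver, assuming the Hadoop 2 (org.apache.hadoop.mapreduce) API and Mahout's XmlInputFormat on the classpath, might look like the following; the class names XmlMapReduceJob and XmlPropertyMapper are illustrative, while xmlinput.start and xmlinput.end are the property names Mahout's XmlInputFormat reads.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlMapReduceJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell XmlInputFormat which start and end tags delimit one record.
    // Here we pull <property>...</property> blocks out of a Hadoop config file.
    conf.set("xmlinput.start", "<property>");
    conf.set("xmlinput.end", "</property>");

    Job job = Job.getInstance(conf, "xml-property-extract");
    job.setJarByClass(XmlMapReduceJob.class);
    job.setMapperClass(XmlPropertyMapper.class); // sketched in the mapper listing below
    job.setNumReduceTasks(0);                    // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // The package of XmlInputFormat varies across Mahout releases
    // (e.g. org.apache.mahout.classifier.bayes or org.apache.mahout.text.wikipedia);
    // adjust the reference to match the version on your classpath.
    job.setInputFormatClass(org.apache.mahout.classifier.bayes.XmlInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```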
Mahout's XML input format is fairly rudimentary: you must give it the exact start and end XML tags to search for in the file. It splits the file (and extracts the records) using the following approach:
Files are divided into discrete splits along HDFS block boundaries for data locality.
Each map task operates on a specific input split; it seeks to the start of its split and then keeps reading the file until it hits the first xmlinput.start tag.
It then repeatedly emits the content between xmlinput.start and xmlinput.end until it passes the end of the input split.
Next, you need to write a mapper to consume Mahout's XML input format. The XML element is already supplied in Text form, so you only need an XML parser to extract its content.
Listing 3.1 Extracting content with the Java StAX parser
The map receives a Text instance containing a string representation of the data between the start and end tags. The code uses Java's built-in Streaming API for XML (StAX) parser to extract the key and value of each property and emit them.
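A minimal sketch of such a mapper, assuming the <property> fragments contain <name> and <value> child elements as in a Hadoop configuration file, might look like this (the class name XmlPropertyMapper matches the driver sketch above):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses one <property>...</property> fragment per map() call and emits the
// property name as the key and the property value as the value.
public class XmlPropertyMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
          new ByteArrayInputStream(value.toString().getBytes("UTF-8")));
      String propertyName = "";
      String propertyValue = "";
      String currentElement = "";
      while (reader.hasNext()) {
        switch (reader.next()) {
          case XMLStreamConstants.START_ELEMENT:
            currentElement = reader.getLocalName();
            break;
          case XMLStreamConstants.CHARACTERS:
            if ("name".equals(currentElement)) {
              propertyName += reader.getText();
            } else if ("value".equals(currentElement)) {
              propertyValue += reader.getText();
            }
            break;
          case XMLStreamConstants.END_ELEMENT:
            currentElement = "";
            break;
        }
      }
      reader.close();
      context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
      // A malformed fragment is skipped rather than failing the whole task.
      System.err.println("Skipping malformed XML record: " + e.getMessage());
    }
  }
}
```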
If you run this MapReduce job against Cloudera's core-site.xml and display the output with the HDFS cat command, you will see the following:
This output shows that XML has been successfully used as an input serialization format for MapReduce. Better still, the approach can handle enormous XML files, because the input format supports splitting the XML.
Writing XML
Now that we can read XML, the next problem is how to write it. In the reducer, callbacks occur before and after the main reduce method is called, and these can be used to emit the root start and end tags, as shown below.
Listing 3.2 A reducer that emits start and end tags
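A minimal sketch of the idea, assuming a NullWritable output value type and element names that mirror a Hadoop configuration file, might look like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Brackets the output with a root element: the opening tag is written in
// setup(), the closing tag in cleanup(), and each key/value pair is emitted
// as a nested <property> element in between.
public class XmlOutputReducer extends Reducer<Text, Text, Text, NullWritable> {

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), NullWritable.get());
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      String element =
          "<property><name>" + key + "</name><value>" + value + "</value></property>";
      context.write(new Text(element), NullWritable.get());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), NullWritable.get());
  }
}
```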
This logic could also be embedded in an OutputFormat.
Pig
If you want to work with XML in Pig, the Piggy Bank library (a repository of user-contributed Pig code) contains an XMLLoader. It works much like this technique, capturing everything between a start and end tag and supplying it as a single bytearray field in a Pig tuple.
Hive
There is currently no built-in way to work with XML in Hive; you would have to write a custom SerDe.
Summary
Mahout's XmlInputFormat helps you work with XML, but it relies on exact string matching of the start and end element names. This approach is unusable if the element tag can contain attributes with variable values, if you cannot control how the elements are generated, or if XML namespace qualifiers might be used.
If you have control over the input XML, you can simplify the exercise by putting a single XML element on each line. That lets you use the built-in text-based MapReduce input formats (such as TextInputFormat), which treat each line as a record and split the file accordingly.
Another option worth considering is a preprocessing step that transforms the original XML into one XML element per line, or converts it to an entirely different data format such as SequenceFile or Avro, both of which solve the splitting problem for you.
Now that you've learned how to work with XML, let's tackle another popular serialization format: JSON.
3.2.2 JSON
JSON shares XML's machine- and human-readable traits and has been around since the early 2000s. It is less verbose than XML, but it lacks XML's rich typing and validation capabilities.
Imagine you have some code that downloads JSON data from a streaming REST service and writes a file to HDFS every hour. Because the volume of data downloaded is large, each generated file is several gigabytes in size.
Now suppose you are asked to write a MapReduce job that takes these large JSON files as input. The problem splits into two parts: first, MapReduce has no InputFormat that works with JSON; second, how do you split JSON at all?
Figure 3.7 illustrates the problem of splitting JSON. Imagine that MapReduce has created a split, as shown in the figure. The map task that operates on this input split must search within it to determine where the next record starts. For formats such as JSON and XML, this is challenging because there is no synchronization token or any other marker identifying the beginning of a record.
JSON is harder to split into distinct segments than a format such as XML, because JSON has no token (like XML's closing tag) to mark the beginning or end of a record.
Problem
You want to use JSON as input to MapReduce and ensure that the input JSON files can be partitioned for concurrent reads.
Solution
Elephant Bird's LzoJsonInputFormat can be used as the basis for creating your own input format class that works with JSON elements; the approach handles files that contain many lines of JSON, one object per line.
Figure 3.7 Example of a problem using JSON and multiple input splits
Discussion
Elephant Bird (https://github.com/kevinweil/elephant-bird) is an open source project that contains useful utilities for working with LZOP compression. It has an LzoJsonInputFormat that can read JSON, although it requires the input files to be LZOP-compressed. However, you can use the Elephant Bird code as a template for your own JSON InputFormat that has no LZOP compression requirement.
This solution assumes that each JSON record is on a separate line. The JSON input format is simple enough that it does little more than construct and return a JSON record reader, so we'll skip that code. The record reader emits LongWritable, MapWritable key/value pairs to the mapper, where the MapWritable is a map from JSON element names to their values.
Let's take a look at how the record reader works. It delegates line reading to LineRecordReader, a built-in MapReduce reader that emits one record per line. To convert each line into a MapWritable, the reader parses the line into a JSON object using the json-simple parser, then iterates over the keys in the JSON object and puts each key and its associated value into the MapWritable. The mapper receives the JSON data as LongWritable, MapWritable pairs and can process it accordingly.
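A minimal sketch of such a record reader, written against the new MapReduce API and assuming the json-simple library (org.json.simple) is available, might look like this; the class name JsonRecordReader is illustrative, and an accompanying input format would do little more than return an instance of this reader:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

// A record reader in the spirit of Elephant Bird's LZOP JSON reader, minus the
// LZOP requirement: it delegates line reading to LineRecordReader and parses
// each line into a MapWritable using the json-simple parser.
public class JsonRecordReader extends RecordReader<LongWritable, MapWritable> {

  private final LineRecordReader lineReader = new LineRecordReader();
  private final MapWritable currentValue = new MapWritable();
  private final JSONParser parser = new JSONParser();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    lineReader.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    while (lineReader.nextKeyValue()) {
      if (decodeLine(lineReader.getCurrentValue())) {
        return true;   // parsed a valid JSON object on this line
      }
      // otherwise skip the malformed line and keep reading
    }
    return false;
  }

  // Parse one line of JSON and copy its top-level members into the MapWritable.
  private boolean decodeLine(Text line) {
    try {
      JSONObject json = (JSONObject) parser.parse(line.toString());
      currentValue.clear();
      for (Object key : json.keySet()) {
        Object value = json.get(key);
        currentValue.put(new Text(key.toString()),
            new Text(value == null ? "" : value.toString()));
      }
      return true;
    } catch (Exception e) {
      return false;    // not a valid single-line JSON object
    }
  }

  @Override
  public LongWritable getCurrentKey() { return lineReader.getCurrentKey(); }

  @Override
  public MapWritable getCurrentValue() { return currentValue; }

  @Override
  public float getProgress() throws IOException { return lineReader.getProgress(); }

  @Override
  public void close() throws IOException { lineReader.close(); }
}
```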
This technique assumes one JSON object per line; the following shows an example of a JSON input file in that layout:
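A hypothetical input in the one-object-per-line layout might look like this (the field names and values are purely illustrative, not the file from the original example):

```
{"name": "alice", "age": "35", "city": "San Francisco"}
{"name": "bob", "age": "41", "city": "New York"}
```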
Now copy the JSON file into HDFS and run the MapReduce code; the job writes out each JSON key/value pair it receives as output.
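A minimal mapper that produces that kind of output simply iterates over the MapWritable and writes each member out as a Text key/value pair; the class name JsonMapper is illustrative:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every JSON member it receives as a key/value pair, which with the
// default TextOutputFormat produces one "name<TAB>value" line per member.
public class JsonMapper extends Mapper<LongWritable, MapWritable, Text, Text> {

  @Override
  protected void map(LongWritable key, MapWritable value, Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<Writable, Writable> entry : value.entrySet()) {
      context.write(new Text(entry.getKey().toString()),
                    new Text(entry.getValue().toString()));
    }
  }
}
```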
Writing JSON
As in section 3.2.1, the approach used to write XML can also be used to write JSON.
Pig
Elephant Bird contains a JsonLoader and an LzoJsonLoader that you can use to work with JSON in Pig; both operate on line-based JSON. Each Pig tuple contains a chararray field for every JSON element in the line.
Hive
Hive contains a DelimitedJSONSerDe class that can serialize JSON, but unfortunately it cannot deserialize it, so you cannot use this SerDe to load data into Hive.
Summary
This solution assumes that the JSON input is structured as one JSON object per line. How, then, do you handle JSON objects that span multiple lines? A project on GitHub (https://github.com/alexholmes/json-mapreduce) can run multiple input splits over a single JSON file; it searches for a specific JSON member and retrieves the containing object.
You can also look at a Google project named Hive-json-serde, which supports both serialization and deserialization.
As you can see, working with XML and JSON in MapReduce is awkward and imposes rigid requirements on how the data is laid out. MapReduce support for these two formats is also complex and error prone, because neither lends itself naturally to splitting. Clearly you should look at alternative file formats that have built-in support for splitting.
The next step is to look at more sophisticated file formats that are better suited to MapReduce, such as Avro and SequenceFile.
3.3 Big data serialization formats
Unstructured text works fine when you are dealing with scalar or tabular data. Semi-structured text formats such as XML and JSON can model more complex data structures that include composite fields or hierarchical data. But when you are handling large volumes of data, you want a serialization format with a compact serialized form that natively supports splitting and has schema evolution capabilities.
In this section we'll compare the serialization formats best suited to big data processing in MapReduce, and in the sections that follow we'll see how to use them with MapReduce.
3.3.1 Comparing SequenceFile, Protocol Buffers, Thrift, and Avro
Based on experience, the following characteristics are important when choosing a data serialization format:
Code generation - some serialization formats come with code-generation libraries that produce rich objects, making it easier to interact with the data. The generated code also adds benefits such as type safety, ensuring that consumers and producers work with the correct data types.
Schema evolution - data models evolve over time, and it is important that the data format supports modifying the data model. Schema evolution lets you add, modify, and in some cases delete attributes while providing both backward and forward compatibility for reads and writes.
Language support - you may need to access your data from a variety of programming languages, so it is important that the data format is supported in the mainstream languages.
Data compression - compression matters because you may be working with large volumes of data; the ideal data format can compress and decompress data internally on write and read. A data format without compression support is a major headache for programmers, because it means compression and decompression must be managed as part of the data pipeline (as with text-based file formats).
Splittability - newer data formats let multiple parallel readers read and process different chunks of a large file. It is crucial that the file format contain synchronization markers (so a reader can seek to a random position and scan forward to the beginning of the next record).
Support for MapReduce and the Hadoop ecosystem - the data format you choose must support MapReduce and the other key projects in the Hadoop ecosystem, such as Hive. Without this support, you will be responsible for writing the code that makes the file format usable in these systems.
Table 3.1 compares the popular data serialization frameworks to see how they stack up against each other. The discussion that follows provides additional background on these technologies.
Table 3.1 Functional comparisons of data serialization frameworks
Let's take a look at these formats in more detail.
SequenceFile
The SequenceFile format was created for use with MapReduce, Pig, and Hive, so it integrates well with all of those tools. Its weaknesses are the lack of code generation and versioning support, and limited language support.
Protocol Buffers
Protocol Buffers has been used heavily by Google for interoperability; its strengths are its versioning support and compact binary format. Its drawback is the lack of support in MapReduce (or any third-party software) for reading files generated by Protocol Buffers serialization. However, Elephant Bird can work with Protocol Buffers serialization inside its container file format.
Thrift
Thrift was developed internally at Facebook as a data serialization and RPC framework. Its native data serialization format is not supported by MapReduce, although it can support several different wire-level data representations, including JSON and various binary encodings. Thrift also includes an RPC layer with various types of servers. This chapter ignores the RPC capabilities and focuses on the data serialization.
Avro
The Avro format was created by Doug Cutting and was designed to help address the shortcomings of SequenceFile.
Parquet
Parquet is a columnar file format with rich support across the Hadoop ecosystem, and it can work with Avro, Protocol Buffers, and Thrift. Although Parquet is column-oriented, don't expect one data file per column: Parquet keeps all the data for a row within the same data file, which ensures that all the columns of a row are available when the row is processed on a single node. Parquet also sets the HDFS block size and the maximum data file size to 1 GB so that I/O and network transfers operate on large chunks of data.
Based on the evaluation criteria above, Avro appears to be the best-suited data serialization framework for Hadoop. SequenceFile comes in a close second because of its intrinsic compatibility with Hadoop (it was designed for Hadoop).
You can look at the jvm-serializers project on GitHub, which runs various benchmarks comparing formats by serialization and deserialization time. It includes benchmarks for Avro, Protocol Buffers, and Thrift, along with many other frameworks.
Now that we have surveyed the various data serialization frameworks, we will discuss each of these formats in detail in the sections that follow.