In Hadoop, data processing is handled through MapReduce jobs. A job consists of basic configuration information, such as the paths of the input files and the output folder, and is carried out as a series of tasks by the MapReduce layer of Hadoop. These tasks are responsible for executing the map and reduce functions that transform the input data into the output results.
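To make that concrete, here is a minimal, hypothetical sketch of how a job definition might look with the Microsoft .NET SDK for Hadoop discussed later in this article. The paths are placeholders, the class names TabDelimitedMapper and SumReducer refer to the hypothetical mapper and reducer sketched further down, and the HadoopJobConfiguration, Hadoop.Connect, and MapReduceJob.Execute members are assumed to behave as they did for the SDK used in the last article.

    using Microsoft.Hadoop.MapReduce;

    public class JobRunner
    {
        public static void Main(string[] args)
        {
            // Basic job configuration: where the input files live and where
            // the job's output files should be written (hypothetical paths).
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/demo";
            config.OutputFolder = "output/demo";

            // TabDelimitedMapper and SumReducer are the hypothetical mapper
            // and reducer classes sketched later in this article.
            var hadoop = Hadoop.Connect();
            hadoop.MapReduceJob.Execute<TabDelimitedMapper, SumReducer>(config);
        }
    }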
To illustrate how MapReduce works, consider a simple input file of tab-delimited records (shown on the left in the following illustration). For the purpose of illustration, the rows are labeled A to F.
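As a purely hypothetical stand-in for that data, imagine that each row holds a key field and a numeric value: row A is K1 and 10, row B is K2 and 20, row C is K1 and 30, row D is K3 and 40, row E is K2 and 50, and row F is K3 and 60. These made-up values are what the code sketches later in this article work against.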
When the MapReduce job is launched against this input file, MapReduce decides through its internal logic that records A to C should be handled by one map task (map task 0) and that records D to F should be handled by another (map task 1).
When the MapReduce job was defined, a mapper class was identified. The mapper class is a class that contains a map function. Each map task runs the map function once for each of its input records, so map task 0 executes the map function three times: once for record A, again for record B, and again for record C. Similarly, map task 1 executes the map function against each of its three records, D to F.
The map function is written to accept a single row of input data and is expected to emit output in the form of key and value combinations. A typical map function emits a single key and value derived from the input data, but a map function can generate zero or more key and value combinations depending on the input. In our example, there is a one-to-one correlation between the input rows and the key/value combinations emitted by the map function.
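As a concrete illustration, here is a minimal sketch of such a mapper class, written against the Microsoft .NET SDK for Hadoop used in the last article. The class name TabDelimitedMapper and the assumed record layout (a key field, a tab, and a value field) are hypothetical, and the MapperBase, MapperContext, and EmitKeyValue members are assumed to match that SDK.

    using Microsoft.Hadoop.MapReduce;

    // Hypothetical mapper: each call receives one line of input, splits it on
    // the tab delimiter, and emits the first field as the key and the second
    // field as the value.
    public class TabDelimitedMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            string[] fields = inputLine.Split('\t');
            if (fields.Length >= 2)
            {
                // One key/value combination per input record, matching the
                // one-to-one correlation described above.
                context.EmitKeyValue(fields[0], fields[1]);
            }
        }
    }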
So what should the key and value emitted by the map function look like? In our example, the keys are identified as K1, K2, and K3, each with various values associated with it. Exactly which key and value a given map function emits is entirely up to the developer of the mapper class, but typically the key is something the data can be grouped on, and the value is some piece of the input data, possibly a delimited list of values, an XML or JSON fragment, or something more complex, that will be useful to the reduce function. If this is not 100% clear yet, it hopefully will be once we cover the reduce function.
Once the map tasks complete their work, the data from the various calls to the map function is sorted and shuffled so that all the values associated with a given key are arranged together. Reduce tasks are then generated, and each key and its values are handed to a reduce task for processing. In our example, MapReduce has generated two reduce tasks, handing the values for K1 and K2 to reduce task 0 and the values for K3 to reduce task 1.
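Continuing the hypothetical data from earlier, after the sort and shuffle reduce task 0 would see K1 with the values 10 and 30 and K2 with the values 20 and 50, while reduce task 1 would see K3 with the values 40 and 60.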
As part of the MapReduce job definition, a reducer class is identified. This class contains a reduce function, which the reduce task calls with a key and the array of values associated with that key. The reduce function typically emits a summary value derived from the array of values for a given key, but as before the developer retains considerable flexibility over what the output is and how it is derived.
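Here is a matching sketch of a reducer class, again written against the .NET SDK from the last article. The class name SumReducer is hypothetical, summing numeric values is just one possible summary, and the ReducerCombinerBase and ReducerCombinerContext types are assumed to match that SDK.

    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    // Hypothetical reducer: receives one key plus all of the values grouped
    // under it, and emits a single summary value (here, their sum).
    public class SumReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values,
                                    ReducerCombinerContext context)
        {
            long total = values.Sum(v => long.Parse(v));
            context.EmitKeyValue(key, total.ToString());
        }
    }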
The output of the reduce function for each reduce task is written to a file in the output folder identified as part of the MapReduce job configuration. One file is associated with each reduce task, so in our example, with two reduce tasks, two output files are generated. These files can be accessed individually, but more typically they are combined into a single output file using the getmerge command (at the command line) or similar functionality.
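For example, assuming the hypothetical output folder used earlier, the per-task files could be merged into a single local file from the Hadoop command line with:

    hadoop fs -getmerge output/demo merged-output.txt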
If this explanation makes sense so far, let's add a bit of complexity to the story. Not every job contains both a mapper class and a reducer class. At a minimum, a MapReduce job must have a mapper class, and if all the data processing work can be handled through the map function, that is the end of the road. In that case, the output files are aligned with the map tasks and are found, similarly named, in the output folder identified in the job configuration.
In addition, a MapReduce job can accept a combiner class. A combiner class is defined much like a reducer class (it has a reduce function), but it runs against the data associated with a single map task. The idea behind the combiner class is that some data can be "reduced" (summarized) before it is sorted and shuffled. Sorting and shuffling the data to get it to the reduce tasks is an expensive operation, so the combiner class can act as an optimizer that cuts down the amount of data moved between tasks. Combiner classes are by no means required, and you should consider using one when you absolutely must squeeze extra performance out of your MapReduce job.
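To show how little a combiner differs from a reducer, here is a hypothetical combiner for the summing example above. It has the same Reduce signature, but it only pre-sums the values produced by a single map task; the assumptions here are that the SDK exposes combiners through the same ReducerCombinerBase base class and that the summary operation (a sum) is safe to apply in two stages.

    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    // Hypothetical combiner: runs against the output of one map task before
    // the sort and shuffle, emitting a partial sum per key so that less data
    // has to be moved to the reduce tasks.
    public class SumCombiner : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values,
                                    ReducerCombinerContext context)
        {
            // The reducer later adds these partial sums together to get the
            // final total for each key.
            context.EmitKeyValue(key, values.Sum(v => long.Parse(v)).ToString());
        }
    }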
In the last article, we built a simple MapReduce job using C#. But Hadoop is a Java-based platform, so how are we able to use a .NET language to implement MapReduce jobs? The answer is Hadoop Streaming.
In short, Hadoop Streaming is a feature that allows any executable to act as the mapper and/or the reducer in a MapReduce job. With this feature, MapReduce exchanges data with an executable through its standard input and output (that is, stdin and stdout). You can write and compile the executables yourself, load them into the cluster, and launch the streaming job from the command line, or you can let the .NET SDK handle the setup and execution of the streaming job on your behalf, which is exactly what happened in the last article.
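To make that contract concrete, here is a minimal sketch of what a stand-alone streaming mapper executable could look like in C#. Everything about it (the class name, the assumed tab-delimited input) is hypothetical, and when the .NET SDK is used, as in the last article, you do not write this plumbing yourself.

    using System;

    // Hypothetical streaming mapper: reads records from stdin one line at a
    // time and writes key<TAB>value pairs to stdout, which is how Hadoop
    // Streaming exchanges data with an executable.
    public class StreamingMapper
    {
        public static void Main(string[] args)
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                string[] fields = line.Split('\t');
                if (fields.Length >= 2)
                {
                    Console.WriteLine("{0}\t{1}", fields[0], fields[1]);
                }
            }
        }
    }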
As flexible as Hadoop Streaming is, it has some limitations. First, the executable must be able to run on the data nodes in the cluster. Using the .NET SDK means taking a dependency on the .NET 4.0 Framework (which is already present on HDInsight data nodes), so unless you are deploying Hadoop on Windows you cannot use a .NET language to write your MapReduce jobs.
Next, Hadoop Streaming works with a limited set of file formats. For Hadoop Streaming on HDInsight, it is limited by default to files containing line-oriented text (with carriage return/line feed separators) and JSON-formatted files. If you need to process files in other formats, you can instead pass in the names of those files and parse them yourself, relying on the flexibility of the map method and the .NET Framework.
Finally, Hadoop Streaming requires the data flowing out of your mappers and reducers to be text in a key + TAB + value format. If you write your MapReduce job using the .NET SDK (as we did in the last article), the SDK handles this last constraint for you.
If you would like more information about Hadoop Streaming, check out this resource.