Xin Xing's Notes on Hadoop: The Definitive Guide, Part 1: MapReduce and Hadoop MapReduce

MapReduce is a programming model for data processing. The model itself is simple, but writing useful programs with it is less so. Hadoop can run MapReduce programs written in a variety of languages. MapReduce programs are inherently parallel, so large-scale data analysis tasks can be carried out by anyone with enough machines at their disposal. The strength of MapReduce lies in processing large datasets.

A MapReduce job is divided into two stages: the map stage and the reduce stage. Each stage takes key/value pairs as input and output, and the programmer chooses their types. The programmer also defines two functions: the map function and the reduce function.
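
As a quick illustration of how the types flow through the two stages (the K and V names below are placeholders for whatever key and value types the programmer chooses, not identifiers from the Hadoop API):

    map:    (K1, V1)        -> list(K2, V2)
    reduce: (K2, list(V2))  -> list(K3, V3)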

The input to the map stage is the raw data to be processed; we can choose a text format as the input format. The map function serves as a data preparation stage: it prepares the data so that the reduce function can continue working on it. The output of the map function is processed by the MapReduce framework, which sorts and groups the key/value pairs by key before sending them to the reduce function. To implement a specific job, we need three things: a map function, a reduce function, and some code to run the job.

The map function is represented by the Mapper interface, which declares a map() method. The Mapper interface is a generic type with four type parameters that specify the input key, input value, output key, and output value types of the map function. Rather than using Java's built-in types directly, Hadoop provides its own set of basic types, optimized for network serialization and transfer, which are found in the org.apache.hadoop.io package. The map() method is passed a key and a value, and it is also given an OutputCollector instance to write its output to.
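
As a minimal sketch of what such a Mapper looks like with the old org.apache.hadoop.mapred API (the word-count logic and the class name WordCountMapper are hypothetical, chosen only to show the four type parameters, the map() method, and the OutputCollector):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Input key/value: line offset and line text; output key/value: word and count.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);  // emit an intermediate key/value pair
            }
        }
    }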

The reduce function is defined in a similar way, using the Reducer interface. It again has four formal type parameters that specify its input and output types. The input types of the reduce function must match the output types of the map function.
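
A matching Reducer sketch under the same assumptions (the hypothetical WordCountReducer continues the word-count example); note that its input types, Text and IntWritable, match the mapper's output types:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up the counts for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }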

A JobConf object forms the specification of the job and gives us control over how the whole job is run. When we run the job on a Hadoop cluster, we package the code into a JAR file, which Hadoop distributes around the cluster. Rather than explicitly specifying the name of the JAR file, we can simply pass a class to the JobConf constructor, and Hadoop will locate the relevant JAR by searching for the one containing this class.

Having constructed the JobConf object, we next specify the input and output paths. Calling the static addInputPath() method on the FileInputFormat class defines the path of the input data, which may be a single file, a directory (in which case all files in the directory are used as input), or a set of files matching a specified file pattern. As the method name suggests, addInputPath() can be called more than once to take input from multiple paths.

We call the static setOutputPath() method on the FileOutputFormat class to specify the output path, the directory to which the reduce function writes its output files. The directory should not exist before the job runs; otherwise Hadoop will report an error and refuse to run the job. This precaution is there to prevent data loss: it would be very annoying to accidentally overwrite the result of a long-running job.

We use setMapperClass() and setReducerClass() to specify the map and reduce types, and setOutputKeyClass() and setOutputValueClass() to control the output types of the map and reduce functions, which are often the same.

Having set the classes that define the map and reduce functions, we are ready to run the job. The static runJob() method on the JobClient class submits the job, waits for it to finish, and writes its progress to the console.
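
A driver sketch that pulls the preceding steps together: constructing the JobConf, setting the input and output paths, specifying the mapper and reducer classes and the output types, and finally calling runJob(). The class names are the hypothetical ones from the sketches above:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {

        public static void main(String[] args) throws IOException {
            if (args.length != 2) {
                System.err.println("Usage: WordCountDriver <input path> <output path>");
                System.exit(-1);
            }

            // Passing the driver class lets Hadoop locate the JAR that contains it.
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("word count");

            // The input may be a file, a directory, or a file pattern;
            // addInputPath() may be called more than once.
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            // The output directory must not already exist.
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // Output types of the map and reduce functions (the same here).
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Submit the job, wait for completion, and print progress to the console.
            JobClient.runJob(conf);
        }
    }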

If the first argument to the hadoop command is a class name, Hadoop automatically launches a JVM to run that class. Using the hadoop command is more convenient than using java directly, because it adds the Hadoop libraries and their dependencies to the classpath and also picks up the Hadoop configuration. We need to define a HADOOP_CLASSPATH environment variable that adds the path of the application's classes, and the hadoop script then runs our class for us.

To scale out, we need to store the data in a distributed filesystem, which is generally HDFS.
