"Graphics" distributed parallel programming with Hadoop (i)

Source: Internet
Author: User
Keywords: Hadoop

Hadoop is an open source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs, run them on a cluster of computers, and process massive amounts of data. This article introduces the basic concepts of the MapReduce computing model and distributed parallel computing, and then walks through installing and deploying Hadoop and its basic usage.

Introduction to Hadoop

Hadoop is an open source distributed parallel programming framework that runs on large clusters. Because distributed storage is essential to distributed programming, the framework also includes a distributed file system, HDFS (Hadoop Distributed File System). Hadoop may not be widely known yet; its latest version is only 0.16, which still seems a long way from 1.0. But it shares its lineage with two other open source projects, Nutch and Lucene, both created by Doug Cutting, and those are certainly famous. Lucene is a high-performance full-text search toolkit written in Java; it is not a complete application but a set of easy-to-use APIs. Countless software systems and web sites around the world use Lucene to implement full-text search. Doug Cutting later created Nutch (http://www.nutch.org), the first open source web search engine. Built on top of Lucene, Nutch adds a web crawler and other web-related functionality, plug-ins for parsing various document formats, and a distributed file system for storing data. After Nutch version 0.8.0, Doug Cutting split the distributed file system and the MapReduce implementation out of Nutch to form the new open source project Hadoop, and Nutch itself evolved into an open source search engine built on Lucene full-text search and the Hadoop distributed computing platform.

Based on Hadoop, you can easily write distributed parallel programs that process massive amounts of data and run them on a large cluster of hundreds of nodes. Hadoop appears destined for a bright future: "cloud computing" is currently a red-hot term, the world's major IT companies are investing in and promoting this new generation of computing model, and Hadoop is an important piece of base software in several of their cloud computing environments. Yahoo, for example, is backing the open source Hadoop platform in its competition with Google: besides funding the Hadoop development team, it is also developing Pig, a Hadoop-based open source project focused on analyzing massive data sets. Amazon has built on Hadoop to offer Amazon S3 (Amazon Simple Storage Service), a reliable, fast, and scalable network storage service, as well as the commercial cloud platform Amazon EC2 (Amazon Elastic Compute Cloud). In IBM's cloud computing project, the "Blue Cloud" plan, Hadoop is also an important piece of base software. Google and IBM are jointly promoting Hadoop-based cloud computing.

Embracing the change in programming model

Under Moore's Law, programmers in the past never had to worry about computers failing to keep up with software: roughly every 18 months, CPU clock speeds would rise and performance would double, so software could enjoy a free performance boost without changing a line of code. However, as transistor circuits have gradually approached their physical limits, Moore's Law began to break down around 2005, and we can no longer expect a single CPU to double in speed every 18 months and hand us ever faster computing for free. Chip makers such as Intel, AMD, and IBM have instead started exploring the performance potential of CPUs through multiple cores. The arrival of the multi-core era and the Internet era will fundamentally change how software is written: multi-threaded concurrent programming on multi-core processors and distributed parallel programming on large computer clusters will be the main ways to improve software performance in the future.

Many people believe this major change in programming will lead to a software concurrency crisis, because our traditional approach to software is essentially sequential execution over a single stream of data. Sequential execution fits human thinking habits well, but it is at odds with concurrent and parallel programming. Cluster-based distributed parallel programming lets software and data run simultaneously on many computers connected by a network, where each computer can be an ordinary PC. The biggest advantage of such a distributed parallel environment is that it is easy to add computers as new compute nodes and thereby obtain enormous computing power; it also has strong fault tolerance, since the failure of a batch of compute nodes does not prevent the computation from running or affect the correctness of its results. Google does exactly this: it uses a parallel programming model called MapReduce, running on top of a distributed file system called GFS, to provide search services to hundreds of millions of users around the world.

Hadoop implements Google's MapReduce programming model, provides an easy-to-use programming interface, and supplies its own distributed file system, HDFS. Unlike Google's system, Hadoop is open source, and anyone can use it for parallel programming. If the complexity of distributed parallel programming has been enough to intimidate ordinary programmers, the arrival of open source Hadoop dramatically lowers the bar. After reading this article, you will find that programming with Hadoop is very simple: without any experience in parallel development you can easily write distributed parallel programs, run them on hundreds of machines at once, and complete computations over massive data in a short time. You might think you could never have hundreds of machines to run your parallel programs on, but in fact, with the spread of "cloud computing", anyone can easily obtain such massive computing power. Amazon's cloud computing platform Amazon EC2, for example, already offers this kind of on-demand rental service; interested readers will learn more about it in the third part of this series.

Knowledge of distributed parallel programming will be essential for future programmers, and Hadoop is so simple and easy to use that you may already be impatient to try it. But this programming model is very different from traditional sequential programs, and a little background knowledge will help you better understand how Hadoop-based distributed parallel programs are written and how they run. This article therefore first introduces the MapReduce computing model, the distributed file system HDFS, and how Hadoop performs parallel computing, and then describes how to install and deploy the Hadoop framework and how to run Hadoop programs.

The MapReduce Computing Model

MapReduce is the core computing model at Google. It abstracts the complexity of a parallel computation running on a large cluster into two functions, Map and Reduce, a model that is surprisingly simple yet very powerful. The basic requirement for a data set (or task) to be suitable for MapReduce is that it can be decomposed into many small data sets, each of which can be processed completely in parallel.

Figure 1. The MapReduce computing process

Figure 1 illustrates how MapReduce handles a large data set. In short, a MapReduce computation decomposes the large data set into hundreds or thousands of small data sets; each small data set is processed by a node in the cluster (typically an ordinary computer), which produces intermediate results; these intermediate results are then merged by many nodes to form the final result.

The core of the computing model is the Map and Reduce functions, both implemented by the user. Their job is to transform input <key, value> pairs, according to some mapping rule, into another <key, value> pair or a batch of <key, value> pairs as output.

Table 1. The Map and Reduce functions

Function   Input              Output            Description
Map        <k1, v1>           List(<k2, v2>)    1. Each small data set is parsed further into a batch of <key, value> pairs, which are fed into the Map function for processing.
                                                2. Each input <k1, v1> produces a batch of <k2, v2> pairs, which form the intermediate results of the computation.
Reduce     <k2, List(v2)>     <k3, v3>          In the intermediate input <k2, List(v2)>, List(v2) is a batch of values that all share the same key k2. Reduce merges them to produce the final output.

Take a program that counts the number of occurrences of each word in a set of text files as an example. <k1, v1> can be <offset of a line within a file, content of that line>; the Map function turns it into a batch of intermediate <word, number of occurrences> pairs; and the Reduce function then processes these intermediate results, adding up the counts for the same word to obtain the total number of occurrences of each word.

Writing a distributed parallel program based on the MapReduce computing model is very simple. The programmer's main coding work is to implement the Map and Reduce functions; the other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the MapReduce framework (here, Hadoop), so the programmer does not need to worry about them at all.
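To make this concrete, here is a minimal sketch of what the two functions might look like for the word-count example, written against the classic org.apache.hadoop.mapred API of the 0.16-era releases (exact class and method signatures vary between Hadoop versions, and the class names WordCountMapper and WordCountReducer are illustrative only; the complete program, including the job driver, is the topic of the next article in this series):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: input is <line offset in file, content of the line>,
// output is a batch of <word, 1> pairs (the intermediate results).
class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);   // emit <word, 1>
        }
    }
}

// Reduce: input is <word, list of counts>, output is <word, total count>.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up the counts for the same word
        }
        output.collect(key, new IntWritable(sum));
    }
}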

Parallel Computing on Clusters

The MapReduce computing model is ideally suited to parallel execution on a large cluster of computers. Each Map task and each Reduce task in Figure 1 can run at the same time on its own compute node, so the computation can be very efficient. How does this parallel computation actually work?

Distributed Data Storage

Hadoop's distributed file system, HDFS, consists of one management node (the NameNode) and N data nodes (DataNodes), each of which is an ordinary computer. Using it feels very similar to using the file systems we are familiar with: you can create directories and create, copy, delete, and view files. Under the hood, however, a file is split into blocks, the blocks are spread across different DataNodes, and each block can also be replicated to several DataNodes for fault tolerance and disaster recovery. The NameNode is the core of HDFS: it maintains data structures that record how many blocks each file has been split into, which DataNodes those blocks can be obtained from, the status of each DataNode, and other important information. Readers who want to learn more about HDFS can read: The Hadoop Distributed File System: Architecture and Design.
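To an application, this splitting into blocks and replication is completely transparent: a client program just reads and writes paths through the FileSystem API, and the NameNode and DataNodes do the rest. A minimal sketch follows (the paths and the class name HdfsExample are made up for illustration, and API details differ slightly between Hadoop versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the Hadoop configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the framework splits it into blocks and
        // replicates each block across DataNodes behind the scenes.
        fs.copyFromLocalFile(new Path("file1.txt"), new Path("/user/demo/file1.txt"));

        // Read the file back; the NameNode tells the client which DataNodes hold each block.
        FSDataInputStream in = fs.open(new Path("/user/demo/file1.txt"));
        byte[] buffer = new byte[4096];
        int bytesRead = in.read(buffer);
        System.out.println(new String(buffer, 0, bytesRead));
        in.close();
    }
}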

Distributed Parallel Computing

Hadoop has a master JobTracker that schedules and manages the TaskTrackers; the JobTracker can run on any computer in the cluster. The TaskTrackers are responsible for executing tasks and must run on DataNodes, so a DataNode is both a storage node and a compute node. The JobTracker distributes Map tasks and Reduce tasks to idle TaskTrackers, lets these tasks run in parallel, and monitors their progress. If a TaskTracker fails, the JobTracker reassigns the tasks it was responsible for to another idle TaskTracker.

Local Computing

The data stored on a computer is processed by that same computer, which reduces the amount of data sent over the network and lowers the demand for network bandwidth. In a cluster-based distributed parallel system such as Hadoop, compute nodes can be added very easily, so the available computing power is nearly unlimited; data, however, has to flow between machines, which makes network bandwidth the scarce resource and the bottleneck. "Local computation" is the most effective way to save bandwidth, a principle the industry sums up as "moving computation is cheaper than moving data".

Figure 2. Granularity of distributed storage and parallel computing tasks

When the original large data set is split into small data sets, each small data set is usually no larger than one HDFS block (64 MB by default), which ensures that a single small data set resides on one computer and can be processed locally. If there are M small data sets to process, M Map tasks are started, and these M Map tasks are distributed across the N machines and run in parallel. The number of Reduce tasks, R, can be specified by the user.
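As a small illustration using the classic JobConf API (the class name JobSetup is invented for this sketch), R is something the user sets explicitly, while the number of Map tasks is derived from how the input is split:

import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf(JobSetup.class);
        // R, the number of Reduce tasks, is chosen by the user.
        conf.setNumReduceTasks(4);
        // The number of Map tasks is not set directly; it follows from how the
        // input is split, roughly one Map task per HDFS block of input.
        // conf.setNumMapTasks(...) exists but is only a hint to the framework.
        return conf;
    }
}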

Partition

The intermediate results output by a Map task are divided into R parts by key (R is the predefined number of Reduce tasks). A hash function such as hash(key) mod R is typically used for the division, which guarantees that all keys in a given range are handled by the same Reduce task and simplifies the Reduce step.
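The partitioning itself is just a function from an intermediate key to one of the R task numbers. A sketch of the idea, along the lines of Hadoop's default hash partitioner (the class name PartitionSketch is invented here):

import org.apache.hadoop.io.Text;

public class PartitionSketch {
    // Assign an intermediate key to one of the R Reduce tasks.
    static int partitionFor(Text key, int numReduceTasks) {
        // Mask off the sign bit so the result always falls in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}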

Combine

Before partitioning, the intermediate results can also be combined first: <key, value> pairs in the intermediate results that share the same key are merged into a single pair. The Combine step is very similar to Reduce, and in many cases the Reduce function is used directly as the combiner; the difference is that Combine is part of the Map task and runs immediately after the Map function. Combining reduces the number of <key, value> pairs in the intermediate results and therefore reduces network traffic.
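In the word-count example the Reduce function (summing counts) can indeed serve as the combiner, because adding up partial sums gives the same result. A rough sketch of how a job might be configured this way with the classic JobConf API, reusing the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class CombinerSetup {
    public static void useReducerAsCombiner(JobConf conf) {
        conf.setMapperClass(WordCountMapper.class);     // emits <word, 1>
        conf.setCombinerClass(WordCountReducer.class);  // runs inside each Map task, right after map()
        conf.setReducerClass(WordCountReducer.class);   // runs in the Reduce phase
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
    }
}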

Reduce Tasks Fetch Intermediate Results from Map Task Nodes

After Combine and Partition, the intermediate results of a Map task are stored as files on its local disk. The location of these intermediate result files is reported to the master JobTracker, and the JobTracker then tells the Reduce tasks which DataNodes to fetch their intermediate results from. Note that the intermediate results produced by all Map tasks are divided into R parts by applying the same hash function to their keys, and each of the R Reduce tasks is responsible for one key range. Each Reduce task fetches, from a number of Map task nodes, the intermediate results that fall within its key range, and then executes the Reduce function to produce one final result file.

Task Pipeline

With R Reduce tasks there will be R result files. In many cases these R files do not need to be merged into a single final result, because they can serve directly as the input to another computing task, which starts another parallel job.
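A rough sketch of such a two-stage pipeline with the classic API follows (it reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch; the path names are invented, and in very old releases the FileInputFormat/FileOutputFormat helpers shown here were setters on JobConf itself):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        // Stage 1: the word-count job; its R Reduce tasks write R result files to stage1-out.
        JobConf count = new JobConf(TwoStagePipeline.class);
        count.setJobName("wordcount");
        count.setMapperClass(WordCountMapper.class);
        count.setReducerClass(WordCountReducer.class);
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(count, new Path("input"));
        FileOutputFormat.setOutputPath(count, new Path("stage1-out"));
        JobClient.runJob(count);   // blocks until the first job finishes

        // Stage 2: another MapReduce job that reads the R files of stage 1 directly
        // as its input, without merging them into a single file first.
        JobConf next = new JobConf(TwoStagePipeline.class);
        next.setJobName("stage2");
        // ... set the second job's Map and Reduce classes here ...
        FileInputFormat.setInputPaths(next, new Path("stage1-out"));
        FileOutputFormat.setOutputPath(next, new Path("stage2-out"));
        JobClient.runJob(next);
    }
}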

First Experience with Hadoop

Hadoop supports Linux and Windows. Its official web site states, however, that Hadoop's distributed operation has not been rigorously tested on Windows, and recommends using Windows only as a development platform for Hadoop. The installation steps in a Windows environment are as follows (installation on Linux is similar and simpler):

(1) On Windows, you first need to install Cygwin. When installing Cygwin, be sure to select OpenSSH (in the Net category). After installation, add the Cygwin binary directory, for example C:\cygwin\bin, to the system PATH environment variable, because running Hadoop executes scripts and commands that expect a Linux-like environment.

(2) Install Java 1.5.x and set the JAVA_HOME environment variable to the Java installation root directory, for example C:\Program Files\Java\jdk1.5.0_01.

(3) Download Hadoop Core from the official Hadoop web site http://hadoop.apache.org; the latest stable release at the time of writing is 0.16.0. Extract the downloaded package into a directory, assumed here to be c:\hadoop-0.16.0.

(4) Modify the conf/hadoop-env.sh file and set the JAVA_HOME environment variable in it: export JAVA_HOME="C:\Program Files\Java\jdk1.5.0_01" (because the path Program Files contains a space, the path must be enclosed in double quotes).

At this point, everything is ready to run Hadoop. The steps below require starting Cygwin and entering its simulated Linux environment. The Hadoop Core package you downloaded contains several sample programs packaged into hadoop-0.16.0-examples.jar, among them a WordCount program that counts the number of occurrences of each word in a set of text files. Let us first look at how to run this program. Hadoop has three modes of operation: stand-alone (non-distributed) mode, pseudo-distributed mode, and fully distributed mode. The first two cannot show off the advantages of Hadoop's distributed computing and have little practical value, but they are very helpful for testing and debugging, so we start with these two modes to learn how distributed parallel programs based on Hadoop are written and run.

Stand-alone (non-distributed) mode

In this mode everything runs on a single machine without the distributed file system; files are read from and written to the local file system directly.

Code Listing 1

$ cd /cygdrive/c/hadoop-0.16.0
$ mkdir test-in
$ cd test-in
# Create two text files in the test-in directory; the WordCount program will count the occurrences of each word
$ echo "hello world bye world" > file1.txt
$ echo "hello hadoop goodbye hadoop" > file2.txt
$ cd ..
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out
# When execution finishes, view the results:
$ cd test-out
$ cat part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

Note: when running bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out, make sure the first argument is jar, not -jar. If you use -jar, the error message does not tell you that the argument is wrong; instead it reports: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver. I assumed this was a classpath problem and wasted a lot of time on it. Analyzing the bin/hadoop script shows that -jar is not an argument defined by the script; the script passes -jar straight through to Java, and Java's -jar option means "execute a jar file" (that jar file must be an executable jar, i.e. its MANIFEST must define a main class), in which case any externally defined classpath is ignored, which leads to the java.lang.NoClassDefFoundError. The jar argument, by contrast, is defined by the bin/hadoop script; it invokes Hadoop's RunJar tool class, which can also execute a jar file, and with it the externally defined classpath remains in effect.

Pseudo-distributed operation mode

This mode also runs on a single machine, but uses separate Java processes to imitate the various node types of a distributed run (NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode). Note the different roles these nodes play in a distributed run:

From the point of view of distributed storage, the nodes in a cluster consist of one NameNode and several DataNodes, plus a Secondary NameNode as a backup for the NameNode. From the point of view of distributed computation, the nodes consist of one JobTracker and several TaskTrackers; the JobTracker is responsible for scheduling tasks and the TaskTrackers for executing them in parallel. TaskTrackers must run on DataNodes so that computation can happen where the data is stored, whereas the JobTracker and the NameNode do not need to run on the same machine.

(1) Modify conf/hadoop-site.xml as shown in Code Listing 2. Note that conf/hadoop-default.xml holds Hadoop's default parameters; you can read that file to see what parameters Hadoop offers, but do not modify it. Instead, change parameter values by editing conf/hadoop-site.xml; values set there override the parameters of the same name in conf/hadoop-default.xml.

Code Listing 2

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The parameter fs.default.name specifies the IP address and port number of the NameNode. The default value is file:///, which means a local file system is used, for stand-alone non-distributed mode. Here we point it at a NameNode running on localhost.

The parameter mapred.job.tracker specifies the IP address and port number of the JobTracker. The default value is local, which means the JobTracker and the TaskTrackers run inside the same local Java process, for stand-alone non-distributed mode. Here we point it at a JobTracker running on localhost (the JobTracker runs in a separate Java process).

The parameter dfs.replication specifies how many times each block in HDFS is replicated, providing redundant backup of the data. In a typical production system this value is usually set to 3.

(2) Configure SSH as shown in Listing 3:

Code Listing 3

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

After the configuration, run ssh localhost to verify that your machine can be reached over SSH without manually entering a password.

(3) Format a new distributed file system, as shown in Listing 4:

Code Listing 4

$ cd /cygdrive/c/hadoop-0.16.0
$ bin/hadoop namenode -format

(4) Start the Hadoop processes, as shown in Listing 5. The console output should show that the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker have been started. After startup completes, you should be able to see the 5 new Java processes with ps -ef.

Code Listing 5

$ bin/start-all.sh
$ ps -ef

(5) Run the WordCount application, as shown in Listing 6:

Code Listing 6

$ bin/hadoop dfs -put ./test-in input
# Copy the ./test-in directory from the local file system into the root directory of HDFS, renaming it to input.
# Run bin/hadoop dfs -help to learn about the various HDFS commands.
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount input output
# View the results:
# Copy the files from HDFS to the local file system and view them there:
$ bin/hadoop dfs -get output output
$ cat output/*
# Or view them directly in HDFS:
$ bin/hadoop dfs -cat output/*
# Stop the Hadoop processes:
$ bin/stop-all.sh

Fault Diagnosis

(1) After you run $ bin/start-all.sh to start the Hadoop processes, 5 Java processes are started, and five PID files are created in the /tmp directory to record their process IDs. From these five files you can find out which Java process corresponds to the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. When Hadoop does not seem to be working properly, first check whether these 5 Java processes are running normally.

(2) Use the web interfaces. http://localhost:50030 shows the running state of the JobTracker, http://localhost:50060 the running state of the TaskTracker, and http://localhost:50070 the state of the NameNode and the entire distributed file system; the last one also lets you browse files in the distributed file system and view logs.

(3) Look at the log files in the ${HADOOP_HOME}/logs directory. The NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker each have their own log file, and each run of a computation task also has a pair of log files. Analyzing these logs helps to find the cause of a failure.

Concluding remarks

You now understand the basic principles of the MapReduce computing model, the distributed file system HDFS, and distributed parallel computing, and you have a working Hadoop environment in which you have run a Hadoop-based parallel program. In the next article, you will learn how to write your own distributed parallel program based on Hadoop for a specific computing task and how to run it.
