Distributed Parallel Programming with Hadoop, Part 1


Basic concepts, installation, and deployment

Cao Yuzhong (caoyuz@cn.ibm.com ),
Software Engineer, IBM China Development Center

 

Introduction: Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs and run them on computer clusters to process massive amounts of data. This article introduces basic concepts such as the MapReduce computing model and distributed parallel computing, and explains how to install, deploy, and run Hadoop.

 


 

Release date: May 22, 2008
Level: Elementary

 

Hadoop is an open-source distributed parallel programming framework that runs on large-scale clusters. Because distributed storage is essential for distributed programming, the framework also includes a distributed file system, HDFS. Hadoop itself is not yet widely known: its latest version is only 0.16, and it still seems a long way from 1.0. However, two other open-source projects closely related to it, Nutch and Lucene (both also created by Doug Cutting), are certainly well known. Lucene is an open-source, high-performance full-text search toolkit developed in Java. It is not a complete application but a simple, easy-to-use API; countless software systems and web sites around the world use Lucene to implement full-text search. Doug Cutting later created Nutch (http://www.nutch.org), the first open-source web search engine. Nutch adds web crawling and other web-related functionality on top of Lucene, along with plug-ins for parsing various file formats, and it also contains a distributed file system for storing data. Starting with version 0.8.0, Doug Cutting split the distributed file system and the MapReduce implementation out of Nutch into a new open-source project: Hadoop. Nutch itself has evolved into an open-source search engine built on Lucene full-text search and the Hadoop distributed computing platform.

With Hadoop, you can easily write distributed parallel programs that process massive amounts of data and run them on large clusters of hundreds of nodes. Hadoop seems destined for a bright future: "cloud computing" is currently one of the hottest topics in IT, and companies around the world are investing in and promoting this new generation of computing model. Hadoop is already used by several major companies as an important piece of basic software in their "cloud computing" environments. Yahoo, for example, is using the strength of the Hadoop open-source platform to compete with Google; besides funding the Hadoop development team, it is also developing Pig, an open-source, Hadoop-based project focused on the analysis of massive data sets. Based on Hadoop, Amazon has launched Amazon S3 (Amazon Simple Storage Service), which provides reliable, fast, and scalable network storage, as well as the commercial cloud computing platform Amazon EC2 (Amazon Elastic Compute Cloud). Hadoop is also an important piece of basic software in IBM's cloud computing project, the "Blue Cloud" plan, and Google is working with IBM to promote Hadoop-based cloud computing.

Why distributed parallel programming?

Under Moore's Law, programmers never had to worry about computer performance failing to keep up with the growth of software, because roughly every 18 months CPU clock speed would double and performance would double with it; software enjoyed a free performance boost without any changes. However, as transistor circuits gradually approach their physical limits, Moore's Law began to break down around 2005. We can no longer expect a single CPU to double its speed every 18 months and keep delivering ever faster computing performance. Chip manufacturers such as Intel, AMD, and IBM have instead begun to exploit the performance potential of multi-core CPUs. The arrival of the multi-core era and the Internet era will significantly change how software is written: multi-core concurrent programming and distributed parallel programming on large computer clusters are the main ways software performance will be improved in the future.

Many people believe that this major change in programming will bring a software concurrency crisis, because our traditional software approach is basically the sequential execution of a single instruction stream over a single data stream. Sequential execution fits human thinking habits quite well, but is fundamentally at odds with concurrent programming. Cluster-based distributed parallel programming lets software and data run simultaneously on many computers connected by a network, where each computer can be an ordinary PC. The biggest advantage of such a distributed parallel environment is that it is easy to add computers as new computing nodes and thereby obtain enormous computing power, while at the same time providing strong fault tolerance: the failure of a batch of computing nodes does not affect the computation or the correctness of its results. Google does exactly this: it uses a parallel programming model called MapReduce, running on a distributed file system called GFS (Google File System), to provide search services to hundreds of millions of users around the world.

Hadoop implements Google's MapReduce programming model, provides easy-to-use programming interfaces, and supplies its own distributed file system, HDFS. Unlike Google's system, Hadoop is open source, so anyone can use the framework for parallel programming. If distributed parallel programming was once difficult enough to daunt ordinary programmers, the emergence of open-source Hadoop greatly lowers that threshold. After reading this article you will find that Hadoop-based programming is very simple: without any prior experience in parallel development, you can easily develop distributed parallel programs, have them run on hundreds of machines at the same time, and complete the computation of massive data in a short time. And you need not worry that you do not own hundreds of machines to run your parallel programs on: with the spread of "cloud computing", anyone can easily obtain such massive computing power. For example, Amazon's cloud computing platform Amazon EC2 already provides this kind of on-demand computing rental service; the third article in this series will introduce it.

A little knowledge of distributed parallel programming will be essential for programmers in the future, and Hadoop is so easy to use that it is well worth a try. Perhaps you are already eager to find out what Hadoop-based programming feels like, but this programming model is quite different from traditional sequential programs, and a little background will help you better understand how Hadoop-based distributed parallel programs are written and run. This article therefore first introduces the MapReduce computing model, Hadoop's distributed file system HDFS, and how Hadoop performs parallel computing, before explaining how to install and deploy the Hadoop framework and how to run a Hadoop program.

The MapReduce computing model

MapReduce is the core computing model at Google. It abstracts parallel computation on a large-scale cluster into two functions, map and reduce: a surprisingly simple yet powerful model. A basic requirement for a data set (or task) to be suitable for MapReduce is that the data set can be divided into many small data sets, and each small data set can be processed completely in parallel.

 

Figure 1 illustrates how a large data set is processed with MapReduce. In short, the MapReduce computation breaks a large data set into hundreds or thousands of small data sets; each small data set (or a few of them) is processed by one node of the cluster (typically an ordinary computer), which produces intermediate results. These intermediate results are then merged by a large number of nodes to form the final result.

The core of the computing model is the map and reduce functions, both of which are implemented by the user. Each function takes a batch of input <key, value> pairs and converts them into another batch of <key, value> pairs.

Function | Input          | Output         | Description
Map      | <k1, v1>       | List(<k2, v2>) | Each small data set is parsed into a batch of <key, value> pairs, which are fed to the map function; every input <k1, v1> produces a batch of intermediate <k2, v2> pairs.
Reduce   | <k2, List(v2)> | <k3, v3>       | In the intermediate result <k2, List(v2)>, List(v2) is the batch of values that all share the same key k2; reduce merges them into the final output <k3, v3>.

Take a program that counts the number of occurrences of each word in a set of text files as an example. <k1, v1> can be <offset of a line in the file, content of that line>. The map function turns each line into a batch of intermediate results of the form <word, number of occurrences>, and the reduce function accumulates the counts for the same word, yielding the total number of occurrences of each word.
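To make the word-count example concrete, here is a minimal sketch of what such map and reduce functions could look like in Java, written against the old org.apache.hadoop.mapred API of the 0.16-era releases discussed in this article. The class names are illustrative assumptions and API details differ slightly between releases; Part 2 of this series walks through a real program.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountFunctions {

  // map: <line offset, line text>  ->  a batch of <word, 1>
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);   // intermediate result <k2, v2>
      }
    }
  }

  // reduce: <word, list of counts>  ->  <word, total count>
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));   // final result <k3, v3>
    }
  }
}

Text, IntWritable, and LongWritable are Hadoop's serializable wrappers for strings and numbers; they play the roles of the keys and values shown in the table above.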

Writing distributed parallel programs on the MapReduce computing model is therefore very simple. The programmer's main coding work is to implement the map and reduce functions; the other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the MapReduce framework (here, Hadoop), so programmers do not have to worry about them.
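To illustrate how little code the programmer writes beyond map and reduce, here is a hedged sketch of a job driver for the classes above, again against the 0.16-era JobConf API. The path-setting methods moved to other classes in later releases, and the argument and class names here are assumptions for illustration only.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // tell the framework which map and reduce implementations to use
    conf.setMapperClass(WordCountFunctions.Map.class);
    conf.setCombinerClass(WordCountFunctions.Reduce.class); // optional local merge, described later in this article
    conf.setReducerClass(WordCountFunctions.Reduce.class);

    // types of the final <key, value> output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // in the 0.16-era API, input and output paths are set directly on the JobConf;
    // later releases moved this to FileInputFormat/FileOutputFormat
    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    // everything else -- splitting the input, scheduling map and reduce tasks,
    // moving intermediate data, retrying failed tasks -- is handled by Hadoop
    JobClient.runJob(conf);
  }
}

Submitted with bin/hadoop jar, these few dozen lines are all the framework needs to run the computation across an entire cluster.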

Parallel computing on a Hadoop cluster

The MapReduce computing model is well suited to running in parallel on a large cluster of computers. Every map task and every reduce task in Figure 1 can run on a separate computing node at the same time, so the computation can be very efficient. How is this parallel computation actually achieved?

HDFS in Hadoop consists of a management node (the NameNode) and N data nodes (DataNodes), each of which is an ordinary computer. Using HDFS is very similar to using the file system on a single machine: you can create directories and create, copy, delete, and view files. Under the hood, however, each file is cut into blocks, which are distributed across different DataNodes; each block can also be replicated several times and stored on different DataNodes for fault tolerance and disaster recovery. The NameNode is the core of HDFS: it maintains the data structures that record how each file has been cut into blocks, which DataNodes those blocks can be obtained from, the status of each DataNode, and other important information. For more about HDFS, see "The Hadoop Distributed File System: Architecture and Design".
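Although the rest of this article drives HDFS through shell commands, the same file system can also be used programmatically through Hadoop's FileSystem API. Below is a minimal sketch, assuming the API of that era and a hypothetical path; this is not code from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Reads hadoop-default.xml and hadoop-site.xml from the classpath; the
    // fs.default.name setting decides whether this is HDFS or the local file system.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path, for illustration only
    FSDataOutputStream out = fs.create(file);       // the file is cut into blocks and replicated by HDFS
    out.writeUTF("hello HDFS");
    out.close();

    System.out.println("exists: " + fs.exists(file));
  }
}

The NameNode is what resolves a path like this to the DataNodes that actually hold its blocks; client code never needs to know where the blocks live.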

Hadoop has a JobTracker that schedules and manages the TaskTrackers; the JobTracker can run on any computer in the cluster. TaskTrackers execute the tasks and must run on DataNodes, which means a DataNode is both a storage node and a computing node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, lets them run in parallel, and monitors their progress. If a TaskTracker fails, the JobTracker reassigns its tasks to another idle TaskTracker to be run again.

Data is processed on the computer that stores it, which reduces data transfer over the network and lowers the demand for network bandwidth. In a cluster-based distributed parallel system such as Hadoop, computing nodes can be added easily, so the computing power it can provide is almost unlimited; data, however, must flow between machines, so network bandwidth becomes the scarce bottleneck resource. "Local computing" is the most effective way to save bandwidth, a principle the industry summarizes as "moving computation is cheaper than moving data".

 

When the original large data set is cut into small data sets, the size of a small data set is usually less than or equal to the size of one HDFS block (64 MB by default), which ensures that each small data set resides on a single computer and can be processed locally. If there are M small data sets to process, M map tasks are started; note that these M map tasks are distributed over the N computers and run in parallel. The number of reduce tasks can be specified by the user.
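As a small, hedged illustration of these knobs, the 0.16-era JobConf exposes the number of map and reduce tasks roughly as follows; the numbers are arbitrary examples, not values from the article.

import org.apache.hadoop.mapred.JobConf;

public class TaskGranularitySketch {
  public static void configure(JobConf conf) {
    // M is only a hint: the actual number of map tasks follows from how the
    // input is split into small data sets (about one HDFS block, 64 MB, each).
    conf.setNumMapTasks(100);
    // R fixes the number of reduce tasks, and therefore the number of
    // partitions of the intermediate results and of final output files.
    conf.setNumReduceTasks(4);
  }
}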

The intermediate results produced by the map tasks are divided into R partitions by key range (R is the predefined number of reduce tasks), usually with a hash function such as hash(key) mod R. This guarantees that all keys in a given range are processed by the same reduce task, which simplifies the reduce step.
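Here is a sketch of that hash partitioning in the shape of the old-API Partitioner interface; Hadoop's built-in HashPartitioner behaves essentially like this, and the class below is only an illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitionerSketch implements Partitioner<Text, IntWritable> {
  // Assign an intermediate <key, value> pair to one of R reduce tasks: hash(key) mod R.
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result always falls in [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public void configure(JobConf job) {
    // No configuration needed for plain hash partitioning.
  }
}

A partitioner of this shape would be registered with conf.setPartitionerClass(...); since hash(key) mod R is already the default behavior, a custom one is only needed when keys must be grouped differently across reduce tasks.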

Before partitioning, the intermediate results can also be combined: <key, value> pairs with the same key are merged into a single pair. The combine step works much like reduce, and in many cases the reduce function can be used for it directly; the difference is that combine runs immediately after the map function, on the same node. Combining reduces the number of <key, value> pairs in the intermediate results and therefore the network traffic.

After combine and partition, the intermediate results of a map task are stored as files on the local disk. The JobTracker is notified of the location of these files and in turn tells the reduce tasks which DataNodes to fetch intermediate results from. Note that all map tasks partition their intermediate results into R parts with the same hash function, and each of the R reduce tasks is responsible for one key range; a reduce task therefore has to fetch the intermediate results for its key range from many map task nodes before executing the reduce function and writing its final result file.

With R reduce tasks there are R final result files. In many cases these R files do not need to be merged into one, because they can serve directly as the input of another MapReduce job, starting the next parallel computation.

Installing and deploying Hadoop

Hadoop supports both Linux and Windows, but its official site notes that distributed operation on Windows has not been rigorously tested, so Windows is recommended only as a development platform. The installation steps on Windows are as follows (on Linux they are similar and simpler):

(1) On Windows, first install Cygwin, and be sure to select the OpenSSH package (in the Net category) during installation. After installation, add Cygwin's bin directory, for example C:/cygwin/bin, to the system PATH environment variable, because Hadoop is run through Linux-style scripts and commands.

(2) Install Java 1.5.x and set the JAVA_HOME environment variable to the Java installation root directory, for example C:/Program Files/Java/jdk1.5.0_01.

(3) Download Hadoop Core from the official Hadoop website, http://hadoop.apache.org. The latest stable release at the time of writing is 0.16.0. Unpack the downloaded archive into a directory; this article assumes it is unpacked to C:/hadoop-0.16.0.

(4) Edit the conf/hadoop-env.sh file and set the JAVA_HOME environment variable: export JAVA_HOME="C:/Program Files/Java/jdk1.5.0_01" (because the path contains a space in Program Files, it must be enclosed in double quotation marks).

Everything is now ready to run Hadoop. For the steps below, start Cygwin to enter the simulated Linux environment. The downloaded Hadoop Core package contains several example programs packaged into hadoop-0.16.0-examples.jar; one of them, wordcount, counts the number of times each word appears in a set of text files. Let's first see how to run this program. Hadoop has three running modes: standalone (non-distributed) mode, pseudo-distributed mode, and fully distributed mode. The first two do not show off the advantages of Hadoop's distributed computing and have little practical value, but they are very helpful for testing and debugging. We will start with these two modes to understand how Hadoop's distributed parallel programs are written and run.

Standalone mode runs on a single machine without a distributed file system; it reads from and writes to the local operating system's file system directly. The steps to run the wordcount example in this mode are shown in Code Listing 1:

$ cd /cygdrive/c/hadoop-0.16.0
$ mkdir test-in
$ cd test-in
# Create two text files in the test-in directory; the wordcount program will count the occurrences of each word
$ echo "hello world bye world" > file1.txt
$ echo "hello hadoop goodbye hadoop" > file2.txt
$ cd ..
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out
# After the job completes, view the results:
$ cd test-out
$ cat part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2
Note: when running bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out, make sure the first argument is jar, not -jar. With -jar, the error reported is: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver. I initially assumed this was a classpath problem and wasted a lot of time on it. Reading the bin/hadoop script shows that -jar is not an argument the script defines; it simply passes -jar through to Java, and Java's -jar option means "execute a jar file" (the jar must be executable, i.e. define a main class in its manifest), in which case any externally defined classpath is ignored, hence the java.lang.NoClassDefFoundError. The jar argument, by contrast, is defined by the bin/hadoop script and invokes Hadoop's own RunJar tool class, which can also execute a jar file while keeping the externally defined classpath in effect.

Pseudo-distributed mode also runs on a single machine, but uses separate Java processes to simulate the various node types of a distributed deployment (NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode). Note the roles these nodes play in a real distributed deployment:

From the perspective of distributed storage, the nodes in the cluster consist of one NameNode and several DataNodes, with an additional Secondary NameNode acting as a backup for the NameNode. From the perspective of distributed computation, the nodes consist of one JobTracker and several TaskTrackers: the JobTracker is responsible for scheduling tasks, and the TaskTrackers execute them in parallel. A TaskTracker must run on a DataNode so that data can be computed locally, whereas the JobTracker and the NameNode do not need to run on the same machine.

(1) Modify conf/hadoop-site.xml as shown in Code Listing 2. Note that conf/hadoop-default.xml holds Hadoop's default parameters; you can read it to see which parameters are available, but do not modify it. Instead, change parameter values in conf/hadoop-site.xml, which overrides any parameter of the same name in conf/hadoop-default.xml.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The fs.default.name parameter specifies the IP address (or host name) and port of the NameNode. The default value is file:///, which means the local file system is used, i.e. the single-machine non-distributed mode. Here we specify a NameNode running on localhost.

The mapred.job.tracker parameter specifies the IP address (or host name) and port of the JobTracker. The default value is local, which means the JobTracker and the TaskTrackers run inside the same local Java process, i.e. the single-machine non-distributed mode. Here we specify a JobTracker running on localhost, as a separate Java process.

The dfs.replication parameter specifies how many times each block in HDFS is replicated, serving as redundant backup of the data. In a typical production system this value is usually set to 3.
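As a hedged aside, these parameters surface in code through Hadoop's Configuration class, which in the releases of that era loads hadoop-default.xml first and then hadoop-site.xml, so the site file overrides the defaults. A minimal sketch:

import org.apache.hadoop.conf.Configuration;

public class ShowConfigSketch {
  public static void main(String[] args) {
    // hadoop-default.xml is read first, then hadoop-site.xml; later values win.
    Configuration conf = new Configuration();
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    System.out.println("dfs.replication    = " + conf.getInt("dfs.replication", 3));
  }
}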

(2) Configure SSH, as shown in Code Listing 3:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

After the configuration is complete, run ssh localhost to confirm that your machine can be reached over SSH without manually entering a password.

(3) Format a new distributed file system, as shown in Code Listing 4:

$ cd /cygdrive/c/hadoop-0.16.0
$ bin/hadoop namenode -format
(4) Start the Hadoop daemons, as shown in Code Listing 5. The console output should show that the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker have all been started. After startup, you can see the five new Java processes with ps -ef.
$ bin/start-all.sh
$ ps -ef
(5) Run the wordcount application, as shown in Code Listing 6:
$ bin/hadoop dfs -put ./test-in input
# Copy the local ./test-in directory into the root directory of HDFS under the name input
# Run bin/hadoop dfs -help to learn how to use the various HDFS commands
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount input output
# View the execution results:
# Copy the files from HDFS to the local file system and view them:
$ bin/hadoop dfs -get output output
$ cat output/*
# Or view them directly in HDFS:
$ bin/hadoop dfs -cat output/*
$ bin/stop-all.sh # Stop the Hadoop daemons
A few things worth knowing when troubleshooting:

(1) After $ bin/start-all.sh starts Hadoop, five Java processes are launched and five PID files are created under the /tmp directory to record their process IDs. These files tell you which Java process corresponds to the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. When Hadoop does not seem to be working properly, first check whether these five Java processes are running normally.

(2) Use the web interfaces. Visit http://localhost:50030 to view the status of the JobTracker, http://localhost:50060 to view the status of the TaskTracker, and http://localhost:50070 to view the status of the NameNode and the whole distributed file system, browse the files in HDFS, and view logs.

(3) Check the log files under the ${HADOOP_HOME}/logs directory. The NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker each have their own log file, and each computing task also has application log files. Analyzing these logs helps locate the cause of a failure.

Conclusion

You now understand the basic principles of the MapReduce computing model, the distributed file system HDFS, and distributed parallel computing; you have a working Hadoop environment and have run a Hadoop-based parallel program. In the next article, you will learn how to write your own Hadoop-based distributed parallel program for a specific computing task and how to deploy and run it.

Statement: this article represents only the author's personal views and does not represent the views of IBM.

Learning

  • Visit the official Hadoop website to learn about Hadoop and its sub-project HBase.
  • The Hadoop wiki hosts many Hadoop user documents, developer documents, sample programs, and more.
  • Read Google's MapReduce paper, "MapReduce: Simplified Data Processing on Large Clusters", to learn more about the MapReduce computing model.
  • Learn about Hadoop's distributed file system HDFS from "The Hadoop Distributed File System: Architecture and Design".
  • Learn about the Google File System from "The Google File System"; Hadoop's HDFS provides functionality similar to GFS.

Discussion

  • Join the Hadoop developer mailing list to follow the latest development progress of the Hadoop project.

Cao Yuzhong holds a master's degree in computer software and theory from Beijing University of Aeronautics and Astronautics. He has several years of development experience with C, Java, databases, and telecom billing software in UNIX environments, and his technical interests include OSGi and search technologies. He currently develops system management software at the IBM China Systems and Technology Lab and can be reached at caoyuz@cn.ibm.com.

 
