Introduction to Hadoop

Original article:

http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop1/index.html

 

Cao Yuzhong (caoyuz@cn.ibm.com), software engineer, IBM China Development Center

May 22, 2008

Hadoop is an open-source distributed parallel programming framework that implements the MapReduce computing model. With Hadoop, programmers can easily write distributed parallel programs, run them on computer clusters, and complete computations over massive amounts of data. This article introduces basic concepts such as the MapReduce computing model and distributed parallel computing, and then walks through installing, deploying, and running Hadoop.

Introduction to Hadoop

Hadoop is an open-source distributed parallel programming framework that runs on large-scale clusters. Because distributed storage is essential for distributed programming, the framework also includes a distributed file system, HDFS. So far Hadoop itself is not widely known: its latest version is only 0.16, and it still seems a long way from 1.0. However, two other open-source projects closely tied to it, Nutch and Lucene (all founded by Doug Cutting), are definitely well known. Lucene is an open-source, high-performance full-text retrieval toolkit developed in Java; it is not a complete application but a simple, easy-to-use API, and countless software systems and web sites around the world build their full-text search on it. Later, Doug Cutting created Nutch (http://www.nutch.org), the first open-source web search engine, which in addition to search functionality also contains a distributed file system for storing data. Starting from version 0.8.0, Doug Cutting separated the distributed file system and the MapReduce implementation from Nutch to form the new open-source project Hadoop, and Nutch itself evolved into an open-source search engine built on Lucene full-text retrieval and the Hadoop distributed computing platform.

With Hadoop, you can easily write distributed parallel programs that process massive amounts of data and run them on large clusters of hundreds of nodes. Judging from the current situation, Hadoop is destined for a bright future: "cloud computing" is the red-hot buzzword of the moment, and IT companies around the world are investing in and promoting this new generation of computing model. Hadoop is already being used by several major companies as an important piece of basic software in their "cloud computing" environments. For example, Yahoo is using the strength of the open-source Hadoop platform to confront Google; besides funding the Hadoop development team, it is also developing Pig, an open-source, Hadoop-based distributed computing project focused on the analysis of massive datasets. Amazon has launched Amazon S3 (Amazon Simple Storage Service), based on Hadoop, to provide reliable, fast, and scalable network storage, as well as the commercial cloud computing platform Amazon EC2 (Amazon Elastic Compute Cloud). Hadoop is also an important piece of basic software in IBM's cloud computing project, the "Blue Cloud" plan, and Google is working with IBM to promote Hadoop-based cloud computing.

 

Embracing a Transformation in Programming Methods

Under Moore's Law, programmers never had to worry about computer performance failing to keep up with software: roughly every 18 months the CPU clock speed doubled, performance doubled with it, and software enjoyed a free performance boost without any changes. However, because transistor circuits are gradually approaching their physical limits, Moore's Law began to break down around 2005; we can no longer expect a single CPU to double in speed every 18 months and deliver ever-faster computing for free. Chip manufacturers such as Intel, AMD, and IBM have instead begun to exploit the performance potential of the CPU through multiple cores. The arrival of the multi-core era and the Internet era will significantly change the way software is written: multi-core concurrent programming and distributed parallel programming on large-scale computer clusters are the main ways software performance will be improved in the future.

Many people believe this major change in programming will bring on a software concurrency crisis, because our traditional software approach is essentially the sequential execution of a single instruction stream over a single data stream. Sequential execution fits human thinking habits quite well, but it is at odds with concurrent programming. Cluster-based distributed parallel programming lets software and data run simultaneously on many computers connected by a network, where each computer can be an ordinary PC. The biggest advantage of such a distributed parallel environment is that it is easy to add computers as new computing nodes and thereby obtain enormous computing power, while also gaining strong fault tolerance: the failure of a batch of computing nodes does not affect the computation or the correctness of its results. Google has been doing exactly this; it uses a parallel programming model called MapReduce for distributed parallel programming, runs it on top of a distributed file system called GFS (Google File System), and provides search services to hundreds of millions of users around the world.

Hadoop implements Google's MapReduce programming model, provides easy-to-use programming interfaces, and supplies its own distributed file system, HDFS. Unlike Google's systems, Hadoop is open source, so anyone can use the framework for parallel programming. If distributed parallel programming was once difficult enough to daunt ordinary programmers, the emergence of the open-source Hadoop greatly lowers that threshold. After reading this article, you will find that Hadoop-based programming is very simple: without any prior experience in parallel development, you can easily develop distributed parallel programs, have them run on hundreds of machines at once, and complete the computation of massive amounts of data in a short time. You may feel that you could never have hundreds of machines on which to run your parallel programs, but with the spread of "cloud computing," anyone can readily obtain such massive computing power. For example, Amazon's cloud computing platform Amazon EC2 already offers this kind of on-demand computing for rent; the third part of this series of articles will introduce it.

Mastering some basic knowledge of distributed parallel programming is essential for future programmers, and Hadoop is so easy to use that there is no reason not to try it. Perhaps you are already eager to see what Hadoop-based programming is like, but this programming model is, after all, very different from traditional sequential programs, and a little background will help you better understand how Hadoop-based distributed parallel programs are written and run. This article therefore first introduces the MapReduce computing model, the distributed file system HDFS in Hadoop, and how Hadoop performs parallel computing, before describing how to install and deploy the Hadoop framework and how to run Hadoop programs.

 

The MapReduce Computing Model

MapReduce is Google's core computing model. It abstracts the parallel computation running on a large-scale cluster into two functions, map and reduce, a surprisingly simple yet powerful model. A dataset (or task) suitable for MapReduce processing has one basic requirement: the dataset can be split into many small datasets, and each small dataset can be processed completely in parallel.

Figure 1. The MapReduce computing process

Figure 1 illustrates how a large dataset is processed with MapReduce. In short, the MapReduce computation breaks a large dataset into hundreds or thousands of small datasets; each small dataset (or a few of them) is processed by one node in the cluster (typically an ordinary computer) to produce intermediate results, and these intermediate results are then merged by a large number of nodes to form the final result.

The core of the computing model is the map and reduce functions, both of which are implemented by the user. Their job is to transform input <key, value> pairs into another <key, value> pair, or a batch of them.

Table 1. The map and reduce functions

map:    input <k1, v1>; output list(<k2, v2>)
        A small input dataset is parsed into a batch of <key, value> pairs and fed to the map function. Each input <k1, v1> produces a batch of <k2, v2> pairs, which form the intermediate results of the computation.

reduce: input <k2, list(v2)>; output <k3, v3>
        In the intermediate result <k2, list(v2)>, list(v2) is the batch of values that share the same key k2. The reduce function merges these values to produce the final output.

Take a program that counts the number of occurrences of each word in a set of text files as an example. <k1, v1> can be <offset of a line in the file, content of that line>; after the map function processes it, a batch of intermediate results <word, number of occurrences> is produced; the reduce function then accumulates the occurrences of the same word in the intermediate results to obtain the total count for each word.
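
To make this concrete, here is a sketch of what such map and reduce functions could look like in Java, following the well-known WordCount example for the classic org.apache.hadoop.mapred API (exact class and method signatures varied slightly across early releases such as 0.16, so treat this as an illustrative sketch rather than code guaranteed to compile against any particular version):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map: <line offset, line text>  ->  a batch of <word, 1> pairs
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);          // emit <word, 1> as an intermediate result
      }
    }
  }

  // reduce: <word, list of counts>  ->  <word, total count>
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();          // accumulate occurrences of the same word
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}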

Writing distributed parallel programs with the MapReduce computing model is very simple. The programmer's main coding work is to implement the map and reduce functions; the other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the MapReduce framework (for example, Hadoop), so the programmer does not have to worry about them.
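
To show just how little wiring the programmer has to do, here is a sketch of a driver that submits the map and reduce functions above as a job; again it targets the classic org.apache.hadoop.mapred API, and helpers such as FileInputFormat.setInputPaths appeared in slightly different forms in early releases:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // key/value types of the final <k3, v3> output
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. test-in
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. test-out

    // Splitting the input, scheduling, data movement, load balancing and fault
    // tolerance are all handled by the framework.
    JobClient.runJob(conf);
  }
}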

 

 

Parallel Computing on a Cluster

The MapReduce computing model is well suited to parallel execution on a large cluster of computers. Each map task and each reduce task in Figure 1 can run on a separate computing node at the same time, so the computation can be very efficient. How is this parallel computing actually done?

Data distribution and storage

HDFS in Hadoop consists of one management node (the NameNode) and N data nodes (DataNodes), each of which is an ordinary computer. Using it feels very much like using the file system on a single machine: you can create directories, create, copy, and delete files, and view file contents. Under the hood, however, a file is cut into blocks, which are scattered across different DataNodes; each block can also be replicated several times and stored on different DataNodes for fault tolerance and disaster recovery. The NameNode is the core of the whole HDFS. By maintaining a set of data structures it records how many blocks each file has been cut into, which DataNodes those blocks can be obtained from, and other important information such as the status of each DataNode. To learn more about HDFS, see the reference: The Hadoop Distributed File System: Architecture and Design.
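
As an illustration of how using HDFS resembles using an ordinary file system, the minimal Java sketch below creates a directory and a file and reads the file back through the org.apache.hadoop.fs.FileSystem API; the class name HdfsDemo and the paths are invented for the example, and the command-line equivalents (bin/hadoop dfs -put, -cat, and so on) appear in the code listings later in this article:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.default.name from the config files
    FileSystem fs = FileSystem.get(conf);       // HDFS if so configured, the local file system otherwise

    Path dir = new Path("/demo");
    fs.mkdirs(dir);                             // create a directory

    Path file = new Path(dir, "hello.txt");
    FSDataOutputStream out = fs.create(file);   // create a file and write to it
    out.writeBytes("hello hdfs\n");             // the file is transparently cut into blocks and replicated
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());          // read the file back
    in.close();
  }
}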

Distributed Parallel Computing

Hadoop has one JobTracker, which schedules and manages a set of TaskTrackers; the JobTracker can run on any computer in the cluster. The TaskTrackers execute the tasks and must run on the DataNodes; in other words, a DataNode is both a storage node and a computing node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, lets the tasks run in parallel, and monitors their progress. If a TaskTracker fails, the JobTracker hands its tasks to another idle TaskTracker to run again.

Local computing

Data is processed on the computer where it is stored, which reduces data transfer over the network and lowers the demand for network bandwidth. In a cluster-based distributed parallel system such as Hadoop, computing nodes can be added very easily, so the computing power it can provide is almost unlimited; however, because data has to flow between different computers, network bandwidth becomes a scarce bottleneck. "Local computing" is the most effective way to save network bandwidth; the industry describes this as "moving the computation is cheaper than moving the data."

Figure 2. Distributed storage and parallel computing

Task Granularity

When a large original dataset is cut into small datasets, the size of each small dataset is usually less than or equal to the size of one block in HDFS (64 MB by default), which ensures that a small dataset resides on one computer and can be processed locally. If there are M small datasets to process, M map tasks are started, and these M map tasks are distributed across the N computers and run in parallel. The number of reduce tasks R can be specified by the user.
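
In the classic API these counts are exposed on the job configuration. Continuing the hypothetical JobConf named conf from the driver sketch above (the numbers are made up for illustration):

conf.setNumMapTasks(100);    // only a hint: the real M is decided by how the input is split into blocks
conf.setNumReduceTasks(4);   // R = 4 reduce tasks, producing output files part-00000 ... part-00003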

Partition

The intermediate results produced by the map tasks are divided by key into R parts (R is the predefined number of reduce tasks), typically using a hash function such as hash(key) mod R. This guarantees that all keys in a certain range are handled by a single reduce task, which simplifies the reduce step.
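
This is what Hadoop's default hash partitioner does. As a sketch, a hand-written equivalent against the classic org.apache.hadoop.mapred.Partitioner interface could look like the following (the class name WordPartitioner is invented for the example); it would be registered with conf.setPartitionerClass(WordPartitioner.class), and if no partitioner is set the default hash-based one is used:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // hash(key) mod R; masking off the sign bit keeps the partition number non-negative
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}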

Combine

Before partitioning, the intermediate results can first be combined: <key, value> pairs with the same key in the intermediate results are merged into one pair. The combine process is similar to the reduce process, and in many cases the reduce function itself can be used directly, but the combine function runs immediately after the map function. Combining reduces the number of <key, value> pairs in the intermediate results and therefore reduces network traffic.
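
In the word-count example, summing partial counts is exactly what reduce does, so the same Reduce class can double as the combiner. Continuing the hypothetical driver sketch above, enabling it is a single line:

conf.setCombinerClass(WordCount.Reduce.class);   // run a local "mini reduce" over each map task's output before it is partitioned and shipped over the network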

The reduce task obtains the intermediate result from the map task node.

The intermediate results of a map task, after combine and partition, are stored as files on the local disk. The JobTracker is told where these intermediate result files are located, and it in turn tells the reduce tasks which DataNodes to fetch the intermediate results from. Note that the intermediate results produced by all map tasks are split into R parts by the same hash function on their keys, and each of the R reduce tasks is responsible for one key interval. Each reduce task fetches the intermediate results that fall within its key interval from several map task nodes, and then runs the reduce function to produce a final result file.

Task Pipeline

If there are R reduce tasks, there will be R final result files. In many cases these R files do not need to be merged into a single result, because they can be fed directly as input to another computing task, starting another parallel computation.
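
A sketch of such a pipeline, reusing the hypothetical driver from earlier (the second job here is left schematic and would normally set its own mapper and reducer):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    // First stage: the word-count job from the earlier sketches.
    JobConf count = new JobConf(WordCount.class);
    count.setJobName("wordcount");
    // ... mapper, reducer and key/value classes as in the earlier driver sketch ...
    FileInputFormat.setInputPaths(count, new Path("input"));
    FileOutputFormat.setOutputPath(count, new Path("wordcount-out"));
    JobClient.runJob(count);            // leaves R files: part-00000 ... part-0000(R-1)

    // Second stage: a follow-up job that consumes those R files directly, without merging them.
    JobConf next = new JobConf();
    next.setJobName("second-stage");    // hypothetical; a real job would set its own mapper and reducer here
    FileInputFormat.setInputPaths(next, new Path("wordcount-out"));
    FileOutputFormat.setOutputPath(next, new Path("second-stage-out"));
    JobClient.runJob(next);
  }
}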

 

A First Experience with Hadoop

Hadoop supports both Linux and Windows. However, its official website notes that distributed operation of Hadoop has not been rigorously tested on Windows, so Windows is recommended only as a development platform for Hadoop. The installation steps on Windows are as follows (on Linux the steps are similar and simpler):

(1) On Windows, first install Cygwin, making sure to include OpenSSH (in the Net category). After installation, add the Cygwin installation directory, for example C:/cygwin/bin, to the system PATH environment variable, because Hadoop is run through Linux-style scripts and commands.

(2) Install Java 1.5.x and set the JAVA_HOME environment variable to the Java installation root directory, for example C:/Program Files/Java/jdk1.5.0_01.

(3) Go to the Hadoop official website (http://hadoop.apache.org) and download Hadoop Core; the latest stable version is 0.16.0. Unzip the downloaded package into a directory; this article assumes it is unzipped to C:/hadoop-0.16.0.

(4) Modify the conf/hadoop-env.sh file to set the JAVA_HOME environment variable: export JAVA_HOME="C:/Program Files/Java/jdk1.5.0_01" (because the path contains a space in "Program Files", it must be enclosed in double quotation marks).

Now everything is ready to run Hadoop. In the steps that follow, start Cygwin to enter the simulated Linux environment. The downloaded Hadoop Core package contains several sample programs packaged into hadoop-0.16.0-examples.jar, among them a wordcount program that counts the number of times each word appears in a batch of text files. Let us first see how to run this program. Hadoop has three running modes: standalone (non-distributed) mode, pseudo-distributed mode, and distributed mode. The first two do not show off the advantages of Hadoop's distributed computing and have no practical value, but they are very helpful for testing and debugging programs, so we start with these two modes to learn how Hadoop-based distributed parallel programs are written and run.

Standalone (non-distributed) Mode

In this mode everything runs on a single machine without a distributed file system; files are read from and written to the local operating system's file system directly.

Code Listing 1

$ cd /cygdrive/c/hadoop-0.16.0
$ mkdir test-in
$ cd test-in
# Create two text files in the test-in directory; the wordcount program will count how many times each word appears in them.
$ echo "hello world bye world" > file1.txt
$ echo "hello hadoop goodbye hadoop" > file2.txt
$ cd ..
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out
# After execution completes, view the results:
$ cd test-out
$ cat part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2

Note: when running bin/hadoop jar hadoop-0.16.0-examples.jar wordcount test-in test-out, make sure the first argument is jar, not -jar. If you use -jar, the error reported is: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver. At first I thought it was a classpath problem, which wasted a lot of time. Analyzing the bin/hadoop script shows that -jar is not an argument defined by the script; the script passes -jar straight through to Java, and Java's -jar option means "execute a jar file" (the jar must be executable, that is, its manifest must define a Main-Class), in which case an externally defined classpath is ignored, hence the java.lang.NoClassDefFoundError. jar, on the other hand, is an argument defined by the bin/hadoop script; it invokes Hadoop's own tool class RunJar, which can also execute a jar file while still honoring the externally defined classpath.

Pseudo-Distributed Running Mode

This mode also runs on a single machine, but uses separate Java processes to simulate the various node types of a distributed deployment (NameNode, DataNode, JobTracker, TaskTracker, Secondary NameNode). Note how these nodes differ in a distributed deployment:

From the perspective of distributed storage, the cluster's nodes consist of one NameNode and several DataNodes, plus a Secondary NameNode that serves as a backup for the NameNode. From the perspective of distributed computing, the cluster's nodes consist of one JobTracker and several TaskTrackers; the JobTracker schedules tasks and the TaskTrackers execute them in parallel. TaskTrackers must run on DataNodes so that computation can happen where the data is stored; the JobTracker and NameNode, however, do not need to run on the same machine.

(1) Modify conf/hadoop-site.xml as shown in Code Listing 2. Note that conf/hadoop-default.xml holds Hadoop's default parameters; you can read it to learn which parameters are available for configuration, but do not modify it. Instead, change parameter values by editing conf/hadoop-site.xml, whose settings override parameters of the same name in conf/hadoop-default.xml.

Code Listing 2

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The fs.default.name parameter specifies the address and port of the NameNode. The default value is file://, meaning the local file system is used, as in single-machine non-distributed mode. Here we specify a NameNode running on the local machine, localhost.

The mapred.job.tracker parameter specifies the address and port of the JobTracker. The default value is local, meaning the JobTracker and TaskTracker run inside the same local Java process, as in standalone non-distributed mode. Here we specify a JobTracker running on localhost, in its own Java process.

The dfs.replication parameter specifies the number of times each block in HDFS is replicated, serving as redundant backup of the data. In a typical production system this value is usually set to 3.

(2) Configure SSH, as shown in Code Listing 3:

Code Listing 3

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

After this is done, run ssh localhost to confirm that your machine can be reached over SSH without being prompted for a password.

(3) Format a new distributed file system, as shown in Code Listing 4:

Code Listing 4

$ cd /cygdrive/c/hadoop-0.16.0
$ bin/hadoop namenode -format

(4) Start the Hadoop processes, as shown in Code Listing 5. The console output should show that the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker have been started. After startup completes, ps -ef should show five new Java processes.

Code Listing 5

$ bin/start-all.sh
$ ps -ef

(5) Run the wordcount application, as shown in Code Listing 6:

Code Listing 6

$ bin/hadoop dfs -put ./test-in input
# Copy the local ./test-in directory into the root directory of HDFS and rename it input.
# Run bin/hadoop dfs -help to learn how to use the various HDFS commands.
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount input output
# View the results: either copy the files from HDFS to the local file system first,
$ bin/hadoop dfs -get output output
$ cat output/*
# or view them directly in HDFS:
$ bin/hadoop dfs -cat output/*
# Stop the Hadoop processes:
$ bin/stop-all.sh

Fault Diagnosis

(1) After $ bin/start-all.sh starts the Hadoop processes, five Java processes are started and five PID files are created under the /tmp directory to record their process IDs. These files tell you which process corresponds to the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. When Hadoop seems to be misbehaving, first check whether these five Java processes are running normally.

(2) Use the web interfaces. Visit http://localhost:50030 to see the status of the JobTracker, http://localhost:50060 to see the status of the TaskTracker, and http://localhost:50070 to see the status of the NameNode and the entire distributed file system, browse the files in it, and view its logs.

(3) Check the log files under the ${HADOOP_HOME}/logs directory. The NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker each have their own log file, and every computing task also has its application log files. Analyzing these logs helps you find the cause of a fault.

 

Conclusion

You now know the basic principles of the MapReduce computing model, the distributed file system HDFS, and distributed parallel computing, and you have a working Hadoop environment in which you have run a Hadoop-based parallel program. In the next article, you will learn how to write your own Hadoop-based distributed parallel program for a specific computing task, and how to deploy and run it.

Statement: This article represents only the author's personal views and does not represent the views of IBM.

 
