People rely on search engines every day to find specific content in the massive amount of data on the Internet. But have you ever wondered how these searches are executed? One approach is Apache Hadoop, a software framework that can process massive amounts of data in a distributed manner. One application of Hadoop is indexing Internet Web pages in parallel. Hadoop is an Apache project supported by companies such as Google and IBM. This article introduces the Hadoop framework and shows why it is one of the most important Linux-based distributed computing frameworks.
Hadoop was officially introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a sub-project of Lucene. It was inspired by MapReduce and the Google File System, first developed at Google Labs. In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were moved into a project of their own called Hadoop.
Hadoop is the most popular tool for classifying search keywords on the Internet, but it can also solve many other problems that demand great scalability. For example, what happens if you want to grep a giant 10 TB file? On a traditional system, this takes a long time. Hadoop, however, can greatly improve efficiency, because these problems were taken into account in its design.
Prerequisites
Hadoop is a software framework that can process large amounts of data in a distributed manner, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of working data and can redistribute work around failed nodes. Hadoop is efficient because it works in parallel, accelerating processing through parallelism. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop runs on commodity servers, so its cost is relatively low and it can be used by anyone.
Because the framework is written in Java, you may have already guessed that Hadoop is ideal for running on a Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.
Hadoop Architecture
Hadoop has many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in the Hadoop cluster. The layer above HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
HDFS
To external clients, HDFS looks like a traditional hierarchical file system: you can create, delete, move, or rename files. However, the HDFS architecture is built from a specific set of nodes (see Figure 1), which is determined by its own characteristics. These nodes include a NameNode (only one), which provides metadata services within HDFS, and DataNodes, which provide storage blocks for HDFS. Because only one NameNode exists, it is a drawback of HDFS (a single point of failure).
Figure 1. Simplified view of a Hadoop cluster
Files stored in HDFS are divided into blocks and replicated to multiple computers (DataNodes). This is very different from a traditional RAID architecture. The block size (commonly 64 MB) and the number of replicas are determined by the client when the file is created. The NameNode controls all file operations. All communication within HDFS is based on the standard TCP/IP protocol.
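As a small illustration of how a client specifies these parameters, the following Java sketch uses the org.apache.hadoop.fs.FileSystem API to create a file with an explicit replication factor and block size. The NameNode URI and the file path are only examples, and the snippet assumes the Hadoop client libraries are on the classpath:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; the URI below is illustrative only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/sample.txt");
        // Create the file with a 4 KB buffer, 3 replicas, and a 64 MB block size.
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();

        fs.close();
    }
}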
NameNode
The NameNode is a piece of software that usually runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides how files map to replicated blocks on DataNodes. In the most common case of three replicas, the first two replicas are stored on different nodes within the same rack, and the last replica is stored on a node in a different rack. Note that this requires knowledge of the cluster architecture.
Actual I/O transactions do not pass through the NameNode; only the metadata describing the mapping of DataNodes to file blocks does. When an external client sends a request to create a file, the NameNode responds with the block ID and the IP address of the DataNode for the first copy of the block. The NameNode also notifies the other DataNodes that will receive copies of the block.
The NameNode stores all information about the file system namespace in a file called FsImage. This file, along with a record of all transactions (called the EditLog), is stored in the NameNode's local file system. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system.
DataNode
A DataNode is also a piece of software that usually runs on a separate machine in an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, with all systems in a rack connected through a switch. One assumption Hadoop makes is that transfer speeds between nodes within a rack are faster than between nodes in different racks.
DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each message contains a block report that the NameNode can use to validate the block mapping and other file system metadata. If a DataNode fails to send its heartbeat, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.
File Operations
As you can see, HDFS is not a general-purpose file system. Its main purpose is to support streaming access to large files. When a client wants to write a file to HDFS, it must first cache the file in local temporary storage. Once the cached data exceeds the required HDFS block size, a file-creation request is sent to the NameNode. The NameNode responds to the client with the identity of the DataNode and the destination block, and it also notifies the DataNodes that will hold replicas of the file block. As the client begins sending the temporary file to the first DataNode, the block contents are immediately forwarded, in a pipeline fashion, to the replica DataNodes. The client is also responsible for creating a checksum file that is stored in the same HDFS namespace. After the last file block is sent, the NameNode commits the file creation to its persistent metadata store (the EditLog and FsImage files).
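The division of labor between the NameNode (metadata) and the DataNodes (data) can also be observed from a client. The following Java sketch asks for the block locations of a file, a metadata request that is answered by the NameNode; the URI and path are again only examples:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/user/demo/sample.txt");

        // Ask the NameNode which blocks make up the file and which
        // DataNodes hold each replica; no file data is transferred here.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}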
Linux cluster
The Hadoop framework can be used on a single Linux platform (for development and debugging), but its real power is realized with racks of commodity servers. These racks form a Hadoop cluster. Hadoop uses knowledge of the cluster topology to decide how to distribute jobs and files across the cluster. Hadoop assumes that nodes may fail, so it natively handles the failure of a single computer or even an entire rack.
Hadoop Application
One of the most common uses of Hadoop is Web search. It is not the only software framework for this application, but as a parallel data-processing engine its performance stands out. One of the most interesting aspects of Hadoop is its Map and Reduce process, which was inspired by Google's work. This process, called index creation, takes the text of Web pages retrieved by a Web crawler as input and reports the frequency of the words on those pages as the result. That result can then be used throughout the Web search process to identify content matching the defined search parameters.
MapReduce
The simplest MapReduce application contains at least three parts: a Map function, a Reduce function, and a main function. The main function combines job control with file input/output. For this purpose, Hadoop provides a large number of interfaces and abstract classes, giving Hadoop application developers many tools for debugging and performance measurement.
MapReduce is a software framework for processing large datasets in parallel. MapReduce has its roots in the map and reduce functions of functional programming. It consists of two operations, each of which may have many instances (many maps and many reduces). The map function takes a set of data and converts it into a list of key/value pairs, one key/value pair for each element in the input domain. The reduce function takes the list produced by the map function and reduces it, based on the keys, to a smaller list of key/value pairs (producing one key/value pair per key).
Here is an example to help you understand it. Assume the input domain is: one small step for man, one giant leap for mankind. Running the map function over this domain produces the following list of key/value pairs:
(one, 1) (small, 1) (step, 1) (for, 1) (man, 1) (one, 1) (giant, 1) (leap, 1) (for, 1) (mankind, 1)
If the reduce function is applied to the list of key/value pairs, the following key/value pairs are obtained:
(one, 2) (small, 1) (step, 1) (for, 2) (man, 1) (giant, 1) (leap, 1) (mankind, 1)
The result is a count of the words in the input domain, which is clearly useful for building an index. Now suppose there are two input domains: the first is one small step for man, and the second is one giant leap for mankind. You can run the map and reduce functions on each domain separately and then apply the two resulting lists of key/value pairs to another reduce function; the result is the same as before. In other words, the same operations can be applied in parallel across the input domains, and the results are identical but arrive faster. This is the power of MapReduce: its parallelism can be applied across any number of systems. Figure 2 illustrates this idea in terms of segmenting and iterating.
Figure 2. Conceptual flow of a MapReduce process
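To make the map and reduce steps concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It is only an illustration, not the article's original code, and the class names TokenizerMapper and IntSumReducer are arbitrary:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every word in a line of input text.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. (one, 1), (small, 1), ...
        }
    }
}

// Sums the counts for each word, producing (word, total).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);     // e.g. (one, 2), (for, 2), ...
    }
}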
Now back to Hadoop: how does it run such an application across a cluster? A MapReduce application started on behalf of a client runs under a single master process called the JobTracker. Like the NameNode, it is the only system in the Hadoop cluster that controls MapReduce applications. When an application is submitted, the input and output directories, which reside in HDFS, are provided. The JobTracker uses the file block information (physical count and location) to decide how to create subordinate TaskTracker tasks. The MapReduce application is copied to every node where input file blocks are present, and a unique subordinate task is created for each file block on a given node. Each TaskTracker reports status and completion information back to the JobTracker. Figure 3 shows the distribution of work across a sample cluster.
Figure 3. Hadoop cluster with physical distribution of processing and storage
This feature of Hadoop is very important: rather than moving the storage to the location of the processing, Hadoop moves the processing to the storage. Because this scales with the number of nodes in the cluster, it supports efficient data processing.
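To tie this back to the developer's view, the following driver sketch shows how a client might configure and submit such a job, pointing it at input and output directories in HDFS. It assumes the TokenizerMapper and IntSumReducer classes sketched earlier, uses the newer Job API, and the paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map, combine, and reduce phases (the combiner is optional).
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output directories in HDFS; the paths are examples only.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}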
Other Hadoop applications
Hadoop is a versatile framework for developing distributed applications, and looking at problems from a different perspective is a good way to get the most out of it. Looking back at Figure 2, the process appears as a staged pipeline in which one component uses the results of another. Hadoop is certainly not a cure-all development tool, but if your problem fits this pattern, it is a good choice.
Hadoop has been used to solve a variety of problems, including sorting extremely large datasets and searching through large files. It is also at the core of various search engines, such as Amazon's A9 and the Able Grape vertical search engine for wine information. The Hadoop Wiki provides a long list of applications and companies that use Hadoop in various ways (see References).
Currently, Yahoo! has the largest Hadoop Linux production architecture, consisting of more than 10,000 cores with more than 5 PB of storage spread across its DataNodes. There are nearly one trillion links in its Web index. You may not need a system that large, though; if not, you can use Amazon Elastic Compute Cloud (EC2) to build a virtual cluster of 20 nodes. In fact, the New York Times used Hadoop and EC2 to convert 4 TB of TIFF images (including 405 K large TIFF images, 3.3 M SGML articles, and 405 K XML files) into 800 K PNG images suitable for Web use, all within 36 hours. This kind of processing, called cloud computing, is a unique way to demonstrate the power of Hadoop.
Conclusion
There is no doubt that Hadoop is becoming more and more powerful. Judging from the applications that use it, its future is bright. You can learn more about Hadoop and its applications in the References section, including suggestions for setting up your own Hadoop cluster.
References
Learning
- For more information, see the original article on the developerWorks global website.
- The Hadoop core web site is the best resource for learning about Hadoop. There you can find the latest documentation, a quick-start guide, tutorials, and detailed information on configuring clusters, as well as the detailed application programming interface (API) documentation for developing on the Hadoop framework.
- Hadoop DFS User Guide introduces HDFS and its related components.
- In early 2008, Yahoo! launched the largest Hadoop cluster at the time for its search engine. That Hadoop cluster consists of more than 10,000 cores and provides more than 5 PB (equivalent to 5,000,000 GB) of raw disk storage.
- "Hadoop: Funny Name, Powerful Software" (LinuxInsider, February November 2008) is an excellent article about Hadoop, including an interview with Doug Cutting, founder of Hadoop. This article also discussesNew York TimesCombined with Hadoop and Amazon EC2 for massive image conversion.
- Hadoop is well suited to cloud computing environments. For more information about cloud computing, see "Cloud computing on Linux" (developerWorks, September 2008).
- The Hadoop Wiki PoweredBy page shows a complete list of Hadoop applications. In addition to search engines, Hadoop can solve many other problems.
- "Running Hadoop on Ubuntu Linux (Multi-Node Cluster)" is a tutorial written by Michael Noll. It teaches you how to set up a Hadoop Cluster. This tutorial also mentions another earlier tutorial on how to set up a single node.
- In the developerWorks Linux zone, you can find more resources for Linux developers (including Linux beginners), and you can also read the most popular articles and tutorials.
- Read all Linux tips and Linux tutorials on developerWorks.
- Stay tuned to developerWorks technical events and webcasts.
Source: http://www.ibm.com/developerworks/cn/linux/l-hadoop/