Use Linux and Hadoop for distributed computing

Last Update:2014-12-22 Source: Internet

Author: User


People rely on search engines every day to find specific content in the vast amount of data on the Internet, but have you ever wondered how these searches are performed? One approach is Apache Hadoop, a software framework for the distributed processing of huge amounts of data. One application of Hadoop is indexing Internet Web pages in parallel. Hadoop is an Apache project backed by companies such as Yahoo!, Google, and IBM. This article introduces the Hadoop framework and shows why it is one of the most important Linux-based distributed computing frameworks.

Hadoop was officially introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a sub-project of Lucene. It was inspired by MapReduce and the Google File System, both first developed at Google Labs. In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were moved into a separate project called Hadoop.

Hadoop is best known as a tool for categorizing search keywords on the Internet, but it can also solve many other problems that demand great scalability. For example, what happens if you want to grep a huge 10TB file? On a traditional system this takes a long time. Hadoop was designed with such problems in mind, so it can handle them far more efficiently.

Prerequisites

Hadoop is a software framework that enables distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of the working data and can redistribute processing around failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop relies on commodity servers, so it costs less and is available to anyone.

As you may have guessed, Hadoop is well suited to running on Linux production platforms because the framework is written in Java. Hadoop applications can also be written in other languages, such as C++.

Hadoop architecture

Hadoop has many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in a Hadoop cluster. Above HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

HDFS

To external clients, HDFS looks like a traditional hierarchical file system. You can create, delete, move, or rename files, and more. However, the architecture of HDFS is built from a specific set of nodes (see Figure 1), which gives it its particular characteristics. These nodes are the NameNode (there is only one), which provides metadata services within HDFS, and the DataNodes, which provide storage blocks for HDFS. Because there is only one NameNode, it is a single point of failure, which is a weakness of HDFS.

Figure 1. Hadoop cluster simplified view


Files stored in HDFS are divided into blocks, and each block is replicated to multiple computers (DataNodes). This is very different from a traditional RAID architecture. The block size (usually 64MB) and the number of replicas of each block are determined by the client when the file is created. The NameNode controls all file operations. All communication within HDFS is based on the standard TCP/IP protocol.
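The block splitting and replication described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the DataNode names, and the round-robin placement are hypothetical and are not how HDFS actually chooses nodes.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size mentioned in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_replicas(blocks, datanodes, replication=3):
    """Map each block index to `replication` distinct DataNodes (round-robin, an assumption)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for i in range(len(blocks)):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = assign_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # 4: three full 64 MB blocks plus one 8 MB block
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

Losing any single DataNode in this sketch still leaves two replicas of every block, which is the property that sets this design apart from keeping all data on one RAID array.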

NameNode

The NameNode is a piece of software that typically runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode determines how files are mapped to replicated blocks on the DataNodes. In the most common case of three replicas, one replica is stored on a different node in the same rack and the last replica is stored on a node in a different rack. Note that this requires knowledge of the cluster architecture.
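A minimal sketch of this kind of rack-aware placement follows. It assumes the first replica stays on the node that writes the block, and it follows the description in this article rather than the exact policy of any Hadoop release; the function name and the topology layout are hypothetical.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of DataNode names in that rack.
    Replica 1: the writer's own node (an assumption of this sketch).
    Replica 2: a different node in the same rack.
    Replica 3: a node in a different rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    same_rack = [n for n in topology[local_rack] if n != writer]
    other_racks = [
        n
        for rack, nodes in sorted(topology.items())
        if rack != local_rack
        for n in nodes
    ]
    if not same_rack or not other_racks:
        raise ValueError("topology too small for rack-aware placement")
    return [writer, same_rack[0], other_racks[0]]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
print(replicas)  # ['dn1', 'dn2', 'dn3']
```

The function needs the rack-to-node mapping as input, which is why the NameNode must know the cluster architecture.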

Actual I/O transactions do not pass through the NameNode; only the metadata that maps files to DataNodes and blocks does. When an external client sends a request to create a file, the NameNode responds with the block ID and the IP address of the DataNode that will hold the first copy of the block. The NameNode also notifies the other DataNodes that will receive a copy of the block.

The NameNode stores all information about the file system namespace in a file called FsImage. This file and a log file containing all transactions (the EditLog) are stored on the NameNode's local file system. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system.

Note on metadata: in a data warehouse system, the metadata mechanism mainly supports the following five types of system management functions: (1) describing what data is in the data warehouse; (2) defining the data that enters the data warehouse and the data generated from it; (3) recording the schedule of data extraction tasks triggered by business events; (4) recording and checking data consistency requirements and their enforcement; and (5) measuring data quality.
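The relationship between the two files can be sketched as a snapshot plus a replayed log. The sketch below is a simplification: the real FsImage and EditLog are binary files, and the operation names used here ("create", "add_block", "rename", "delete") are made up for illustration.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op = edit[0]
    if op == "create":
        _, path = edit
        namespace[path] = []
    elif op == "add_block":
        _, path, block_id = edit
        namespace[path].append(block_id)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    elif op == "delete":
        _, path = edit
        del namespace[path]
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edit_log):
    """Rebuild the current namespace from a snapshot plus the logged transactions."""
    namespace = {path: list(block_ids) for path, block_ids in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_3"),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
    ("delete", "/logs/b.txt"),
]
namespace = load_namespace(fsimage, edit_log)
print(namespace)  # {'/logs/old.txt': ['blk_1', 'blk_2']}
```

Because the current state can only be rebuilt from both files together, losing either one loses the namespace, which is why the text stresses keeping copies.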

DataNode

A DataNode is also a piece of software that typically runs on a dedicated machine in an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized in racks, and each rack connects all of its systems through a single switch. One assumption Hadoop makes is that the transfer speed between nodes in the same rack is faster than the transfer speed between nodes in different racks.

DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on regular heartbeat messages from each DataNode. Each message contains a block report, which the NameNode uses to validate its block map and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.
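The heartbeat mechanism can be sketched as follows. The class name, the timeout value, and the use of plain numbers as timestamps are assumptions made for illustration; this is not the NameNode's actual implementation.

```python
class NameNodeMonitor:
    """Tracks DataNode heartbeats and finds blocks that need re-replication."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of its last heartbeat
        self.block_map = {}  # block ID -> set of DataNodes holding a replica

    def heartbeat(self, datanode, now, block_report):
        """Record a heartbeat and refresh the block map from its block report."""
        self.last_seen[datanode] = now
        for holders in self.block_map.values():
            holders.discard(datanode)  # drop stale entries for this node
        for block_id in block_report:
            self.block_map.setdefault(block_id, set()).add(datanode)

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

    def under_replicated(self, now, replication):
        """Map block ID -> number of replicas missing on live DataNodes."""
        dead = set(self.dead_nodes(now))
        missing = {}
        for block_id, holders in self.block_map.items():
            live = holders - dead
            if len(live) < replication:
                missing[block_id] = replication - len(live)
        return missing


monitor = NameNodeMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", 0, ["blk_1", "blk_2"])
monitor.heartbeat("dn2", 0, ["blk_1"])
monitor.heartbeat("dn3", 0, ["blk_2"])
monitor.heartbeat("dn1", 40, ["blk_1", "blk_2"])
monitor.heartbeat("dn3", 40, ["blk_2"])  # dn2 has gone silent

print(monitor.dead_nodes(45))           # ['dn2']
print(monitor.under_replicated(45, 2))  # {'blk_1': 1}
```

The result of `under_replicated` is the work list for the corrective action: each listed block needs that many new copies on healthy nodes.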

File operation

As you can see, HDFS is not a general-purpose file system. Its main purpose is to support streaming access to large files that are written once. When a client wants to write a file to HDFS, it first caches the file in local temporary storage. Once the cached data exceeds the HDFS block size, the request to create the file is sent to the NameNode. The NameNode responds to the client with the DataNode ID and the target block. It also informs the DataNodes that will hold replicas of the file's blocks. When the client begins sending the temporary file to the first DataNode, the block contents are immediately piped on to the replica DataNodes. The client is also responsible for creating a checksum file, which is saved in the same HDFS namespace. After the last file block is sent, the NameNode commits the file creation to its persistent metadata store (the EditLog and FsImage files).
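The client-side write path can be sketched as a buffer that flushes one block at a time. Everything here is a simplified assumption: DataNodes are stand-in dictionaries, the client copies each block to every replica itself instead of the DataNodes piping it onward, and CRC32 from Python's zlib module stands in for the checksum file.

```python
import zlib


class BufferedBlockWriter:
    """Buffers writes locally and ships data one block at a time."""

    def __init__(self, block_size, pipeline):
        self.block_size = block_size
        self.pipeline = pipeline  # list of dicts standing in for DataNodes
        self.buffer = bytearray()
        self.checksums = []       # one CRC32 value per block written
        self.next_block = 0

    def write(self, data):
        """Cache data locally; flush whenever a full block has accumulated."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.block_size:
            self._flush(self.block_size)

    def close(self):
        """Send whatever is left as a final, possibly short, block."""
        if self.buffer:
            self._flush(len(self.buffer))

    def _flush(self, length):
        block = bytes(self.buffer[:length])
        del self.buffer[:length]
        for datanode in self.pipeline:  # first DataNode, then its replicas
            datanode[self.next_block] = block
        self.checksums.append(zlib.crc32(block))
        self.next_block += 1


dn1, dn2 = {}, {}
writer = BufferedBlockWriter(block_size=4, pipeline=[dn1, dn2])
writer.write(b"abcdefghij")
writer.close()
print(dn1)  # {0: b'abcd', 1: b'efgh', 2: b'ij'}
```

A reader can later recompute `zlib.crc32` for each block and compare it with the stored value to detect corruption.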

Linux cluster

The Hadoop framework can be used on a single Linux platform (for development and debugging), but it shows its real strength on commodity servers housed in racks. These racks make up a Hadoop cluster. Hadoop uses knowledge of the cluster topology to decide how to distribute jobs and files across the cluster. Hadoop assumes that nodes may fail, so it natively handles the failure of a single computer or even an entire rack.

Hadoop application

One of the most common uses of Hadoop is Web search. Although this is not the only application of the framework, Hadoop performs outstandingly there as a parallel data processing engine. One of its most interesting aspects is the Map and Reduce process, which was inspired by work at Google. This process, called indexing, takes the text of Web pages retrieved by a Web crawler as input and reports the frequency of the words on those pages as output. The result can then be used throughout the Web search process to identify content that matches the given search parameters.

MapReduce

The simplest MapReduce application contains at least three parts: a Map function, a Reduce function, and a main function. The main function combines job control with file input/output. For this, Hadoop provides a large number of interfaces and abstract classes, giving Hadoop application developers many tools for debugging and performance measurement.

MapReduce itself is a software framework for processing large data sets in parallel. Its roots are the map and reduce functions of functional programming. It consists of two operations, each of which may have many instances (many Maps and many Reduces). The Map function takes a set of data and transforms it into a list of key/value pairs, one pair for each element of the input. The Reduce function takes the list generated by the Map function and collapses it by key, producing one key/value pair for each key.
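The functional-programming roots are easy to see with Python's built-in `map` and `functools.reduce`. This is only a sketch of the two primitives, not Hadoop code, and the numbers are made up for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply the same function to every element independently
squares = list(map(lambda x: x * x, numbers))

# reduce: fold the mapped values into a single result
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Each call to the mapped function depends only on its own element, which is what makes the Map step easy to spread across many machines.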

Here's an example to help you understand it. Suppose the input is the sentence "one small step for man, one giant leap for mankind". Running the Map function on this input produces the following list of key/value pairs:

(one, 1) (small, 1) (step, 1) (for, 1) (man, 1)
(one, 1) (giant, 1) (leap, 1) (for, 1) (mankind, 1)

If you apply the Reduce function to this list of key/value pairs, you get the following set of key/value pairs:

(one, 2) (small, 1) (step, 1) (for, 2) (man, 1)
(giant, 1) (leap, 1) (mankind, 1)
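The word-count example can be reproduced with two small Python functions. This is a sketch of the idea only; the function names are hypothetical and it does not use the Hadoop API.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit one (word, 1) pair per word, ignoring case and punctuation."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [(w, 1) for w in words if w]


def reduce_counts(pairs):
    """Reduce: sum the values of all pairs that share a key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


pairs = map_words("one small step for man, one giant leap for mankind")
counts = reduce_counts(pairs)

print(pairs[:3])      # [('one', 1), ('small', 1), ('step', 1)]
print(counts["one"])  # 2
print(counts["for"])  # 2
```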

The result is a count of each word in the input, which is obviously useful for building an index. Now suppose you have two inputs: the first is "one small step for man," and the second is "one giant leap for mankind". You can run the Map and Reduce functions on each input separately, and then feed the two resulting lists of key/value pairs to another Reduce function to get the same result as before. In other words, you can process the inputs in parallel and get the same result, only faster. This is the power of MapReduce: its parallelism can be spread across any number of systems. Figure 2 illustrates this idea in the form of segmentation and iteration.
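The claim that splitting the input does not change the result can be checked with a short sketch. The two worker threads here only stand in for separate machines (they are not a real cluster and give no real speedup in Python), and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_reduce_split(text):
    """Run Map and then Reduce on one split, returning partial word counts."""
    pairs = [(word.strip(".,").lower(), 1) for word in text.split()]
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return partial


def merge(partials):
    """Final Reduce: combine the partial counts by key."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


splits = ["one small step for man,", "one giant leap for mankind"]

# Each split is processed independently, as it would be on separate nodes.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_reduce_split, splits))

parallel_result = merge(partials)
serial_result = map_reduce_split(" ".join(splits))

print(parallel_result == serial_result)  # True
```

The merge step works because addition is associative and commutative, so the partial counts can be combined in any order.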

Figure 2. The conceptual flow of the MapReduce process

Now back to Hadoop: how does it implement this functionality? A MapReduce application is started on behalf of a client on a single master system called the JobTracker. Like the NameNode, it is the only system in the Hadoop cluster that controls MapReduce applications. When an application is submitted, the input and output directories in HDFS are provided. The JobTracker uses file block information (physical quantity and location) to decide how to create subordinate TaskTracker tasks. The MapReduce application is copied to every node where input file blocks reside, and a unique subordinate task is created for each file block on a given node. Each TaskTracker reports status and completion information back to the JobTracker. Figure 3 shows the distribution of work in an example cluster.
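The way block locations drive task placement can be sketched as a small scheduler. This is a hypothetical simplification, not the JobTracker's real algorithm: it creates one task per block, prefers a node that already stores the block, and falls back to the least loaded node when every local node is busy.

```python
def schedule_tasks(block_locations, slots_per_node):
    """Assign one task per block, preferring nodes that store that block.

    block_locations maps a block ID to the list of DataNodes holding a replica.
    Returns a dict: block ID -> (node, is_local).
    """
    load = {}
    for nodes in block_locations.values():
        for node in nodes:
            load.setdefault(node, 0)

    assignments = {}
    for block_id, nodes in sorted(block_locations.items()):
        candidates = [n for n in nodes if load[n] < slots_per_node]
        if candidates:
            # Run the task where the data already is.
            node = min(candidates, key=lambda n: load[n])
            is_local = True
        else:
            # No free local slot: use the least loaded node and read remotely.
            node = min(load, key=lambda n: load[n])
            is_local = False
        load[node] += 1
        assignments[block_id] = (node, is_local)
    return assignments


block_locations = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn1", "dn3"],
    "blk_3": ["dn1"],
}
assignments = schedule_tasks(block_locations, slots_per_node=1)
print(assignments["blk_1"])  # ('dn1', True)
print(assignments["blk_2"])  # ('dn3', True)
print(assignments["blk_3"])  # ('dn2', False)
```

In this run two of the three tasks read their block from local disk, and only the third has to fetch its data over the network.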

Figure 3. Hadoop cluster showing the physical distribution of processing and storage

This feature of Hadoop is important because, rather than moving storage to a location for processing, it moves processing to the storage. This supports efficient data processing, because processing capacity scales with the number of nodes in the cluster.

Other Hadoop applications

Hadoop is a versatile framework for developing distributed applications, and looking at a problem from a different perspective is a good way to get the most out of it. Looking back at Figure 2, the process takes the form of a staircase function in which one component uses the result of another. Hadoop is certainly not a panacea, but it is a useful tool when a problem fits this model.

Hadoop has been used to solve a variety of problems, including sorting very large data sets and searching large files. It is also at the heart of various search engines, such as Amazon's A9 and the Able Grape vertical search engine for finding wine information. The Hadoop Wiki lists a large number of applications and companies that use Hadoop in a variety of ways.

Currently, Yahoo! has the largest Hadoop Linux production architecture, made up of more than 10,000 cores with more than 5PB of storage distributed across its DataNodes. Its Web index contains almost a trillion links. You probably do not need a system that large, and in that case you can build a 20-node virtual cluster using Amazon Elastic Compute Cloud (EC2). In fact, the New York Times used Hadoop and EC2 to convert 4TB of TIFF images (including 405K large TIFF images, 3.3M SGML articles, and 405K XML files) into 800K PNG images for use on the Web, all in 36 hours. This kind of processing, called cloud computing, is a unique way of demonstrating the power of Hadoop.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
