Distributed computing with Linux and Hadoop


Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a subproject of Lucene. It was inspired by MapReduce and the Google File System (GFS), both originally developed at Google. In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were brought together in a project called Hadoop.

Hadoop is best known as the tool for categorizing search keywords on the Internet, but it can also solve many problems that demand extreme scalability. For example, what would happen if you had to grep a 10TB file? On a traditional system, this would take a very long time. Hadoop was designed with such problems in mind and handles them far more efficiently.


Hadoop is a software framework for the distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that compute and storage elements will fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. Hadoop is efficient because it works in parallel, speeding up processing through parallelism. Hadoop is also scalable enough to handle petabytes of data. In addition, because Hadoop runs on commodity servers, its cost is low and it can be used by anyone.

As you might expect, Hadoop is ideal for running on a Linux production platform, and its framework is written in the Java™ language. Applications on Hadoop can also be written in other languages, such as C++.




Hadoop Architecture

Hadoop is composed of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all of the storage nodes in a Hadoop cluster. Above HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.




HDFS

To external clients, HDFS looks like a traditional hierarchical file system: you can create, delete, move, and rename files, and so on. But the architecture of HDFS is built on a specific set of nodes (see Figure 1), a consequence of its design. These nodes include a single NameNode, which provides metadata services within HDFS, and DataNodes, which provide storage blocks to HDFS. Because there is only one NameNode, this is a weakness of HDFS (a single point of failure).




Figure 1. A simplified view of the Hadoop cluster





Files stored in HDFS are divided into blocks, and those blocks are then replicated across multiple computers (DataNodes). This is very different from a traditional RAID architecture. The block size (typically 64MB) and the number of replicas are determined by the client when the file is created. The NameNode controls all file operations, and all communication within HDFS is based on the standard TCP/IP protocol.
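To make this concrete, the following is a minimal sketch of client-side file operations using the Hadoop Java FileSystem API. It assumes the Hadoop client libraries are on the classpath and that fs.defaultFS in the configuration points at the cluster's NameNode; the /demo paths are invented purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Namespace operations look like a traditional file system.
        Path dir = new Path("/demo");
        fs.mkdirs(dir);

        // The client chooses the replication factor and block size when the
        // file is created (here: 3 replicas, 64MB blocks).
        Path file = new Path("/demo/sample.txt");
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeBytes("one small step for man, one giant leap for mankind\n");
        }

        // Rename and delete, as in any hierarchical file system.
        fs.rename(file, new Path("/demo/renamed.txt"));
        fs.delete(new Path("/demo/renamed.txt"), false);
        fs.close();
    }
}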

NameNode

The NameNode is software that typically runs on a dedicated machine in an HDFS instance. It is responsible for managing the file system namespace and controlling access by external clients. The NameNode determines how files are mapped to replicated blocks on the DataNodes. For the common replication factor of 3, the first replicas are stored on different nodes within the same rack, and the last replica is stored on a node in a different rack. Note that this requires knowledge of the cluster architecture.

The actual I/O transactions do not go through the NameNode; only the metadata that maps DataNodes to file blocks passes through it. When an external client sends a request to create a file, the NameNode responds with the block ID and the IP address of the DataNode for the first copy of that block. The NameNode also informs the other DataNodes that will receive copies of the block.

The NameNode stores all information about the file system namespace in a file called FsImage. This file, together with a log file that records all transactions (the EditLog), is stored on the NameNode's local file system. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system.

DataNode

The DataNode is also software that typically runs on a dedicated machine in an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, with a single switch connecting all of the systems in a rack. One assumption of Hadoop is that transfer speeds between nodes within a rack are faster than transfer speeds between racks.

DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each message contains a block report, which the NameNode uses to validate the block mapping and other file system metadata. If a DataNode fails to send its heartbeat, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.

File actions

As you can see, HDFS is not a general-purpose file system. Its primary purpose is to support streaming access to large files. When a client wants to write a file to HDFS, it first caches the file in local temporary storage. Once the cached data exceeds the required HDFS block size, a file-creation request is sent to the NameNode. The NameNode responds to the client with a DataNode identity and the target block, and also notifies the DataNodes that will hold copies of the file block. When the client starts sending the temporary file to the first DataNode, the block contents are immediately forwarded to the replica DataNodes in a pipeline. The client is also responsible for creating a checksum file, which is saved in the same HDFS namespace. After the last block of the file is sent, the NameNode commits the file to its persistent metadata store (in the EditLog and FsImage files).
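From the client's point of view, this is ordinary stream I/O; the local buffering, the pipeline to the replica DataNodes, and the checksums described above are handled inside the HDFS client library. A minimal sketch, under the same assumptions as the earlier FileSystem example:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreaming {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/stream.txt");

        // Streaming write: blocks are pipelined to the replica DataNodes
        // behind this output stream.
        try (FSDataOutputStream out = fs.create(path, true)) {
            for (int i = 0; i < 1000; i++) {
                out.writeBytes("line " + i + "\n");
            }
        }

        // Streaming read back from whichever DataNodes hold the blocks.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}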

Linux Cluster

The Hadoop framework can run on a single Linux machine (for development and debugging), but its real power comes from racks of commodity servers. These racks form a Hadoop cluster, and Hadoop uses knowledge of the cluster topology to decide how jobs and files are distributed across it. Hadoop assumes that nodes may fail, so it natively handles the failure of individual machines and even entire racks.




Hadoop applications

One of the most common uses of Hadoop is Web search. Although it is far from the only application of the framework, Hadoop stands out as a parallel data-processing engine. One of the most interesting aspects of Hadoop is its Map and Reduce process, which was inspired by work at Google. This process, called index creation, takes the text of Web pages retrieved by a Web crawler as input and reports the frequency of the words on those pages as its result. That result can then be used throughout a Web search to identify content matching the given search parameters.

MapReduce

The simplest MapReduce application consists of at least three parts: a map function, a reduce function, and a main function. The main function combines job control with file input/output. For these purposes, Hadoop provides a large number of interfaces and abstract classes, giving Hadoop application developers many tools for debugging and performance measurement.
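As a sketch of what such a main function typically looks like, here is a minimal driver written against the Hadoop Java mapreduce API (a Hadoop 2.x-style API is assumed; WordCountMapper and WordCountReducer are the map and reduce classes sketched in the word-count example later in this article):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the map and reduce implementations (sketched later).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types emitted by the reduce step.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Job control plus file input/output: both directories live in HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}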

MapReduce itself is a software framework for the parallel processing of large data sets. Its roots lie in the map and reduce functions of functional programming. A MapReduce application consists of these two operations, potentially with many instances of each (many maps and many reduces). The map function takes a set of data and converts it into a list of key/value pairs, one pair for each element of the input domain. The reduce function takes the list produced by the map function and reduces it according to the keys, producing one key/value pair per key.

An example will help make this concrete. Suppose the input domain is the phrase "one small step for man, one giant leap for mankind". Running the map function on this domain produces the following list of key/value pairs:

(one, 1) (small, 1) (step, 1) (for, 1) (man, 1) (one, 1) (giant, 1) (leap, 1) (for, 1) (mankind, 1)


If you then apply the reduce function to this list of key/value pairs, you get the following set of key/value pairs:

(one, 2) (small, 1) (step, 1) (for, 2) (man, 1) (giant, 1) (leap, 1) (mankind, 1)
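A word-count map and reduce pair that produces pairs like those above can be sketched with the Hadoop Java API as follows (again assuming a Hadoop 2.x-style mapreduce API; these classes plug into the driver sketched earlier):

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: emit (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken().toLowerCase());
            context.write(word, ONE);
        }
    }
}

// WordCountReducer.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: sum the 1s for each word, yielding (word, count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}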


The result is a count of the words in the input domain, which is clearly useful for building an index. Now suppose there are two input domains: the first is "one small step for man" and the second is "one giant leap for mankind". You can run the map and reduce functions on each domain separately and then apply a further reduce to the two lists of key/value pairs; the result is identical to before. In other words, the same operations can be applied to the input data in parallel, and the result is the same, only faster. That is the power of MapReduce: its parallelism can be exploited on any number of systems. Figure 2 illustrates this idea in the form of segmentation and iteration.
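To see why this works, here is a small plain-Java sketch (no Hadoop involved; the word lists are just the two domains above) showing that counting each domain separately and then merging the partial counts gives the same answer as counting everything at once:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reducing partial results and then merging them is equivalent to one big reduce.
public class ParallelReduceDemo {
    // map + reduce over one domain: count each word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // "reduce of reduces": merge two partial count maps by key.
    static Map<String, Integer> mergeCounts(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        b.forEach((k, v) -> merged.merge(k, v, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        List<String> domain1 = Arrays.asList("one", "small", "step", "for", "man");
        List<String> domain2 = Arrays.asList("one", "giant", "leap", "for", "mankind");

        Map<String, Integer> parallel = mergeCounts(countWords(domain1), countWords(domain2));
        System.out.println(parallel);  // e.g. {one=2, for=2, small=1, ...}
    }
}

In Hadoop, this "reduce of partial results" corresponds to the optional combiner (set with job.setCombinerClass in the driver): because summing counts is associative, the same reducer class can be run locally on each map's output before data is shuffled across the network.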




Figure 2. Conceptual flow of MapReduce




Now back to Hadoop: how does it implement this functionality? A MapReduce application is started on behalf of a client on a single master system known as the JobTracker. Similar to the NameNode, it is the only system in a Hadoop cluster responsible for controlling MapReduce applications. When an application is submitted, the input and output directories, which live in HDFS, are provided. The JobTracker uses the file-block information (physical quantity and location) to decide how to create subordinate TaskTracker tasks. The MapReduce application is copied to every node where input file blocks are present, and a unique subordinate task is created for each file block on a given node. Each TaskTracker reports status and completion information back to the JobTracker. Figure 3 shows the distribution of work across an example cluster.




Figure 3. Hadoop cluster showing the physical distribution of processing and storage





This feature of Hadoop is important because it does not move the data to where the processing happens; it moves the processing to the data. Because processing scales with the number of nodes in the cluster, data can be handled efficiently.




Other applications of Hadoop

Hadoop is a versatile framework for developing distributed applications, and looking at a problem from a different angle is a good way to get the most out of it. Looking back at Figure 2, the process appears as a stepped function in which one component uses the result of another. It is certainly not a one-size-fits-all development tool, but if your problem fits this shape, Hadoop is a good choice.

Hadoop has been used to help solve a variety of problems, including sorting huge data sets and searching through enormous files. It is also at the heart of several search engines, such as Amazon's A9 and the Able Grape vertical search engine for finding wine information. The Hadoop Wiki provides a long list of applications and companies that use Hadoop in a variety of ways (see Resources).

Currently, Yahoo! runs the largest Hadoop production architecture on Linux, comprising more than 10,000 cores with more than 5PB of storage spread across its DataNodes. Their Web index contains close to one trillion links. You probably don't need a system that large, though; if not, you can use Amazon Elastic Compute Cloud (EC2) to build a virtual cluster of 20 nodes. In fact, the New York Times used Hadoop and EC2 to convert 4TB of TIFF images - including 405K large TIFF images, 3.3M SGML articles, and 405K XML files - into PNG images suitable for use on the Web, all within 36 hours. This kind of processing, known as cloud computing, is a unique way to show off the power of Hadoop.

There is no doubt that Hadoop is becoming more and more powerful. Judging from the applications that use it, its future is bright.
