The core idea of Hadoop

Hadoop includes two core components: a distributed storage system and a distributed computing system.

1.1.1.1. Distributed storage

Why does data need to be stored in a distributed system? Can't a single computer store it, given that today's hard drives hold many terabytes? Actually, it cannot. Telecom call records, for example, are stored on many hard drives across many servers. To handle that much data, reading and writing everything through a single server would be far too cumbersome.

We want a file system that can manage many servers for storing data. When we store data through this file system, we should not notice that it is being placed on different servers; when we read data, we should not notice that it is coming from different servers, as shown in Figure 2-1. This is a distributed file system.

Figure 2-1

A distributed file system manages a server cluster. Data is stored on the cluster's nodes (that is, the servers in the cluster), but the file system masks the differences between servers. We can therefore use it just like an ordinary file system, even though the data is scattered across different servers.
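To make this concrete, here is a minimal sketch of reading a file through Hadoop's FileSystem client API, which presents the whole cluster as one ordinary file system. The cluster address and file path below are hypothetical placeholders, not values from this article.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromDfs {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster; the address is a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            // FileSystem hides which servers actually hold the data.
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/records.txt"); // hypothetical file

            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // reads as if from a local file
                }
            }
        }
    }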

In a distributed storage system, data scattered across different nodes may belong to the same file. To organize large numbers of files, files can be placed in folders, and folders can be nested level by level. We call this form of organization a namespace. The namespace manages all the files in the entire server cluster.

Clearly, the responsibility of maintaining the namespace is different from the responsibility of storing the real data, so different nodes in the cluster take on different duties. The node responsible for the namespace is called the master node, and the nodes responsible for storing the real data are called slave nodes; this arrangement, with the master managing the file system's directory structure and the slaves storing the real data, is called a master-slave structure (master-slaves). A user first deals with the master node to find out which slave nodes hold the desired data, and then reads the data from those slave nodes, as shown in Figure 2-2.

On the master node, the entire namespace is kept in memory to speed up user access, so the more files are stored, the more memory the master node requires. On the slave nodes, the original data files may be large or small, and files of uneven size are not easy to manage, so storage can be abstracted into a separate unit called a block. Finally, data stored in the cluster may become inaccessible because of network problems or server hardware failures, so it is best to have a replication mechanism that copies the data to multiple servers; this keeps the data safe and makes the probability of data loss or access failure small.

Figure 2-2

In this master-slave structure, the master node is very important because it holds the directory structure of the entire file system. In addition, because the master node keeps the namespace in memory, the more files are stored, the more memory the master node needs.

In Hadoop, the distributed storage system is called HDFS (Hadoop Distributed File System). Its master node is called the name node (namenode), and its slave nodes are called data nodes (datanodes).
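A short sketch of blocks and replication in practice: using the standard HDFS client API, we can ask the namenode which datanodes hold each block of a file, and request that three copies of the file be kept. The cluster address and file path are again hypothetical placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InspectBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/records.txt"); // hypothetical file

            // The namenode answers metadata queries: which datanodes hold each block.
            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }

            // Ask HDFS to keep three copies of each block of this file.
            fs.setReplication(path, (short) 3);
        }
    }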

1.1.1.2. Distributed computing

When processing data, we normally read it into memory first. Suppose we are dealing with massive data, say a 100GB file, and we want to count how many words it contains. Pulling all of that data to one machine for processing is called moving the data, and it is almost impractical here. Even if a server had 100GB of memory, such a machine would be very expensive, beyond what ordinary users can afford; and even if the data could fit, loading 100GB into memory would take a long time. All of these problems frustrate our handling of big data. In other words, moving the data is not suitable for big data computation.

Since moving the data is unsuitable, can we instead put the program code on the servers where the data is stored? Program code is usually small, almost negligible compared to the raw data, so this saves the time that would otherwise be spent transferring data. Because the data lives in a distributed file system, the 100GB may be spread over a number of servers, and the program code can be distributed to those servers and executed concurrently on each of them; this is parallel, distributed computing, and it greatly shortens the program's execution time. Moving the program code to the machines where the data resides is called moving the computation.

Distributed computing still needs a final result. Because the program code runs in parallel on many machines, it produces many partial results, so another piece of code is needed to summarize the intermediate results. Distributed computing in Hadoop is therefore typically done in two phases. The first phase reads and preliminarily processes the raw data on each data node, counting the words held on that node. These partial results are then passed to the second phase, which aggregates the intermediate results into the final answer: the total number of words in the 100GB file, as illustrated in Figure 2-3.

Figure 2-3
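The word counting described above corresponds to Hadoop's classic WordCount program. Below is a minimal sketch of the two phases' code using the standard MapReduce API: the first-phase code emits a (word, 1) pair for each word in the data on its node, and the second-phase code sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Phase one: runs on the nodes holding the data and produces partial counts.
    class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Phase two: summarizes the intermediate results into final totals.
    class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // total for this word
        }
    }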

Distributed computing itself needs management: which data nodes the program code is distributed to, which nodes run the first-phase code and which run the second-phase code, how the first phase's output is transferred to the nodes running the second phase, what happens if execution fails partway through, and so on. The node that runs this administrative code is called the master node, and the nodes that run the phase-one and phase-two program code are called slave nodes. Users submit their code to the master node, which is responsible for assigning it to the different slave nodes for execution.

In Hadoop, the distributed computing part is called MapReduce. Its master node is called the job node (jobtracker), and its slave nodes are called task nodes (tasktrackers). On a task node, the code that runs the first phase is called a map task, and the code that runs the second phase is called a reduce task.
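Finally, a minimal driver sketch showing how user code is handed to the master node: it wires the map and reduce classes above into a job and submits it. The input and output paths are hypothetical placeholders, and the org.apache.hadoop.mapreduce API shown here is the standard client API from the jobtracker/tasktracker era.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordMapper.class);   // phase one
            job.setReducerClass(WordReducer.class); // phase two
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical HDFS paths for the 100GB input and the result.
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            // Submitting hands the job to the master node, which schedules
            // map and reduce tasks on the slave nodes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }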
