As networks, science, and technology have developed, we generate more and more data. How much data? 10,000 GB? Far more than that. 1,000 TB? More than that, too. The data has long outgrown what a single server can hold. So why not just use several servers? Because writing data to, and reading it back from, individual servers by hand is cumbersome.
This is where the distributed file system comes in. It can marshal many servers to store data, and through this file system, storing data feels like working with a single server. A distributed file system manages a server cluster. In this cluster, data is stored on the cluster's nodes (that is, the servers in the cluster), but the file system hides the differences between the servers. Although the data is spread over different servers, data on different nodes may belong to the same file. To organize large numbers of files, files are placed into folders, and folders can nest inside one another level by level. This form of organization is called a namespace. The namespace manages all the files in the cluster.

The responsibility of managing the namespace is separate from the responsibility of actually storing the data. The node responsible for the namespace is called the master node, and the nodes responsible for storing the real data are called slave nodes. The master node manages the file system's structure and the slave nodes store the real data; this is called a master-slave architecture (master-slave). A user's operations go to the master node first, to ask which slave nodes hold the data, and only then is the data read from those slave nodes. To speed up user access, the master node keeps the entire namespace in memory, so the more files are stored, the more memory the master node requires.

On the slave nodes, the raw files being stored vary widely in size, some huge, some tiny, which makes them hard to manage directly. So a separate unit of storage is abstracted, called a block. Finally, data stored across a cluster can become unreachable because of network problems or server hardware failures, so a replica mechanism (replication) backs up each block to multiple servers, making the data much safer.
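To make this concrete, here is a minimal sketch, not from the original article, of how block size and replication factor appear to a Hadoop client: both are ordinary configuration properties. The property keys below are the Hadoop 1.x-era names, and the values are the classic defaults (64 MB blocks, 3 replicas), used purely as an illustration.

import org.apache.hadoop.conf.Configuration;

public class BlockConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How large each storage unit (block) is; files are split into
        // blocks of this size before being spread across slave nodes.
        conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB, the classic default
        // How many copies of each block the replica mechanism keeps.
        conf.setInt("dfs.replication", 3);
        System.out.println("block size = " + conf.get("dfs.block.size")
                + ", replicas = " + conf.get("dfs.replication"));
    }
}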
In Hadoop, the distributed storage system is called HDFS (Hadoop Distributed File System). Its master node is the name node (namenode), and its slave nodes are called data nodes (datanodes).
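The division of labor between the namenode and the datanodes is visible even in a client program. The following is a minimal, hedged sketch using the Hadoop FileSystem API; the address hdfs://master:9000 and the file path are placeholders of my own, not values from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client only needs the namenode's address (placeholder below);
        // the namenode answers metadata queries from its in-memory namespace,
        // while the datanodes serve the actual bytes.
        conf.set("fs.defaultFS", "hdfs://master:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write: the namenode chooses datanodes, and the client streams blocks to them.
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.writeUTF("hello hdfs");
        }

        // Read: the client first asks the namenode which datanodes hold the
        // blocks, then reads the data from those datanodes.
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}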
When processing data, we read it into memory. But suppose the data is massive, say a 100 GB file in which we want to count how many words appear. Loading such a file into memory is next to impossible. Even as technology advances, a server with 100 GB of RAM is very expensive, and even if the data could fit, just loading 100 GB would take a long time. That is the problem we face. So how do we deal with it?
Can we instead put the program code on the servers where the data is stored? The code is tiny, almost negligible compared to the raw data, so this saves the time of transferring the data itself. Since the data lives in a distributed file system, 100 GB of it may be spread over many servers, and the code can be distributed to all of those servers and run on them at the same time, that is, in parallel, which greatly shortens the program's execution time. But distributed computing must produce a final result, while the code produces many partial results on many servers, so another piece of code is needed to aggregate those intermediate results. Distributed computing in Hadoop is therefore generally split into two stages: the first stage reads the raw data on each data node and does the preliminary processing, counting the words in that node's data; the results are then handed to the second stage, which aggregates the intermediate results into the final answer.
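The word-count example maps directly onto these two stages. Below is a minimal sketch in Java (the class and variable names are mine, not the article's): the first stage emits a (word, 1) pair for every word in the data on its node, and the second stage sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Stage one: runs on the nodes that hold the input blocks and does the
    // preliminary processing, emitting (word, 1) for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Stage two: receives all intermediate counts for one word and
    // aggregates them into the final result.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}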
In distributed computing there is a lot to manage: which data nodes the code should be placed on, which nodes run the first-stage code and which run the second-stage code, how the first stage's output is transferred to the nodes running the second-stage code, and what to do if an intermediate step fails. The node that runs this management code is called the master node, and the nodes that run the stage-one and stage-two program code are called slave nodes. Users submit their code to the master node, which is responsible for distributing it to the different slave nodes for execution.
In Hadoop, the distributed computing part is called MapReduce. Its master node is called the job tracker (jobtracker), and its slave nodes are called task trackers (tasktrackers). On a task tracker, the code that runs the first stage is called a map task, and the code that runs the second stage is called a reduce task.
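Submitting such a job to the master node can be sketched as follows. This is an assumption-laden illustration: it uses the newer org.apache.hadoop.mapreduce client API rather than the jobtracker-era org.apache.hadoop.mapred one, it reuses the hypothetical WordCount classes from the earlier sketch, and the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // stage one: map tasks
        job.setReducerClass(WordCount.SumReducer.class);       // stage two: reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submitting hands the job to the master node, which assigns the
        // map and reduce tasks to the slave nodes and tracks failures.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}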
Original link: http://my.oschina.net/u/1464779/blog/285801