Splitting a single piece of work across a collaborative cluster is the core idea behind the design of a distributed computing framework. The cluster's hardware becomes a pool of resources available to applications; users do not need to care how those resources are allocated, which maximizes the value of the hardware. Distributed computing follows the same idea: which machine executes a given computing task, and who aggregates the results afterwards, are decisions made by the framework's master. The user simply provides the content to be analyzed as input to the distributed computing system and receives the computed results.
What is Hadoop?
The two most central designs in the Hadoop framework are MapReduce and HDFS. MapReduce is "the decomposition of tasks and the aggregation of results". HDFS, the Hadoop Distributed File System, provides the underlying storage support for distributed computation.
Map "decomposes a task into multiple sub-tasks", and Reduce "aggregates the results of those decomposed sub-tasks to obtain the final analysis result".
Figure 1: MapReduce structure
The shuffle process ensures that the input to each reducer is sorted by key; the system performs this sorting itself, which improves the efficiency of the reduce phase and reduces the pressure of data transmission.
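The classic word-count job illustrates this decomposition. Below is a minimal sketch using Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, the shuffle groups and sorts those pairs by word, and the reducer sums the counts for each word. The class and method names other than the Hadoop API itself are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: decompose each input line into (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);           // emit (word, 1)
            }
        }
    }

    // Reduce: the shuffle has already grouped and sorted the pairs by word,
    // so each call receives every count for one word; sum them up.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));   // emit (word, total)
        }
    }
}
```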
HDFS is the storage cornerstone of distributed computing, and Hadoop's distributed file system shares many traits with other distributed file systems. Some basic features of a distributed file system:
- There is a single namespace for the entire cluster.
- Data consistency. HDFS follows a write-once, read-many model: a client cannot see a file until it has been successfully created.
- Files are divided into blocks, each of which is allocated to data nodes, and data safety is ensured by replicating file blocks according to the configuration.
Figure 2: HDFS structure
Figure 2 shows the three important roles in HDFS: the NameNode, the DataNode, and the client.
The NameNode can be regarded as the manager of the distributed file system. It is mainly responsible for managing the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system's metadata in memory, chiefly the file information, the blocks belonging to each file, and the DataNodes on which each block is stored.
The DataNode is the basic unit of file storage. It stores blocks in its local file system, keeps block metadata, and periodically reports all of its blocks to the NameNode.
The client is the application that needs to access files in the distributed file system. Three operations illustrate how these roles interact.
File write:
- The client sends a write request to the NameNode.
- Based on the file size and the block configuration, the NameNode returns to the client information about the DataNodes it manages for this write.
- The client divides the file into blocks and writes them in sequence to the DataNodes according to the returned DataNode addresses.
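As a rough illustration of this flow, the sketch below uses Hadoop's Java FileSystem client API to create and write a file; the client obtains the target DataNodes from the NameNode behind the scenes. The path and contents are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode here

        // create() asks the NameNode for target DataNodes; the returned stream
        // splits whatever we write into blocks and sends them to those DataNodes.
        Path path = new Path("/tmp/hello.txt");     // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```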
File read:
- The client sends a read request to the NameNode.
- The NameNode returns the DataNode information for the blocks of the file.
- The client reads the file data from those DataNodes.
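A matching read-side sketch, again with the FileSystem API: open() retrieves the block locations from the NameNode, and the data itself is streamed from the DataNodes. The path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode which DataNodes hold each block of the file;
        // the stream then reads the block data directly from those DataNodes.
        Path path = new Path("/tmp/hello.txt");     // hypothetical path
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // print file contents
        }
    }
}
```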
File block replication:
- The NameNode discovers that some file blocks do not meet the minimum replication factor, or that some DataNodes have failed.
- It notifies DataNodes to replicate the affected blocks.
- The DataNodes replicate the blocks directly between one another.
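Replication itself is driven by the NameNode, but a client can influence it by changing a file's target replication factor. A small sketch with the FileSystem API follows; the path and factor are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");     // hypothetical path

        // Raise the target replication factor for an existing file; the NameNode
        // sees that the file is under-replicated and schedules DataNode-to-DataNode copies.
        fs.setReplication(path, (short) 3);

        FileStatus status = fs.getFileStatus(path);
        System.out.println("replication factor: " + status.getReplication());
    }
}
```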
Finally, let's look at some design features of HDFS that are worth learning from for framework design:
- Block placement: by default, a block has three replicas: one on the DataNode specified by the NameNode, one on a DataNode in a different rack from the specified DataNode, and the last on a DataNode in the same rack as the specified DataNode. Replication exists purely for data safety; this placement balances protection against the failure of an entire rack with the performance cost of copying data between racks.
- Heartbeat detection of DataNode health: if a problem is found, the affected data is re-replicated to keep it safe.
- Data replication (triggered by DataNode failures, the need to balance DataNode storage utilization, the need to balance DataNode traffic, and so on): using the HDFS balancer command, you can configure a threshold to balance the disk utilization of the DataNodes. For example, with the threshold set to 10%, running the balancer first computes the average disk utilization across all DataNodes, then checks whether any DataNode's utilization exceeds that average by more than the threshold; if so, blocks are moved from that DataNode to DataNodes with low disk utilization. This is especially useful when new nodes join the cluster.
- Data checking: CRC32 is used for data checking. When a file block is written, checksum information is written alongside the data; when the block is read, the checksum is verified before the data is used (see the checksum sketch after this list).
- The NameNode is a single point of failure: in case it fails, its task-processing information is recorded in both the local file system and a remote file system.
- Pipelined data writes: when the client writes a file to the DataNodes, it first reads a block and writes it to the first DataNode, which then passes it on to the backup DataNode; only after every DataNode that needs the block has written it successfully does the client move on to the next block.
- Safe mode: when the distributed file system starts, it begins in safe mode, during which the contents of the file system may not be modified or deleted until safe mode ends. Safe mode exists so that the validity of the data blocks on each DataNode can be checked at startup, and blocks can be copied or deleted according to policy. Safe mode can also be entered at run time via a command. In practice, trying to modify or delete files right after the system starts will produce an error saying safe mode does not allow modification; you only need to wait a while.
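To illustrate the checksum idea mentioned in the data-checking item above (this is not HDFS's internal implementation, just the principle), here is a minimal sketch using Java's standard CRC32 class: compute a checksum when data is written, then recompute and compare it when the data is read.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a chunk of block data.
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "block data written to a DataNode".getBytes(StandardCharsets.UTF_8);

        // On write: store the checksum alongside the data.
        long stored = checksum(chunk);

        // On read: recompute and compare; a mismatch means the block is corrupt
        // and must be re-read from another replica.
        long recomputed = checksum(chunk);
        System.out.println(recomputed == stored ? "checksum OK" : "block corrupt");
    }
}
```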
In a Hadoop system there is a master, primarily responsible for the NameNode's work and the JobTracker's work.
The JobTracker's primary responsibility is to start, track, and schedule the task execution on each slave. There are also multiple slaves; each slave usually provides DataNode functionality and is responsible for the TaskTracker's work. The TaskTracker executes map tasks and reduce tasks on local data according to the application's requirements.
This touches on the most important design point in distributed computing: moving computation is cheaper than moving data. In distributed processing, the cost of moving data is always higher than the cost of moving computation. Simply put, to divide and conquer the work, the data must also be divided and stored; local tasks process local data and the results are then gathered, which is what keeps distributed computing efficient.
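Putting the pieces together, a job driver configures and submits the work; in Hadoop 1.x the JobTracker then schedules map tasks close to the DataNodes that hold the input splits. The sketch below uses the newer org.apache.hadoop.mapreduce API and assumes the WordCount classes from the earlier sketch; the input and output paths come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job submitted to the cluster's job scheduler
        job.setJarByClass(WordCountDriver.class);

        // Mapper and Reducer classes from the earlier WordCount sketch.
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits become map tasks, scheduled close to the nodes holding the data.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```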
Hadoop 1.x principle