Introduction to the Open-Source Distributed Computing Framework Hadoop (Part I)
During the design of the SIP project, we initially considered a multi-threaded, task-decomposition approach to analyzing and aggregating its large logs (see my earlier article, "Tiger Concurrency Practice: Parallel Log Analysis Design and Implementation"). For the moment, though, the statistics are still very simple, so memcache serves as a counter and, together with MySQL, handles access control and statistics. In the future, however, we still need to prepare for large-scale log analysis. "Cloud computing" is the most popular technical buzzword today, and as open APIs spread, the data behind Internet applications becomes more and more valuable. Analyzing that data and mining the value inside it requires distributed computing to support analysis at such volume.
Looking back, the earlier multi-threaded, multi-task log-analysis design was really just a single-machine version of distributed computing. How to split that single-machine job across a cooperating cluster is precisely what a distributed computing framework has to address. At last year's BEA conference, BEA and VMware demonstrated building clusters out of virtual machines, hoping that hardware, like application programs, could free the user from worrying about resource allocation and so squeeze the maximum value out of hardware resources. Distributed computing works the same way: which machine executes a given computing task, and who aggregates the results afterward, is decided by the master of the distributed framework; users simply hand the content to be analyzed to the distributed computing system as input and get the computed results back.
Hadoop is an open-source distributed computing framework from the Apache open-source organization. It is already used at many large websites, such as Amazon, Facebook, and Yahoo. For me, the immediate use case is log analysis on the service integration platform: its log volume will be large, which fits the typical scenarios for distributed computing exactly (log analysis and index building are two major application scenarios).
I have not yet officially decided to adopt it, so I am still exploring on my own. What I write below is a beginner's learning process, and mistakes are inevitable; I just want to record it and share it with like-minded friends.
What is Hadoop?
Before doing anything, the first step is to know What, then Why, and finally How. After years of project development, however, many developers are used to How first, then What, and finally Why. That only makes them impatient, and it also leads to technology being misapplied in scenarios it does not suit.
The two core designs in the Hadoop framework are MapReduce and HDFS. The idea behind MapReduce, made widely known by a Google paper, can be summed up in one sentence: "decompose the task and merge the results." HDFS is short for Hadoop Distributed File System; it provides the underlying storage support for distributed computing.
The name MapReduce already hints at the idea. Its two verbs, Map and Reduce: "Map" breaks a job into many tasks, and "Reduce" merges the results of those decomposed tasks into the final analysis result. This is not a new idea; you can find its shadow in the multi-threaded, multi-task design mentioned above. Whether in the real world or in programming, a job can usually be split into multiple tasks, and the relationships between tasks come in two kinds: tasks that are unrelated to each other and can run in parallel, and tasks that depend on one another, whose order cannot be swapped and which therefore cannot be parallelized. Back in college, when a professor taught us critical-path analysis, the point was simply to find the decomposition and execution order that saves the most time. In a distributed system, a cluster of machines can be seen as a pool of hardware resources: parallel tasks are split up and handed to whichever machine is idle, which greatly improves computing efficiency, and the independence of those resources is undoubtedly the best design guarantee for growing the computing cluster. (I have always thought Hadoop's cartoon mascot should not be a little elephant but an ant: distributed computing is ants eating an elephant, and a group of cheap machines can rival any high-performance computer; the curve of vertical scaling always loses to the diagonal of horizontal scaling.) Once the decomposed tasks have been processed, the results must be merged back together, and that is Reduce's job.
Figure 1: MapReduce Structure
Figure 1 shows the rough structure of MapReduce. Before Map, the input data may be split to ensure the tasks run in parallel efficiently; after Map, a Shuffle step mixes and redistributes the intermediate output to improve Reduce's efficiency and ease the pressure of data transfer. The details of each part will be covered later.
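To make Map and Reduce concrete, here is a minimal word-count sketch written against Hadoop's classic org.apache.hadoop.mapred API. The class names are my own illustration, not code from the SIP project:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: decompose the work -- turn each line of input into (word, 1) pairs.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}

// Reduce: summarize the results -- sum the counts collected for each word.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```

The Shuffle step sits between these two classes: it groups every (word, 1) pair emitted by the mappers by key, so that each reducer receives one word together with all of its counts.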
HDFS is the storage cornerstone of distributed computing. Hadoop's distributed file system shares many characteristics with other distributed file systems. The basic features of a distributed file system:
- A single namespace is available for the entire cluster.
- Data consistency: a write-once, read-many model; a client cannot see a file before it has been successfully created.
- Files are divided into blocks, each block is distributed to and stored on data nodes, and blocks are replicated according to the configuration to guarantee data safety.
Figure 2: HDFS Structure
Figure 2 shows the three important roles in HDFS: the NameNode, the DataNode, and the Client. The NameNode can be seen as the manager of the distributed file system, mainly responsible for the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system's metadata in memory; this mainly covers file information, the blocks that make up each file, and where each block lives on the DataNodes. The DataNode is the basic unit of file storage: it stores blocks in its local file system, keeps the blocks' metadata, and periodically reports all the blocks it holds to the NameNode. The Client is the application that needs to access files in the distributed file system. Three operations illustrate how they interact (a minimal client-side sketch follows the three operations below).
File writing:
- The client sends a file write request to the NameNode.
- Based on the file size and the block configuration, the NameNode returns to the client information about the DataNodes it manages.
- The client divides the file into blocks and, following the DataNode address information, writes them to each DataNode in order.
File Reading:
- The client sends a file read request to the NameNode.
- The NameNode returns information about the DataNodes that store the file's blocks.
- The client reads the file data from those DataNodes.
File block replication:
- The NameNode finds that some file's blocks fall short of the minimum replication factor, or that some DataNodes have failed.
- It notifies the DataNodes to replicate blocks to one another.
- The DataNodes replicate the blocks directly among themselves.
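From the client's point of view, the write and read steps above are what happens underneath the Java FileSystem API. A minimal sketch, assuming an HDFS cluster whose configuration is on the classpath; the path and file content are purely illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings from the configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");   // illustrative path

        // Write: behind this stream the client asks the NameNode for target DataNodes
        // and then sends the blocks to them.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hadoop\n");
        out.close();

        // Read: behind this stream the client asks the NameNode where the blocks are
        // and then reads them from the DataNodes.
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}
```

The per-block conversation with the NameNode and DataNodes is hidden inside the streams, which is why the three-role interaction above rarely shows up in application code.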
Finally, a few design features of HDFS (well worth studying from a framework-design point of view):
- Block placement: with no special configuration, each block has three replicas by default: one on the DataNode specified by the NameNode, one on a DataNode on a different rack from the specified one, and the last on a DataNode on the same rack as the specified one. Replication exists purely for data safety, and this placement weighs the risk of a whole rack failing against the cost of copying data between racks. (A small configuration sketch follows this list.)
- Heartbeats check the health of the DataNodes; if a problem is found, data replication from the backups keeps the data safe.
- Data replication (needed when a DataNode fails, when DataNode storage utilization must be rebalanced, or when data-access pressure must be balanced): a threshold can be configured to balance disk utilization across DataNodes. For example, with the threshold set to 10%, running the balancer command computes the average disk utilization of all DataNodes; any DataNode whose utilization exceeds that average by more than the threshold has blocks moved to DataNodes with low utilization. This is very useful when new nodes are added.
- Data integrity: CRC32 checksums are used. When a block is written, checksum information is written alongside the data; when a block is read, the checksum is verified before the data is returned.
- The NameNode is a single point: in case it fails, the task-processing information is recorded in both the local file system and a remote file system.
- Pipelined data writing: when the client writes a file to the DataNodes, it first reads one block and writes it to the first DataNode; that DataNode then passes the block on to the backup DataNodes, and only when every DataNode that needs this block has written it successfully does the client move on to the next block.
- Safe mode: when the distributed file system starts up, it enters safe mode, during which the contents of the file system cannot be modified or deleted, until safe mode ends. Safe mode mainly exists so that, at startup, the system can check the validity of the data blocks on each DataNode and copy or delete blocks as the replication policy requires. Safe mode can also be entered by command at runtime. In practice, if you modify or delete files just after the system has started, you will get an error saying the file cannot be modified in safe mode; you only need to wait a short while.
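As a small illustration of the replication points above: dfs.replication is the standard HDFS property for the default number of copies per block, and the FileSystem API can also change the factor for an individual file. A sketch with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default number of replicas per block (3 unless configured otherwise).
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // The replication factor can also be adjusted per file after the fact.
        fs.setReplication(new Path("/logs/access.log"), (short) 2);   // illustrative path
        fs.close();
    }
}
```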
Combining MapReduce and HDFS, the overall Hadoop structure looks like this:
Figure 3: Hadoop Structure
In a Hadoop system there is a master that runs the NameNode and the JobTracker. The JobTracker is mainly responsible for starting, tracking, and scheduling the tasks executed on the slaves. There are also multiple slaves; each slave usually provides the DataNode function and runs a TaskTracker. The TaskTracker executes map tasks and reduce tasks on local data according to the application's requirements.
This brings us to the most important design point of distributed computing: moving computation is cheaper than moving data. In distributed processing, the cost of moving data is always higher than the cost of moving computation. In short, it is divide and conquer: data is stored in a distributed fashion, local tasks process local data, and only then are the results aggregated, which is what keeps distributed computing efficient.
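Pulling the pieces together, here is a minimal driver for the earlier word-count sketch, using the classic JobConf/JobClient API that submits work to the JobTracker. Input and output paths come from the command line; again, this is an illustration rather than the SIP code:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountJob.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Mapper and Reducer from the earlier sketch; the combiner runs the reduce
        // logic on each map's local output so less data has to cross the network.
        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job; the JobTracker schedules map tasks close to the data blocks.
        JobClient.runJob(conf);
    }
}
```

The combiner line is itself an example of "moving computation is cheaper than moving data": aggregation happens next to the data before anything is shuffled.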
Why Hadoop?
Now that What is done, let's briefly talk about Why. The official website gives plenty of explanations; here is a rough summary of its advantages and usage scenarios (there are no bad tools, only tools used in unsuitable scenarios, so only the right scenario lets distributed computing truly play its role):
- Scalable: both storage and computation scale out; scalability is fundamental to Hadoop's design.
- Economical: the framework can run on ordinary PCs.
- Reliable: the distributed file system's backup and recovery mechanisms, together with MapReduce's task monitoring, ensure the reliability of distributed processing.
- Efficient: the distributed file system's efficient data exchange, combined with MapReduce's local-data processing model, lays the groundwork for efficiently processing massive amounts of information.
Usage scenario: in my view, the best fit is the analysis of massive data. In fact, Google first proposed MapReduce precisely to analyze massive amounts of data; HDFS, for its part, was originally developed for a search-engine implementation and only later used in the distributed computing framework. Massive data is partitioned across multiple nodes, each node computes in parallel, and the results are merged into the output; at the same time, the output of one stage can serve as the input of the next, so you can picture a tree-structured distributed-computing graph whose different stages produce different outputs, with parallel and serial computation both handled efficiently on the resources of a distributed cluster. A two-stage chain of jobs is sketched below.
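As a sketch of such a multi-stage pipeline, the driver below chains two jobs: the first is the word-count job from the earlier sketches, and the second simply consumes the first job's output directory as its input (the paths are illustrative, and the built-in Identity classes stand in for real second-stage logic):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        Path rawLogs = new Path("/logs/raw");        // illustrative paths
        Path stage1Out = new Path("/logs/stage1");
        Path stage2Out = new Path("/logs/stage2");

        // Stage 1: the word-count job from the earlier sketches.
        JobConf first = new JobConf(TwoStagePipeline.class);
        first.setJobName("stage1-wordcount");
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        first.setMapperClass(WordCountMapper.class);
        first.setReducerClass(WordCountReducer.class);
        FileInputFormat.setInputPaths(first, rawLogs);
        FileOutputFormat.setOutputPath(first, stage1Out);
        JobClient.runJob(first);   // blocks until stage 1 finishes

        // Stage 2: reads stage 1's output; Identity classes just pass data through.
        JobConf second = new JobConf(TwoStagePipeline.class);
        second.setJobName("stage2-passthrough");
        second.setMapperClass(IdentityMapper.class);
        second.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(second, stage1Out);
        FileOutputFormat.setOutputPath(second, stage2Out);
        JobClient.runJob(second);
    }
}
```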