When designing the SIP project, we initially considered using a multithreaded, task-decomposition approach to analyze and aggregate its huge logs, as described in my earlier article "Tiger Concurrent Practice - Parallel Decomposition Design and Implementation of Log Analysis". But because the statistics we needed were still very simple, we ended up using memcache as a counter, combined with MySQL, to handle access control and statistics. Still, we have to prepare for massive log analysis down the road. The hottest buzzword today is "cloud computing", and as open APIs become more and more popular, the data generated by Internet applications becomes more and more valuable. Analyzing that data and mining its intrinsic value requires distributed computing to support the analysis of massive data sets.
Looking back, the earlier multithreaded, multi-task decomposition of log analysis was really just a single-machine miniature of distributed computing; how to split that single machine's work across a cooperating cluster is precisely what a distributed computing framework has to address. At last year's BEA conference, BEA and VMware used virtual machines to build clusters, hoping that, to the application, the hardware would look like a resource pool, so that users would not need to care how resources are allocated and the value of the hardware could be maximized. Distributed computing works the same way: which machine a particular task runs on, and who aggregates the results afterwards, are decided by the framework's master. The user simply feeds the content to be analyzed into the distributed computing system as input and gets the computed result back.
Hadoop is an open-source distributed computing framework from Apache that is already used by many large websites, such as Amazon, Facebook and Yahoo. For me, the most immediate use case is log analysis for the service integration platform: its log volume will be very large, which is exactly one of the classic scenarios for distributed computing (log analysis and indexing being the two biggest).
We are not using it formally yet, so this is my own spare-time exploration. The articles that follow record a beginner's learning process, mistakes are unavoidable, and I simply want to write things down and share them with like-minded friends.
What is Hadoop?
Before doing anything, the first step is to know what, then why, and finally how. But many developers, after years of projects, are used to going straight to how, then what, and only finally why. That only makes us impatient, and it often leads to applying a technology to a scenario it does not suit.
The core of the Hadoop framework is two things: MapReduce and HDFS. The idea of MapReduce was spread widely by a Google paper; in one simple sentence, MapReduce is "the decomposition of tasks and the aggregation of their results". HDFS stands for Hadoop Distributed File System, and it provides the underlying storage support for distributed computing.
You can get a rough idea of MapReduce from its name: the two verbs Map and Reduce. "Map" breaks a task down into many tasks, and "Reduce" aggregates the results of those decomposed tasks to produce the final analysis result. This is not a new idea; you can find its shadow in the multithreaded, multi-task designs mentioned above. Whether in the real world or in programming, a job can be split into multiple tasks, and the relationships between tasks come in two kinds: tasks that are unrelated and can run in parallel, and tasks that depend on each other, whose order cannot be reversed and which therefore cannot be parallelized. Back in college, when the professor had us analyze critical paths, it was nothing more than finding the task decomposition that saves the most time. In a distributed system, the machine cluster can be treated as a hardware resource pool: split the job into parallel tasks, hand them to whatever machines are idle, and computational efficiency improves enormously. At the same time, this independence from any particular machine is the best possible guarantee that the computing cluster can be scaled out. (I have always thought Hadoop's mascot should be an ant rather than a little elephant: distributed computing is like ants eating an elephant, a group of cheap machines can match any high-performance computer, and the curve of vertical scaling always loses to the straight line of horizontal scaling.) After the tasks are decomposed, the processed results have to be consolidated, and that is what Reduce does.
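To make the idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce package): the map step emits (word, 1) pairs, and the reduce step sums the counts for each word. Class and variable names are my own; this just illustrates the decompose-and-aggregate pattern, it is not code from any particular project.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: split each input line into words and emit (word, 1) for every word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: aggregate all the (word, 1) pairs for the same word into a total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Each mapper only sees its own slice of the input, and each reducer only sees the values for the keys assigned to it, which is exactly the "unrelated tasks run in parallel, then aggregate" split described above.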
Figure 1: MapReduce structure diagram
The figure above shows the MapReduce structure. Before the map there may also be a split step that divides the input data, which guarantees that the tasks can run efficiently in parallel; after the map there is a shuffle step, which helps improve the efficiency of reduce and lowers the pressure of data transfer. These details will be covered in depth later.
HDFS is the storage cornerstone of distributed computing, and Hadoop's distributed file system shares many qualities with other distributed file systems. Some basic features of a distributed file system:
A single namespace for the entire cluster.
Data consistency: a write-once-read-many model; a file is not visible to clients until it has been successfully created.
Files are split into multiple blocks, each block is allocated to data nodes for storage, and the data is protected by replicated blocks according to the configuration.
Figure 2: HDFS structure diagram
The figure above shows the three important roles in HDFS: the NameNode, the DataNode, and the Client. The NameNode can be seen as the manager of the distributed file system; it is mainly responsible for the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system metadata in memory, including file information, the block list of each file, and the DataNodes holding each block. The DataNode is the basic unit of file storage: it stores blocks in its local file system, keeps the block metadata, and periodically reports all of its blocks to the NameNode. The Client is the application that needs to access files in the distributed file system. Three operations illustrate how they interact.
File write:
The client sends a file-write request to the NameNode.
The NameNode returns information about some of the DataNodes it manages, based on the file size and the block configuration.
The client splits the file into blocks and writes each block in sequence to the DataNodes at the returned addresses. (A minimal client-side sketch of this flow follows.)
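For reference, here is a hedged sketch of what the client side of this write flow looks like with the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and file path are placeholders, the configuration key assumes a Hadoop 2.x-style setup, and the block splitting and DataNode pipelining happen inside the stream returned by create().

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode where the blocks should go; the returned stream
        // splits the written bytes into blocks and sends them on to the DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/logs/access.log"))) {
            out.write("one log line\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```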
File read:
The client sends a file-read request to the NameNode.
The NameNode returns the DataNode information for the file's blocks.
The client reads the file data from those DataNodes. (A matching read sketch follows.)
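And the matching read side, again as a sketch with placeholder addresses and paths: open() obtains the block locations from the NameNode and then streams the data from the DataNodes that hold them.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);

        // open() returns a stream backed by the DataNodes that store the file's blocks.
        try (FSDataInputStream in = fs.open(new Path("/logs/access.log"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```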
File block replication:
The NameNode discovers that some blocks do not meet the minimum replica count, or that some DataNodes have failed.
It notifies DataNodes to replicate blocks to one another.
The DataNodes then copy the blocks among themselves directly.
Finally, a few of HDFS's design features (worth learning from for framework design in general):
Block placement: by default, each block has three replicas: one on the DataNode specified by the NameNode, another on a DataNode on a different rack from the specified one, and the last on another DataNode on the same rack as the specified one. Replication exists purely for data safety, and this placement takes into account both the failure of an entire rack and the performance cost of copying data between racks.
Heartbeats are used to monitor DataNode health; if a problem is found, data is re-replicated to keep it safe.
Data replication (scenarios: a DataNode fails, DataNode storage utilization needs to be balanced, or the data-transfer pressure on DataNodes needs to be balanced): using the HDFS balancer command you can configure a threshold to balance disk utilization across DataNodes. For example, with the threshold set to 10%, executing the balancer command first computes the average disk utilization of all DataNodes; for any DataNode whose utilization exceeds that average by more than the threshold, its blocks are moved to DataNodes with low disk utilization. This is especially useful when new nodes join the cluster.
Data integrity checking: CRC32 checksums are used. When a file block is written, checksum information is written along with the data; when the block is read, the checksum is verified before the data is used. (A small sketch of the idea appears after this list.)
The NameNode is a single point: if it fails, task-processing information is recorded in both the local file system and a remote file system.
Data pipeline writing: when a client writes a file to DataNodes, it first reads one block and writes it to the first DataNode; that DataNode then passes the block on to the next DataNode holding a replica, and so on until every DataNode that needs to store this block has written it successfully, after which the client moves on to the next block.
Safe mode: when the distributed file system starts up, it begins in safe mode; while in safe mode the contents of the file system may not be modified or deleted, until safe mode ends. Safe mode mainly exists so that, at startup, the validity of the blocks on each DataNode can be checked, and blocks can be copied or deleted according to policy as needed. Safe mode can also be entered by command at runtime. In practice, if you try to modify or delete files right after the system starts, you will see errors saying that safe mode does not allow modification; just wait a while.
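As promised above, here is a tiny sketch of the checksum idea behind the data-integrity feature, using plain java.util.zip.CRC32. It only illustrates the write-checksum / verify-on-read pattern; HDFS's own implementation stores checksums alongside each block and falls back to another replica when verification fails.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumIdea {
    public static void main(String[] args) {
        byte[] block = "contents of one file block".getBytes(StandardCharsets.UTF_8);

        // On write: compute a checksum and store it next to the data.
        CRC32 writeCrc = new CRC32();
        writeCrc.update(block);
        long storedChecksum = writeCrc.getValue();

        // On read: recompute and compare; a mismatch means the block is corrupt
        // and should be fetched from another replica instead.
        CRC32 readCrc = new CRC32();
        readCrc.update(block);
        if (readCrc.getValue() != storedChecksum) {
            throw new IllegalStateException("block corrupted, read another replica");
        }
        System.out.println("checksum OK: " + storedChecksum);
    }
}
```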
Now let's combine MapReduce and HDFS and look at the structure of Hadoop:
Figure 3: Hadoop structure diagram
In a Hadoop system there is a master, which is mainly responsible for the NameNode's work and the JobTracker's work. The JobTracker's primary responsibility is to launch, track, and schedule the task execution of each slave. There are also multiple slaves; each slave usually provides the DataNode functionality and is responsible for the TaskTracker's work. The TaskTracker executes map tasks and reduce tasks on local data according to the application's requirements.
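To see where the master fits in from the application's side, here is a hedged sketch of a driver that packages the mapper and reducer from the word-count sketch earlier into a job and submits it to the cluster for scheduling (the JobTracker in classic MapReduce, the ResourceManager in later YARN releases). The input and output paths are placeholders, and Job.getInstance assumes a Hadoop 2.x-style API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(WordCount.IntSumReducer.class);    // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; the framework splits the input,
        // schedules map tasks close to the data, and runs reduce over the map output.
        FileInputFormat.addInputPath(job, new Path("/logs/raw"));          // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/logs/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```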
This touches on the most important design point of distributed computing: Moving Computation is Cheaper than Moving Data. In distributed processing, the cost of moving data is always higher than the cost of moving computation. Put simply, to divide and conquer you have to spread the data out along with the computation: each node processes its local data and only the results are aggregated, and that is what makes distributed computing efficient.
Why choose Hadoop?
Having covered the what, here is a brief why. The official website already gives plenty of explanation; below is a general summary of its advantages and suitable scenarios (there are no bad tools, only tools used in the wrong place, so picking the right scenario is what lets distributed computing really pay off):
Scalability: both storage scalability and computational scalability are fundamental to Hadoop's design.
Economy: the framework can run on ordinary, commodity PCs.
Reliability: the backup and recovery mechanisms of the distributed file system, together with MapReduce's task monitoring, ensure the reliability of distributed processing.
Efficiency: the efficient data exchange of the distributed file system, combined with the MapReduce model of processing data locally, forms the basis for efficiently handling massive amounts of information.
Usage scenarios: personally I think the best fit is the analysis of massive data; in fact, Google first proposed MapReduce precisely for analyzing massive data, and HDFS was originally developed for a search-engine implementation before being used together with the distributed computing framework. Massive data is divided across multiple nodes, each node computes in parallel, and the results are merged into the output. At the same time, the output of one stage can serve as the input of the next stage, so you can imagine a tree-structured distributed computation with different outputs at different stages, in which parallel and serial computing together make efficient use of the cluster's resources. (A sketch of chaining two stages follows.)
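As a hedged sketch of the "output of one stage feeds the next" pattern, two jobs can simply be chained by pointing the second job's input path at the first job's output path. The stage names, paths, and the omitted mapper/reducer classes are all placeholders of my own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawLogs = new Path("/logs/raw");         // placeholder paths
        Path stageOneOut = new Path("/logs/stage1");
        Path finalOut = new Path("/logs/summary");

        // Stage 1: e.g. parse raw log lines into per-record keys.
        Job stage1 = Job.getInstance(conf, "stage-1");
        // stage1.setMapperClass(...); stage1.setReducerClass(...);
        FileInputFormat.addInputPath(stage1, rawLogs);
        FileOutputFormat.setOutputPath(stage1, stageOneOut);
        if (!stage1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2: the output of stage 1 becomes the input of stage 2.
        Job stage2 = Job.getInstance(conf, "stage-2");
        // stage2.setMapperClass(...); stage2.setReducerClass(...);
        FileInputFormat.addInputPath(stage2, stageOneOut);
        FileOutputFormat.setOutputPath(stage2, finalOut);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}
```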
Reposted from: http://blog.csdn.net/21aspnet/article/details/6620885