What is metadata? Baidu Encyclopedia explains it as "data that describes data": it mainly describes the properties of data and is used to support functions such as indicating storage locations, tracking history, looking up resources, and keeping file records. Metadata is an electronic catalogue; to build such a catalogue, you first have to describe and collect the content or characteristics of the data, and that in turn is what makes data retrieval possible. After all that, put simply: metadata is how you manage data.
Hadoop has two roles: the Namenode (a single master node) and the Datanode (multiple slave nodes). Datanodes mainly store data. The Namenode does three things: first, it manages the metadata of the files in the file system (file name, size, location, attributes, creation time, modification time, and so on); second, it maintains the mapping from files to blocks and from blocks to nodes; third, it keeps track of users' operations on files (additions, deletions, and so on).
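To see what this metadata looks like from a client's point of view, here is a small example using Hadoop's standard Java FileSystem API. The cluster address and the path /user/demo/sample.txt are assumptions for illustration; any existing HDFS file would do. It asks the Namenode for the file-level attributes and for the file-to-block and block-to-Datanode mappings mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS is assumed to point at the Namenode, e.g. hdfs://namenode:9000
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");  // hypothetical path

            // File-level metadata kept by the Namenode: name, size, times, owner
            FileStatus status = fs.getFileStatus(file);
            System.out.println("path   : " + status.getPath());
            System.out.println("length : " + status.getLen());
            System.out.println("mtime  : " + status.getModificationTime());
            System.out.println("owner  : " + status.getOwner());

            // File-to-block and block-to-Datanode mappings, also answered by the Namenode
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("block offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}

Note that the client never talks to a Datanode to answer these questions; every line of output comes from metadata held by the Namenode.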
Now suppose the metadata were stored only as a file on the Namenode's local hard disk. Would that work? No. A large number of clients upload, download, and perform other operations at the same time, all of which read, write, and modify the metadata, and a plain file on disk cannot respond to those operations quickly enough. Keeping the metadata in memory does speed up the system's response, but everything is lost the moment the power goes out, so that alone will not do either. What about periodically flushing the in-memory data to a disk file? Still no good: whatever has not yet been flushed when the power fails is lost all the same. So how does Hadoop actually manage metadata in practice?
First of all, there is indeed a block of space on disk where the metadata is persisted, in a file named fsimage. Reading that file directly for every request would be far too slow, so some of the metadata is also kept in memory; it is easy to lose, but it can serve queries quickly. This is essentially read/write separation, and read/write separation brings a data consistency problem: once a write happens, the in-memory copy is not where the latest metadata is safely recorded. The latest operations are instead recorded in a very small file that cannot be modified, only appended to, like a log; it stays at a few tens of megabytes and is called edits_*.log. For example, when a client uploads a file, it first asks the Namenode where to write. The Namenode allocates space and records that allocation in edits_*.log. When the client finishes writing one replica, it notifies the Namenode and the upload is considered successful; at that point the record in edits_*.log is applied to memory, so the in-memory metadata is now the latest. Even if the power goes out right then, the latest metadata has already been saved in edits_*.log.
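The key point is the ordering: append to the log first, update memory second. The following toy sketch illustrates that write-ahead idea only; the ToyNamespace class, its one-line log format, and the recordFile method are invented for illustration and are not the Namenode's actual code.

import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of "append to the edit log first, then update memory". */
public class ToyNamespace {
    private final Map<String, Long> fileSizes = new HashMap<>(); // in-memory metadata
    private final Path editLog;                                   // stands in for edits_*.log

    public ToyNamespace(Path editLog) {
        this.editLog = editLog;
    }

    public synchronized void recordFile(String name, long size) throws IOException {
        // 1. Durably append the operation to the log (write-ahead).
        try (Writer out = Files.newBufferedWriter(editLog,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write("ADD " + name + " " + size + System.lineSeparator());
        }
        // 2. Only then apply it to the in-memory view used to answer queries.
        fileSizes.put(name, size);
        // If power is lost after step 1, the operation can be replayed from the log;
        // the last fsimage plus this log reconstructs the latest state.
    }
}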
Let's look back at the whole process:
1. When a client writes a file, the Namenode first records the metadata operation in the edits_*.log file.
2. The client then uploads the file and reports success to the Namenode, and the Namenode writes the newly generated metadata for this upload into memory. The edits_*.log file has a bounded, fairly small size, while fsimage is an image of the in-memory metadata: fsimage is the most complete copy, edits_*.log holds the most recent operations, and the update order is edits_*.log first, then memory, and fsimage last. So when is fsimage updated, and how are memory and fsimage kept consistent? As long as edits_*.log is not full, no synchronization is needed. The checkpoint operation here means that whenever edits_*.log fills up, the new metadata accumulated during that period must be flushed into fsimage, i.e. edits_*.log and fsimage are merged.
3. To avoid hurting the Namenode's response speed, the merge of edits_*.log and fsimage is done by the Secondarynamenode. When edits_*.log is full, the Namenode notifies the Secondarynamenode to run a checkpoint and stops writing to the current edits file, directing new operations to a fresh edits file. The Secondarynamenode downloads the fsimage and edits files, merges them into a new fsimage, and sends the new image back to the Namenode, which replaces the old fsimage, deletes the old edits file, and renames the new edits file to edits_*.log. With this design, no data is lost even if the power fails at any point in the process; the merge step is sketched below.
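Conceptually the checkpoint is just "old image + edit log = new image". The toy sketch below shows that merge; the ToyCheckpoint class and its one-line text formats are invented for illustration and have nothing to do with the real binary formats of fsimage and the edits files.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch of a checkpoint: replay the edit log over the old image and write a new image. */
public class ToyCheckpoint {
    public static void merge(Path oldImage, Path edits, Path newImage) throws IOException {
        Map<String, Long> namespace = new HashMap<>();

        // 1. Load the last full image (one entry per line: "<name> <size>").
        if (Files.exists(oldImage)) {
            for (String line : Files.readAllLines(oldImage)) {
                String[] parts = line.split(" ");
                namespace.put(parts[0], Long.parseLong(parts[1]));
            }
        }

        // 2. Replay the edit-log records written since that image ("ADD <name> <size>").
        for (String line : Files.readAllLines(edits)) {
            String[] parts = line.split(" ");
            if (parts[0].equals("ADD")) {
                namespace.put(parts[1], Long.parseLong(parts[2]));
            }
        }

        // 3. Write the merged namespace out as the new image. The caller then swaps it in
        //    for the old fsimage and continues with a fresh, empty edit log.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : namespace.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        Files.write(newImage, lines);
    }
}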