Copyright notice: This is an original article by Xun Xunde; please indicate the source when reprinting.
Original article link: https://www.qcloud.com/community/article/258
Source: Tengyun https://www.qcloud.com/community
This document analyzes, from the source code's point of view, how HBase, acting as an HDFS client, writes the Hadoop sequence file to HDFS and how the data is finally flushed to disk.
The earlier article on the WAL threading model described how the WAL's write path produces a Hadoop sequence file. To keep data safe, HBase generally writes into HDFS (the Hadoop Distributed File System). That write path ends with Writer.append() for appending and Writer.sync() for flushing. But this is not really the end: to guarantee data safety, HDFS can, according to the user's configuration, write the data to multiple DataNode nodes. Neither HFiles nor the FSHLog are simply written or flushed to the real storage nodes, the DataNodes; what is involved is how the data stream (WALEntry) is written safely and efficiently to the DataNode files, and how the flush is actually carried out. This document follows HBase's write operation through the source code to see what exactly happens after Writer.append() and Writer.sync(), and how the data lands on disk.
The top-level structure of HBase's underlying storage, as described in "HBase: The Definitive Guide", shows that HBase handles two kinds of files: HFiles (produced when the MemStore is flushed) and HLog files (generated by the WAL). Both are managed by the HRegionServer, and when they are actually persisted to HDFS, the DFS client writes these two data streams, in large volumes, to multiple DataNode nodes.
In the document "Wal threading model Source Analysis" in order to highlight the focus on the Wal-threading model, does not specify the Writer.append () and Writer.sync () in the writer instance is what, An interface declared as a walprovider.writer type in fshlog by the volatile keyword modifier:
Its implementation class is in fact ProtobufLogWriter. This class also lives in the org.apache.hadoop.hbase.regionserver.wal package and acts, within the wal package, as the writer from the WAL to the DataNodes. In FSHLog it is instantiated through the factory method createWriterInstance(), after which its init() method is called to initialize it:
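A minimal sketch of that factory path, modeled on DefaultWALProvider.createWriter() in HBase 1.x (simplified; the configuration key and default class below are taken from that version):

```java
// Sketch of DefaultWALProvider.createWriter() (HBase 1.x, simplified):
// instantiate the configured writer class and initialize it via init().
public static DefaultWALProvider.Writer createWriter(Configuration conf,
    FileSystem fs, Path path, boolean overwritable) throws IOException {
  // hbase.regionserver.hlog.writer.impl defaults to ProtobufLogWriter.
  Class<? extends DefaultWALProvider.Writer> logWriterClass = conf.getClass(
      "hbase.regionserver.hlog.writer.impl", ProtobufLogWriter.class,
      DefaultWALProvider.Writer.class);
  try {
    DefaultWALProvider.Writer writer = logWriterClass.newInstance();
    writer.init(fs, path, conf, overwritable);
    return writer;
  } catch (Exception e) {
    throw new IOException("cannot get log writer", e);
  }
}
```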
As the source code shows, the real workhorse is an FSDataOutputStream, which writes the data into the newly created file; as mentioned above, it is initialized in the init() method of ProtobufLogWriter:
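The stream-creation part of that init() method, lightly simplified from the HBase 1.x sources (compression and trailer setup omitted):

```java
// From ProtobufLogWriter.init() (HBase 1.x, excerpt): build the
// FSDataOutputStream that all WAL appends will go through.
int bufferSize = FSUtils.getDefaultBufferSize(fs);
short replication = (short) conf.getInt("hbase.regionserver.hlog.replication",
    FSUtils.getDefaultReplication(fs, path));
long blockSize = conf.getLong("hbase.regionserver.hlog.blocksize",
    FSUtils.getDefaultBlockSize(fs, path));
// createNonRecursive() fails if the parent directory does not exist instead
// of creating it; "output" is the FSDataOutputStream the WAL writes into.
this.output = fs.createNonRecursive(path, overwritable, bufferSize,
    replication, blockSize, null);
```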
Here we only consider the case where HBase uses HDFS as its file system, i.e. the fs parameter of init() is an instance of DistributedFileSystem. Besides the path parameter, which specifies the path of the file to be created in HDFS, the important argument to createNonRecursive() is replication: it gives the number of backups, which is also the number of DataNode copies that will be written. The dfs field inside DistributedFileSystem is a reference to a DFSClient instance, which is the "DFS client" shown in the architecture diagram at the very beginning. HBase creates the file through DFSClient's create method, which issues an RPC call to HDFS's NameNode, constructs an instance of the output stream DFSOutputStream, and then starts a pipeline; the concrete call is streamer.start(), which is the implementation of HBase writing through a pipeline to multiple DataNodes in HDFS. Although what we analyze here is the WAL's write path, KeyValues written to the MemStore and later flushed into HFiles go through the same pipelined write.
The create RPC to the NameNode calls the FSNamesystem.startFile function, which in turn calls startFileInternal; this creates a new file in the "under construction" state, with no data blocks attached to it yet. On success the call returns an instance of type DFSOutputStream, held as wrappedStream inside the FSDataOutputStream, which handles the communication with the DataNodes and the NameNode.
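The flow can be seen in this condensed sketch of DFSOutputStream.newStreamForCreate() from the Hadoop 2.x sources (retry logic and a few parameters trimmed for readability):

```java
// Condensed from DFSOutputStream.newStreamForCreate() (Hadoop 2.x):
static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
    FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
    short replication, long blockSize, Progressable progress, int buffersize,
    DataChecksum checksum) throws IOException {
  // RPC to the NameNode: FSNamesystem.startFile() creates the file entry
  // in the "under construction" state; no blocks are allocated yet.
  final HdfsFileStatus stat = dfsClient.namenode.create(src, masked,
      dfsClient.clientName, new EnumSetWritable<CreateFlag>(flag),
      createParent, replication, blockSize);
  final DFSOutputStream out = new DFSOutputStream(dfsClient, src, stat,
      flag, progress, checksum, null);
  // Start the DataStreamer thread: it sets up and drives the DataNode
  // write pipeline for the stream's blocks.
  out.streamer.start();
  return out;
}
```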
A note on the HDFS file structure: an HDFS file consists of multiple blocks (64MB each by default). As the source comments explain, HDFS reads and writes blocks in units of packets (64KB per packet by default). Each packet is in turn made up of several chunks (512 bytes by default). The chunk is the basic unit of data verification: a checksum (4 bytes by default) is generated for each chunk and stored along with it.
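All three granularities are configurable. A small self-contained example that reads the effective values (the keys come from hdfs-default.xml; the defaults shown match the Hadoop versions this article discusses, and dfs.blocksize defaults to 128MB in later releases):

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsWriteGranularity {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Block: the unit of storage and replication.
    long blockSize = conf.getLong("dfs.blocksize", 64L * 1024 * 1024);
    // Packet: the unit of transfer on the write pipeline.
    int packetSize = conf.getInt("dfs.client-write-packet-size", 64 * 1024);
    // Chunk: the unit of checksumming (a 4-byte CRC per chunk).
    int bytesPerChecksum = conf.getInt("dfs.bytes-per-checksum", 512);
    System.out.printf("block=%d packet=%d chunk=%d%n",
        blockSize, packetSize, bytesPerChecksum);
  }
}
```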
From the analysis so far, it can be seen that there is nothing special about how HBase files are written to HDFS: HBase, as the HDFS client, packs the data into chunks, assembles the chunks into packets, and then writes the packets to the DataNodes in batches. To keep the data transfer ordered, DFSOutputStream uses a data-sending queue, dataQueue, and a to-be-acknowledged queue, ackQueue, together with two threads, DFSOutputStream$DataStreamer and DFSOutputStream$DataStreamer$ResponseProcessor (in its run() method), which respectively send the data to the corresponding block and confirm that the data has arrived.
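The hand-off between the writing thread and the DataStreamer is visible in this condensed excerpt from DFSOutputStream (Hadoop 2.x):

```java
// Condensed from DFSOutputStream (Hadoop 2.x). dataQueue and ackQueue are
// LinkedList<Packet> fields; both are guarded by dataQueue's monitor.
private void queueCurrentPacket() {
  synchronized (dataQueue) {
    if (currentPacket == null) return;
    dataQueue.addLast(currentPacket);   // producer side: enqueue for sending
    lastQueuedSeqno = currentPacket.seqno;
    currentPacket = null;
    dataQueue.notifyAll();              // wake the DataStreamer thread
  }
}
```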
Another important question is how HBase actually gets the data onto the DataNodes' disks.
Here we have to come back to the ProtobufLogWriter class, because Writer.sync() ends up calling ProtobufLogWriter's sync() method; its source code is as follows:
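Simplified from the HBase 1.x sources (the exact body varies slightly between versions):

```java
// ProtobufLogWriter.sync() (HBase 1.x, simplified): flush the stream,
// then hflush() it so buffered packets are pushed to the DataNodes.
@Override
public void sync() throws IOException {
  FSDataOutputStream fsdos = this.output;
  if (fsdos == null) {
    return; // Presume the writer was closed.
  }
  fsdos.flush();
  fsdos.hflush();
}
```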
Here, output is the FSDataOutputStream instance analyzed earlier. The sync() method calls FSDataOutputStream's flush() and hflush(). flush() actually does nothing (a no-op, as the source comments also explain), while hflush() invokes the hflush method of the DFSOutputStream class mentioned before. hflush() immediately sends all client-buffered data (packets) to the DataNodes and blocks until they report a successful write. After hflush(), a client-side failure no longer causes data loss, but if the DataNodes fail there is still a possibility of losing data. In addition, when the FSDataOutputStream is closed, one more flush operation is performed:
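For completeness, the close path, heavily condensed from DFSOutputStream.close() in Hadoop 2.x (error handling and lease bookkeeping omitted):

```java
// Condensed from DFSOutputStream.close() (Hadoop 2.x):
@Override
public synchronized void close() throws IOException {
  flushBuffer();                  // the "additional flush": drain the buffer
  if (currentPacket != null) {
    waitAndQueueCurrentPacket();  // queue the final data packet
  }
  if (bytesCurBlock != 0) {
    // an empty packet marks the end of the current block
    currentPacket = createPacket(0, 0, bytesCurBlock, currentSeqno++);
    currentPacket.lastPacketInBlock = true;
  }
  flushInternal();                // wait until all packets are acked
  ExtendedBlock lastBlock = streamer.getBlock();
  closeThreads(false);            // stop DataStreamer / ResponseProcessor
  completeFile(lastBlock);        // NameNode RPC: finalize the file
}
```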
As the comments explain, hflush is synchronous only in the sense that it guarantees new readers can see the data; it does not guarantee that the data has truly been persisted on every DataNode. In other words, the POSIX fsync() system call is not actually invoked: the client's data is merely pushed into the OS cache of each DataNode. If the DataNodes holding all replicas crash at the same time (for example, a machine-room power outage), data can still be lost.
HDFS also offers the client a second semantic, hsync: the client sends all of its data to every DataNode holding a replica, and each DataNode completes a POSIX fsync call, meaning the operating system has actually flushed the data to disk (though of course the disk itself may still buffer it). Note that when fsync is called, only the current block is flushed to disk; if every block is to be flushed, the sync flag must be passed in when the stream is created.
HBase currently chooses the hflush semantics. Both semantics go through the flushOrSync method: hflush calls it with isSync set to false, while hsync passes true.
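A minimal, self-contained illustration of the two durability levels from the client side (the path, sizes and replication here are arbitrary example values; CreateFlag.SYNC_BLOCK is the "sync flag" mentioned above):

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HflushVsHsync {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/hflush-demo");

    // SYNC_BLOCK asks HDFS to fsync completed blocks as well, not just the
    // block being written when hsync() is called.
    FSDataOutputStream out = fs.create(path, FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.SYNC_BLOCK),
        4096, (short) 3, 64L * 1024 * 1024, null);

    out.write("wal entry".getBytes(StandardCharsets.UTF_8));
    out.hflush(); // visible to new readers; sits in DataNode OS cache only
    out.hsync();  // fsync'ed to disk on each DataNode (current block)
    out.close();
  }
}
```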
The main job of this method is to flush the data still sitting in the client-side buffer out to the DataNodes. The key calls are flushBuffer(), waitAndQueueCurrentPacket() and waitForAckedSeqno(): waitAndQueueCurrentPacket() places the current packet on the send queue, and waitForAckedSeqno() waits until that packet has been acknowledged. The principle is the same as for a normal write: the data is packed into chunks, the chunks are assembled into packets, the packets are written to the DataNodes through the pipeline, and the disk is synced or not depending on whether the sync flag is set. A heavily condensed sketch follows.
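In Hadoop 2.x terms, roughly (block-boundary handling and error paths omitted):

```java
// Condensed from DFSOutputStream.flushOrSync() (Hadoop 2.x); hflush()
// calls this with isSync = false, hsync() with isSync = true.
private void flushOrSync(boolean isSync, EnumSet<SyncFlag> syncFlags)
    throws IOException {
  long toWaitFor;
  synchronized (this) {
    // Move bytes still in the client buffer into the current packet,
    // checksummed chunk by chunk.
    flushBuffer();
    if (currentPacket != null) {
      currentPacket.syncBlock = isSync;  // ask DataNodes to fsync on hsync
      waitAndQueueCurrentPacket();       // hand the packet to the DataStreamer
    }
    toWaitFor = lastQueuedSeqno;
  }
  // Block until acks for everything queued so far have come back.
  waitForAckedSeqno(toWaitFor);
}
```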
waitForAckedSeqno() waits for the acknowledgements in the ackQueue to come back and is woken up when they arrive:
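Condensed from the Hadoop 2.x sources, the wait loop looks roughly like this:

```java
// Condensed from DFSOutputStream.waitForAckedSeqno() (Hadoop 2.x).
// The ResponseProcessor thread removes packets from ackQueue as acks
// arrive, advances lastAckedSeqno, and notifies dataQueue's monitor.
private void waitForAckedSeqno(long seqno) throws IOException {
  synchronized (dataQueue) {
    while (!isClosed() && lastAckedSeqno < seqno) {
      try {
        dataQueue.wait(1000); // re-check periodically for stream errors
      } catch (InterruptedException ie) {
        throw new InterruptedIOException("Interrupted while waiting for acks");
      }
    }
  }
  checkClosed(); // surface any pipeline error recorded in the meantime
}
```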