A file block is the primitive of the Hadoop file system, the smallest unit stored in the Hadoop Distributed File System (HDFS). A Hadoop file consists of a series of blocks scattered across different DataNodes.
2.7 DFSClient
The distributed file system client. Through it, a user obtains a client instance and interacts with the NameNode and DataNodes; the DFSClient communicates via the client protocol and the Hadoop file system API.
2.8 Lease
When a client creates or opens a file and prepares to write to it, the NameNode maintains a lease on the file to record who is writing to it. The client must renew the lease periodically; otherwise, when the lease expires, the NameNode closes the file or grants the lease on the file to another client.
2.9 LeaseRenewer
A lease-renewal management thread. When a DFSClient call requests a lease, the thread is started if it is not already running, and it then renews the lease with the NameNode periodically.
3. Creating a file
Once a Hadoop distributed cluster is started, a file can be created through the FS API or the shell. The code to create a file through the FS API is as follows:
// cluster is a Hadoop cluster; fs interacts with the cluster's file system
final DistributedFileSystem fs = cluster.getFileSystem();
// the file name to create
final Path TmpFile1 = new Path("/tmpfile1.dat");

// create the file
public static void createFile(FileSystem fs, Path fileName, long fileLen,
    short replFactor, long seed) throws IOException {
  if (!fs.mkdirs(fileName.getParent())) {
    throw new IOException("Mkdirs failed to create " + fileName.getParent().toString());
  }
  FSDataOutputStream out = null;
  try {
    out = fs.create(fileName, replFactor);
    byte[] toWrite = new byte[1024];
    Random rb = new Random(seed);
    long bytesToWrite = fileLen;
    while (bytesToWrite > 0) {
      rb.nextBytes(toWrite);
      int bytesToWriteNext = (1024 < bytesToWrite) ? 1024 : (int) bytesToWrite;
      out.write(toWrite, 0, bytesToWriteNext);
      bytesToWrite -= bytesToWriteNext;
    }
    out.close();
    out = null;
  } finally {
    IOUtils.closeStream(out);
  }
}
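For comparison, a file can also be created from the shell; for example, hdfs dfs -touchz /tmpfile1.dat creates an empty file, and hdfs dfs -put <localfile> /tmpfile1.dat uploads local content. The rest of this analysis follows the FS code above.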
4. Process analysis
Create a file named tmpfile1.dat; the main process is as follows:
4.1 Sending the file-creation request (createFile)
The client sends a create request to the NameNode to obtain the file information. The NameNode looks up the requested file entry in its cache and, if it is not found, creates a new file entry in the namesystem.
The block manager (BlockManager) checks whether the replication factor is within the allowed range and throws an exception if it is too small or too large.
Permission checks, encryption handling, and safe-mode detection (a file cannot be created while the NameNode is in safe mode) are also performed; the operation log and audit log are written, and the file status is returned to the client.
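As a rough illustration of these checks, here is a minimal, self-contained sketch; the method and parameter names are invented for this example and are not the real FSNamesystem code.

import java.io.IOException;

class CreateChecksSketch {
  // Validate a create request roughly the way the NameNode does in step 4.1.
  static void validateCreate(short replication, short minRepl, short maxRepl,
                             boolean inSafeMode, boolean hasPermission)
      throws IOException {
    if (replication < minRepl || replication > maxRepl) {
      // BlockManager-style check: replication factor out of range
      throw new IOException("Replication factor " + replication
          + " is out of range [" + minRepl + ", " + maxRepl + "]");
    }
    if (inSafeMode) {
      // files cannot be created while the NameNode is in safe mode
      throw new IOException("Cannot create file: NameNode is in safe mode");
    }
    if (!hasPermission) {
      throw new IOException("Permission denied");
    }
    // If all checks pass, the NameNode adds the file entry to the namesystem,
    // logs the operation, and returns the file status to the client.
  }
}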
4.2 Requesting a lease for the file (beginFileLease)
After the client obtains the file status, it requests a lease on the file. If the lease expires, the client can no longer access the file unless the lease is renewed.
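A hedged sketch of what such periodic renewal looks like, assuming a renewLease callback that wraps the RPC to the NameNode; the real DFSClient/LeaseRenewer logic is more involved.

class LeaseRenewalSketch implements Runnable {
  private final Runnable renewLease;     // assumed to wrap the renewLease RPC to the NameNode
  private final long renewIntervalMs;    // renewal period, well below the lease's expiry limit
  private volatile boolean running = true;

  LeaseRenewalSketch(Runnable renewLease, long renewIntervalMs) {
    this.renewLease = renewLease;
    this.renewIntervalMs = renewIntervalMs;
  }

  @Override
  public void run() {
    while (running) {
      renewLease.run();                  // keep the lease alive on the NameNode
      try {
        Thread.sleep(renewIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() { running = false; }       // corresponds to endFileLease in 4.8
}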
4.3 Data-flow control threads start (DataStreamer & ResponseProcessor)
The DataStreamer thread is responsible for the actual delivery of the data:
When the data queue is empty, it sleeps and wakes up periodically to check whether the data queue has new data to send and whether the socket has timed out; otherwise it goes back to sleep.
The ResponseProcessor thread receives and processes the acknowledgements (PipelineAck) sent back by the downstream nodes in the pipeline.
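A simplified sketch of that wait-and-send loop, using a plain byte[] in place of the real packet type; this is not the actual DFSOutputStream code.

import java.util.LinkedList;

class StreamerLoopSketch {
  private final LinkedList<byte[]> dataQueue = new LinkedList<>();
  private volatile boolean closed = false;

  void run() throws InterruptedException {
    while (!closed) {
      byte[] packet;
      synchronized (dataQueue) {
        while (dataQueue.isEmpty() && !closed) {
          dataQueue.wait(1000);          // periodic wake-up, e.g. to check for socket timeouts
        }
        if (closed) {
          return;
        }
        packet = dataQueue.removeFirst();
      }
      send(packet);                      // write the packet to the first node in the pipeline
    }
  }

  private void send(byte[] packet) {
    // write the packet to the DataNode socket (omitted)
  }
}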
4.4 Sending the add-block request and initializing the data pipeline (addBlock & setup pipeline)
When there is new data to send and the block construction stage is PIPELINE_SETUP_CREATE, the DataStreamer communicates with the NameNode and calls the addBlock method, asking the NameNode to create and allocate a new block and its locations. The pipeline is then initialized and the data stream is sent.
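Conceptually, the exchange can be pictured with the following simplified interface; the names and types are illustrative, not the exact ClientProtocol signature. The client then connects to the first target and streams packets to it, which is where the pipeline described in the following sections begins.

interface AddBlockSketch {
  // Ask the NameNode for a new block: it allocates the block and chooses the
  // DataNodes that will form the write pipeline for it.
  LocatedBlockStub addBlock(String src, String clientName) throws java.io.IOException;

  // The answer: the block id plus the ordered pipeline targets.
  class LocatedBlockStub {
    long blockId;
    String[] targetDataNodes;   // pipeline order: the first node receives from the client
  }
}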
4.5 DataNode data-receiving service threads start (DataXceiverServer & DataXceiver)
When a DataNode starts, its internal DataXceiverServer component starts. This thread manages the connections that send data to its DataNode; when a new connection arrives, DataXceiverServer starts a DataXceiver thread, which is responsible for receiving the data that flows to this DataNode.
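The pattern is essentially an accept loop that spawns one worker per connection; here is a minimal sketch under that assumption (the real DataXceiverServer uses Hadoop's own socket/peer wrappers and thread management):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

class XceiverServerSketch implements Runnable {
  private final ServerSocket serverSocket;

  XceiverServerSketch(int port) throws IOException {
    this.serverSocket = new ServerSocket(port);
  }

  @Override
  public void run() {
    while (!serverSocket.isClosed()) {
      try {
        Socket peer = serverSocket.accept();        // a new incoming data connection
        new Thread(() -> handle(peer)).start();     // one worker thread per connection
      } catch (IOException e) {
        break;                                      // server socket closed or failed
      }
    }
  }

  private void handle(Socket peer) {
    // read the op code and receive the block data (the DataXceiver role)
  }
}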
4.6 Sending and receiving data in the pipeline
After the client obtains the network locations of the file block allocated by the NameNode, it can interact with the DataNodes that will hold the block.
The client establishes SASL-encrypted connections with the DataNodes and sends the data through the pipeline.
4.6.1 Receiving data from the pipeline
The pipeline consists of one data source node and several data destination nodes; please refer to the flowchart above.
The first DataNode in the pipeline receives the data stream from the client through its internal DataXceiver component, which dispatches on the operation type (op) it reads, as follows:
protected final void processOp(Op op) throws IOException {
  switch (op) {
  case READ_BLOCK:
    opReadBlock();
    break;
  // in this example the WRITE_BLOCK instruction is used
  case WRITE_BLOCK:
    opWriteBlock(in);
    break;
  // ... other ops omitted
  default:
    throw new IOException("Unknown op " + op + " in data stream");
  }
}
If the op is WRITE_BLOCK, the method that writes the block is called; this method follows different logic depending on whether the data source is the client or another DataNode, on the block construction stage, and so on.
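A rough, comment-only sketch of that branching, with invented parameter names; it only mirrors the decisions described above, not the real opWriteBlock implementation.

class WriteBlockBranchSketch {
  // Rough shape of the decisions made when handling WRITE_BLOCK.
  void opWriteBlock(boolean sourceIsClient, String stage, String[] targets) {
    if (sourceIsClient) {
      // data comes directly from the writing client: this node is the pipeline head
    } else {
      // data comes from an upstream DataNode: this node is a mirror in the pipeline
    }
    if ("PIPELINE_SETUP_CREATE".equals(stage)) {
      // a brand-new block: create the replica on disk before receiving packets
    }
    if (targets.length > 0) {
      // there are downstream nodes: open a connection to the next one and forward the data
    }
  }
}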
4.6.2 Data flow in the pipeline
In this example, the first DataNode that receives the data starts a BlockReceiver thread to receive the actual block data; after the block data is saved locally, it is responsible for forwarding the block data to the subsequent DataNodes in the pipeline.
Each time the data is forwarded to the downstream DataNode, that node is removed from the targets array of destination nodes, so the remaining length of the pipeline is kept under control.
Each downstream DataNode that receives block data reports its receive status to the upstream DataNode or the client.
This chained, relay-style transfer, in which data flows from upstream to downstream, is why the structure is called a pipeline.
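A simplified sketch of this relay, with hypothetical helper methods standing in for the real BlockReceiver machinery:

import java.io.IOException;
import java.util.Arrays;

class PipelineRelaySketch {
  static void receivePacket(byte[] packet, String[] targets) throws IOException {
    saveLocally(packet);                         // persist the packet on this DataNode
    if (targets.length > 0) {
      String next = targets[0];                  // the immediate downstream node
      String[] remaining = Arrays.copyOfRange(targets, 1, targets.length);
      forward(next, packet, remaining);          // the downstream node repeats the same steps
    }
    // After writing locally (and hearing from downstream), send an ack upstream.
  }

  static void saveLocally(byte[] packet) {
    // write the packet to the local block file and checksum file (omitted)
  }

  static void forward(String node, byte[] packet, String[] targets) {
    // write the packet and the shortened target list to the next node's socket (omitted)
  }
}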
4.6.3 The life cycle of the pipeline
In this example:
After the DataStreamer thread starts, the pipeline enters the PIPELINE_SETUP_CREATE stage;
after the data stream is initialized, the pipeline enters the DATA_STREAMING stage;
after the data has been sent, the pipeline enters the PIPELINE_CLOSE stage.
After the DataStreamer thread starts, the client also starts a ResponseProcessor thread, which receives the data-receive status reports (PipelineAck) from the downstream nodes; this thread and the DataStreamer thread together coordinate and manage the pipeline state.
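The three stages named above correspond to values of Hadoop's BlockConstructionStage enum; a trimmed-down sketch (the real enum has additional values):

enum PipelineStageSketch {
  PIPELINE_SETUP_CREATE,   // the DataStreamer has started and the pipeline is being set up
  DATA_STREAMING,          // data packets are flowing through the pipeline
  PIPELINE_CLOSE           // all data has been sent and the pipeline is being closed
}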
When the DataStreamer sends data to the pipeline, it removes the sent packet from the data queue (dataQueue) and adds it to the acknowledgement queue (ackQueue):
// After DataStreamer sends the data, dequeue the first element of dataQueue and add it to ackQueue
one = dataQueue.getFirst();
dataQueue.removeFirst();
ackQueue.addLast(one);
When the ResponseProcessor receives a PipelineAck from downstream, it uses the acknowledgement to determine the state of the pipeline and whether the pipeline needs to be reset and rebuilt. If the acknowledgement indicates that the downstream nodes received the data successfully, the first packet in the acknowledgement queue (ackQueue) is removed.
// When ResponseProcessor receives a successful ack, remove the first packet from ackQueue
lastAckedSeqno = seqno;
ackQueue.removeFirst();
dataQueue.notifyAll();
In this way, the DataStreamer can confirm whether a packet was sent successfully and whether all packets have been sent.
Clearly, when the ackQueue is empty and the packet just sent is the last packet of the block, the data transfer is finished.
The check that the transfer is complete looks like this:
if (one.lastPacketInBlock) {
  // wait for all data packets to have been successfully acked
  synchronized (dataQueue) {
    while (!streamerClosed && !hasError && ackQueue.size() != 0
        && dfsClient.clientRunning) {
      try {
        // wait for acks to arrive from the DataNodes
        dataQueue.wait(1000);
      } catch (InterruptedException e) {
        DFSClient.LOG.warn("Caught exception", e);
      }
    }
  }
  if (streamerClosed || hasError || !dfsClient.clientRunning) {
    continue;
  }
  // no errors, the ackQueue is empty, and packet "one" is the last packet of the
  // block, so all data has been sent
  stage = BlockConstructionStage.PIPELINE_CLOSE;
}
4.7 Sending the file-completion request (completeFile)
The client sends a completeFile request to the NameNode:
After receiving the request, the NameNode verifies that the block's blockPoolId is correct, then checks the operation permissions, the file write lock, safe mode, the lease, the existence and type of the INode, and so on; finally it records the operation in the edit log and returns the result to the client.
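A hedged sketch of the shape of this exchange (simplified parameters, not the exact ClientProtocol signature):

interface CompleteFileSketch {
  // Tell the NameNode that all blocks of src have been written. Returns true
  // when the NameNode accepts the file as complete; the client may retry if
  // the last block's replicas have not yet been reported.
  boolean complete(String src, String clientName, long lastBlockId)
      throws java.io.IOException;
}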
4.8 Releasing the file lease (endFileLease)
After the client finishes writing the file, it calls the LeaseRenewer (LR) instance and removes the file from the set of files whose leases the LR renews, indicating that the lease will no longer be renewed; after a while, the lease expires naturally on the NameNode side.
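Continuing the lease sketch from 4.2, the end of the lease can be pictured as removing the file from the set of files being renewed (names are again hypothetical):

import java.util.HashSet;
import java.util.Set;

class EndLeaseSketch {
  private final Set<String> filesBeingWritten = new HashSet<>();

  synchronized void beginFileLease(String src) {
    filesBeingWritten.add(src);            // start renewing the lease for this file (4.2)
  }

  synchronized void endFileLease(String src) {
    filesBeingWritten.remove(src);         // stop renewing the lease for this file
    // Once the set is empty, the renewal thread has nothing left to do and can
    // exit; the lease then expires naturally on the NameNode side.
  }
}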
"Reprint please indicate the source, and keep the blog hyperlink and copyright notice"
"Copyright @foreach_break Blog http://blog.csdn.net/gsky1986"
"Hadoop" HDFS-Create file process details