1. NameNode startup: metadata loading scenario analysis
- On startup, the NameNode calls FSNamesystem, which reads dfs.namenode.name.dir and dfs.namenode.edits.dir to build the FSDirectory.
- FSImage's recoverTransitionRead and saveNamespace methods implement metadata checking, loading, in-memory merging, and persistent storage.
- saveNamespace writes the metadata to disk. Procedure: first, rename the current directory to lastcheckpoint.tmp; then create a new current directory and save the files into it; finally, rename lastcheckpoint.tmp to previous.checkpoint.
- Checkpoint process: the secondary NameNode notifies the NameNode to start a new edit log file, edits.new, and from then on all log operations are written to edits.new. Next, the secondary NameNode downloads the fsimage and edits files from the NameNode and merges them to generate a new image, fsimage.ckpt. The secondary NameNode then uploads fsimage.ckpt to the NameNode. Finally, the NameNode renames fsimage.ckpt to fsimage and edits.new to edits.
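The directory rotation that saveNamespace performs can be sketched as follows. This is a minimal illustration, not the Hadoop implementation: the class SaveNamespaceSketch, its helper deleteTree, and the single fsimage file are assumptions made for the example, and removing an older previous.checkpoint before the final rename is also an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SaveNamespaceSketch {

    // Recursively delete a directory tree if it exists (children before parents).
    static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            for (Path p : (Iterable<Path>) walk.sorted(Comparator.reverseOrder())::iterator) {
                Files.delete(p);
            }
        }
    }

    // Rotate one storage directory: current -> lastcheckpoint.tmp -> previous.checkpoint.
    static void saveNamespace(Path storageDir, byte[] image) throws IOException {
        Path current = storageDir.resolve("current");
        Path tmp = storageDir.resolve("lastcheckpoint.tmp");
        Path previous = storageDir.resolve("previous.checkpoint");

        // Step 1: move the existing image out of the way.
        if (Files.exists(current)) {
            Files.move(current, tmp);
        }
        // Step 2: write the new image into a fresh current directory.
        Files.createDirectories(current);
        Files.write(current.resolve("fsimage"), image);
        // Step 3: the old image becomes the previous checkpoint.
        if (Files.exists(tmp)) {
            deleteTree(previous); // assumption: an older previous.checkpoint is discarded
            Files.move(tmp, previous);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nn-storage");
        saveNamespace(dir, new byte[] {1});
        saveNamespace(dir, new byte[] {2});
        byte[] cur = Files.readAllBytes(dir.resolve("current").resolve("fsimage"));
        byte[] prev = Files.readAllBytes(dir.resolve("previous.checkpoint").resolve("fsimage"));
        System.out.println("current=" + cur[0] + " previous=" + prev[0]);
    }
}
```

If the process dies between steps, the presence of lastcheckpoint.tmp tells the next startup that a save was interrupted, which is the point of using distinct names for each stage.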
2. Metadata update and log writing scenario analysis, taking mkdir as an example. Analysis of the logSync code:

    public void logSync() throws IOException {
        ArrayList<EditLogOutputStream> errorStreams = null;
        long syncStart = 0;

        // Fetch the transactionId of this thread.
        long mytxid = myTransactionId.get().txid;
        EditLogOutputStream streams[] = null;
        boolean sync = false;
        try {
            synchronized (this) {
                assert editStreams.size() > 0 : "no editlog streams";
                printStatistics(false);

                // if somebody is already syncing, then wait
                while (mytxid > synctxid && isSyncRunning) {
                    try {
                        wait(1000);
                    } catch (InterruptedException ie) {
                    }
                }

                //
                // If this transaction was already flushed, then nothing to do
                //
                if (mytxid <= synctxid) {
                    numTransactionsBatchedInSync++;
                    if (metrics != null) // Metrics is non-null only when used inside name node
                        metrics.transactionsBatchedInSync.inc();
                    return;
                }

                // now, this thread will do the sync
                syncStart = txid;
                isSyncRunning = true;
                sync = true;

                // swap buffers
                for (EditLogOutputStream eStream : editStreams) {
                    eStream.setReadyToFlush();
                }
                streams =
                    editStreams.toArray(new EditLogOutputStream[editStreams.size()]);
            }

            // do the sync
            long start = FSNamesystem.now();
            for (int idx = 0; idx < streams.length; idx++) {
                EditLogOutputStream eStream = streams[idx];
                try {
                    eStream.flush();
                } catch (IOException ie) {
                    FSNamesystem.LOG.error("Unable to sync edit log.", ie);
                    //
                    // remember the streams that encountered an error.
                    //
                    if (errorStreams == null) {
                        errorStreams = new ArrayList<EditLogOutputStream>(1);
                    }
                    errorStreams.add(eStream);
                }
            }
            long elapsed = FSNamesystem.now() - start;
            processIOError(errorStreams, true);

            if (metrics != null) // Metrics non-null only when used inside name node
                metrics.syncs.inc(elapsed);
        } finally {
            synchronized (this) {
                synctxid = syncStart;
                if (sync) {
                    isSyncRunning = false;
                }
                this.notifyAll();
            }
        }
    }

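The core idea in logSync is double buffering with batched flushes: writers append to one buffer while a single thread flushes the other, and a thread whose transaction was already covered by someone else's flush returns immediately. The following is a simplified sketch of that pattern only. The class DoubleBufferSketch, its list-based buffers, and the durable list standing in for the disk are all assumptions made for illustration; error-stream handling and metrics are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubleBufferSketch {
    private List<String> bufCurrent = new ArrayList<>();    // receives new edits
    private List<String> bufReady = new ArrayList<>();      // handed to the flusher
    private final List<String> durable = new ArrayList<>(); // stands in for the disk
    private long txid = 0;      // last transaction id handed out
    private long synctxid = 0;  // last transaction id known to be flushed
    private boolean isSyncRunning = false;
    private int flushCount = 0;

    // Append an edit to the in-memory buffer and return its transaction id.
    public synchronized long logEdit(String op) {
        bufCurrent.add(op);
        return ++txid;
    }

    // Make every edit up to mytxid durable, reusing another thread's flush when possible.
    public void logSync(long mytxid) {
        long syncStart;
        synchronized (this) {
            // If somebody is already syncing, wait for them to finish.
            while (mytxid > synctxid && isSyncRunning) {
                try {
                    wait(1000);
                } catch (InterruptedException ignored) {
                    // Swallowed for brevity; the loop condition is simply rechecked.
                }
            }
            if (mytxid <= synctxid) {
                return; // already flushed as part of an earlier batch
            }
            syncStart = txid;
            isSyncRunning = true;
            // Swap buffers so writers can keep appending while this thread flushes.
            List<String> tmp = bufReady;
            bufReady = bufCurrent;
            bufCurrent = tmp;
        }
        // The slow I/O happens outside the lock; only one thread can be here at a time.
        durable.addAll(bufReady);
        bufReady.clear();
        synchronized (this) {
            flushCount++;
            synctxid = syncStart;
            isSyncRunning = false;
            notifyAll();
        }
    }

    public synchronized int flushCount() { return flushCount; }

    public synchronized int durableSize() { return durable.size(); }

    public static void main(String[] args) {
        DoubleBufferSketch log = new DoubleBufferSketch();
        long first = log.logEdit("mkdir /a");
        long second = log.logEdit("mkdir /b");
        log.logSync(second); // flushes both edits in one batch
        log.logSync(first);  // no-op: transaction 1 is already covered
        System.out.println(log.flushCount() + " flush, " + log.durableSize() + " edits durable");
    }
}
```

Holding the lock only for the buffer swap, not for the flush itself, is what lets many mkdir calls proceed while one disk write is in flight.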
3. Analysis of the checkpoint process on the backup node:

    /**
     * Create a new checkpoint
     */
    void doCheckpoint() throws IOException {
        long startTime = FSNamesystem.now();
        NamenodeCommand cmd =
            getNamenode().startCheckpoint(backupNode.getRegistration());
        CheckpointCommand cpCmd = null;
        switch (cmd.getAction()) {
        case NamenodeProtocol.ACT_SHUTDOWN:
            shutdown();
            throw new IOException("Name-node " + backupNode.nnRpcAddress
                + " requested shutdown.");
        case NamenodeProtocol.ACT_CHECKPOINT:
            cpCmd = (CheckpointCommand) cmd;
            break;
        default:
            throw new IOException("Unsupported NamenodeCommand: " + cmd.getAction());
        }

        CheckpointSignature sig = cpCmd.getSignature();
        assert FSConstants.LAYOUT_VERSION == sig.getLayoutVersion() :
            "Signature should have current layout version. Expected: "
            + FSConstants.LAYOUT_VERSION + " actual " + sig.getLayoutVersion();
        assert !backupNode.isRole(NamenodeRole.CHECKPOINT) ||
            cpCmd.isImageObsolete() : "checkpoint node should always download image.";
        backupNode.setCheckpointState(CheckpointStates.UPLOAD_START);
        if (cpCmd.isImageObsolete()) {
            // First reset storage on disk and memory state
            backupNode.resetNamespace();
            downloadCheckpoint(sig);
        }

        BackupStorage bnImage = getFSImage();
        bnImage.loadCheckpoint(sig);
        sig.validateStorageInfo(bnImage);
        bnImage.saveCheckpoint();

        if (cpCmd.needToReturnImage())
            uploadCheckpoint(sig);

        getNamenode().endCheckpoint(backupNode.getRegistration(), sig);

        bnImage.convergeJournalSpool();
        backupNode.setRegistration(); // keep registration up to date
        if (backupNode.isRole(NamenodeRole.CHECKPOINT))
            getFSImage().getEditLog().close();
        LOG.info("Checkpoint completed in "
            + (FSNamesystem.now() - startTime) / 1000 + " seconds."
            + " New Image Size: " + bnImage.getFsImageName().length());
    }

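Stripped of the RPC and storage details, a checkpoint amounts to loading the last image, replaying the edit log on top of it, and saving the result as the new image. The toy sketch below shows only that merge step. It assumes a namespace modeled as a sorted set of paths and an invented textual edit format ("mkdir <path>" and "delete <path>"); neither resembles the real fsimage or edits encoding, and the class name CheckpointMergeSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class CheckpointMergeSketch {

    // Replay the edits on a copy of the image and return the new image.
    static TreeSet<String> merge(TreeSet<String> image, List<String> edits) {
        TreeSet<String> result = new TreeSet<>(image);
        for (String edit : edits) {
            String[] parts = edit.split(" ", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed edit: " + edit);
            }
            switch (parts[0]) {
                case "mkdir":
                    result.add(parts[1]);
                    break;
                case "delete":
                    result.remove(parts[1]);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown operation: " + parts[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>(Arrays.asList("/", "/a"));
        List<String> edits = Arrays.asList("mkdir /b", "delete /a");
        System.out.println(merge(image, edits)); // prints [/, /b]
    }
}
```

Once the merged image is saved, the replayed edits are no longer needed, which is why the edit log can be truncated after a successful checkpoint.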
4. Metadata reliability mechanism.
- Configure multiple backup paths. When the NameNode updates logs or performs checkpoints, it writes the metadata to every configured directory.
- An output stream is created for each metadata file that needs to be saved. A stream that fails during access is recorded and removed (see the errorStreams handling and processIOError in logSync). Later, at an appropriate time, the removed storage directory is checked again to see whether it has recovered. This keeps a failure in one backup output stream from affecting the others.
- Several further mechanisms protect the metadata. For example, the checkpoint process passes through several stages, each identified by a different file name that records the current status. This makes it possible to recover from a storage failure.
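The remove-on-failure and later-restore behavior described above can be sketched roughly as follows. Everything here is hypothetical: RedundantJournalSketch, the in-memory Journal class, and the broken flag used to simulate a failed disk are invented for the example, and resynchronizing a restored journal by copying records from a healthy one is a simplifying assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RedundantJournalSketch {

    // Hypothetical in-memory stand-in for one edit log in one storage directory.
    static final class Journal {
        final String dir;
        final List<String> records = new ArrayList<>();
        boolean broken = false; // toggled by the caller to simulate a disk failure

        Journal(String dir) { this.dir = dir; }

        void write(String record) throws IOException {
            if (broken) {
                throw new IOException("cannot write to " + dir);
            }
            records.add(record);
        }
    }

    private final List<Journal> active = new ArrayList<>();
    private final List<Journal> removed = new ArrayList<>();

    RedundantJournalSketch(List<String> dirs) {
        for (String d : dirs) {
            active.add(new Journal(d));
        }
    }

    // Write to every active journal; drop the ones that fail and carry on.
    void log(String record) throws IOException {
        Iterator<Journal> it = active.iterator();
        while (it.hasNext()) {
            Journal j = it.next();
            try {
                j.write(record);
            } catch (IOException e) {
                it.remove();
                removed.add(j);
            }
        }
        if (active.isEmpty()) {
            throw new IOException("all journals failed");
        }
    }

    // Re-admit removed journals that work again, after copying over what they missed.
    void attemptRestore() {
        if (active.isEmpty()) {
            return;
        }
        List<String> good = new ArrayList<>(active.get(0).records);
        Iterator<Journal> it = removed.iterator();
        while (it.hasNext()) {
            Journal j = it.next();
            if (!j.broken) {
                j.records.clear();
                j.records.addAll(good);
                it.remove();
                active.add(j);
            }
        }
    }

    int activeCount() { return active.size(); }

    // Look a journal up by directory name in either list; returns null if unknown.
    Journal find(String dir) {
        for (Journal j : active) if (j.dir.equals(dir)) return j;
        for (Journal j : removed) if (j.dir.equals(dir)) return j;
        return null;
    }
}
```

A caller would invoke log for each edit and call attemptRestore at a quiet moment, for instance around a checkpoint, so that a directory that was only temporarily unavailable rejoins the set.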
5. Metadata consistency mechanism.
- When the NameNode starts, it checks whether each backup directory is formatted and whether the metadata file names in it are correct, to ensure that the metadata files are in a consistent state. It then selects the latest copy and loads it into memory. This ensures that the current state of HDFS is consistent with its state at the last shutdown.
- Second, removing abnormal output streams ensures that the remaining normal output streams all hold consistent data.
- Finally, synchronization (the synchronized blocks in logSync) ensures that the output streams are updated consistently.
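The startup selection in the first point can be illustrated with a small sketch: skip directories that are not formatted and pick the one with the most recent checkpoint. The StorageDir class and its checkpointTime field are assumptions made for this example; the text above says only that the latest copy is chosen, not how "latest" is determined.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LatestImageSketch {

    // Hypothetical summary of one storage directory as seen at startup.
    static final class StorageDir {
        final String path;
        final boolean formatted;
        final long checkpointTime; // assumed marker of how recent the image is

        StorageDir(String path, boolean formatted, long checkpointTime) {
            this.path = path;
            this.formatted = formatted;
            this.checkpointTime = checkpointTime;
        }
    }

    // Ignore unformatted directories and return the one with the newest checkpoint.
    static Optional<StorageDir> selectLatest(List<StorageDir> dirs) {
        return dirs.stream()
                .filter(d -> d.formatted)
                .max(Comparator.comparingLong((StorageDir d) -> d.checkpointTime));
    }

    public static void main(String[] args) {
        List<StorageDir> dirs = Arrays.asList(
                new StorageDir("/data1/name", true, 100L),
                new StorageDir("/data2/name", true, 250L),
                new StorageDir("/data3/name", false, 999L));
        System.out.println(selectLatest(dirs).map(d -> d.path).orElse("none"));
    }
}
```

In the example, /data3/name is skipped despite its larger timestamp because it fails the format check, so /data2/name is selected.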