HDFS Memory Storage

Preface

The previous article focused on HDFS centralized cache management. This article continues with the related topic of HDFS memory storage. With cacheadmin, the target files are cached in DataNode memory; but there is another case in which data can live in DataNode memory: the LAZY_PERSIST memory storage policy mentioned in the earlier article on HDFS heterogeneous storage. In other words, this article is a finer-grained analysis of the HDFS memory storage strategy. Such an analysis is worthwhile, given how much LAZY_PERSIST memory storage differs from the other types of storage policies.

How HDFS Memory Storage Works

People tend to hold one of two views about memory storage. The first: the data is kept only in memory, and once the service stops, it is all gone. The second: the data also lives in memory, but when the service stops it is persisted and finally written to disk.

Look closely at these two ideas and both have flaws.

First, under the first view, once the service stops all in-memory data is lost, which is unacceptable: we can tolerate losing a small amount of in-memory data, but losing everything is hard to deal with. It is also impractical, because memory space is limited; if part of the data is never flushed out in time, memory will sooner or later be exhausted.

Then there is the second view, which performs a persistence pass when the service shuts down. It faces the same memory-space limitation mentioned above. And even assuming the machine's memory is large enough, that final write to disk cannot be fast, because there may be a great deal of data.

So the generally accepted good practice is asynchronous persistence. What does that mean?

While memory is storing new data, the data farthest from the current time (that is, stored earliest) is persisted to disk.

To put it plainly: imagine a queue of memory blocks. New blocks to be stored are constantly inserted at the head of the queue. Because resources are limited, I persist the block at the tail of the queue, that is, the oldest block, to disk, which frees room for new blocks. This forms a loop: new blocks are added, old blocks are moved out, and the data kept in memory stays fresh.
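To make this loop concrete, here is a minimal, hypothetical Java sketch. Every name in it is invented for illustration; it is not the HDFS implementation, which is analyzed later in this article.

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Hypothetical sketch of the head-insert / tail-persist loop described above.
  public class MemoryBlockQueueSketch {
    private final Deque<byte[]> blocks = new ArrayDeque<>();
    private final long capacityBytes;
    private long usedBytes = 0;

    public MemoryBlockQueueSketch(long capacityBytes) {
      this.capacityBytes = capacityBytes;
    }

    // New blocks enter at the head of the queue.
    public synchronized void store(byte[] block) {
      // Persist the oldest blocks (the queue tail) until there is room.
      while (usedBytes + block.length > capacityBytes && !blocks.isEmpty()) {
        byte[] oldest = blocks.removeLast();
        persistToDisk(oldest);
        usedBytes -= oldest.length;
      }
      blocks.addFirst(block);
      usedBytes += block.length;
    }

    private void persistToDisk(byte[] block) {
      // In HDFS the disk write happens asynchronously on a separate thread;
      // left empty here because this is only an illustration.
    }
  }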

HDFS's LAZY_PERSIST memory storage policy takes exactly this approach. Here is a schematic:

The principle described above corresponds to steps 4 and 6 in the figure: write data to RAM, then asynchronously write it to disk. The earlier steps set up the StorageType, which is covered below. The overall flow shown in the figure can be summarized as follows. Step one: set the storage policy (StoragePolicy) of the target file or directory to LAZY_PERSIST. Step two: the client process initiates a create/write request to the NameNode. Step three: the request then reaches the specific DataNode, which writes the blocks into RAM and starts an asynchronous thread service to persist the in-memory data to disk.

Asynchronous persistence is the clearest difference between memory storage and storage on other media. It should also be the source of the name LAZY_PERSIST: the data is not written to disk immediately, but in a "lazy persist" fashion, processed with a delay.

Linux Virtual Memory Disk

One extra piece of background is needed here: the Linux virtual memory disk. I had long wondered: can memory really be used as a block disk? Isn't memory meant for temporary data? So before studying this module I deliberately looked it up. In fact, in Linux there is a technique for simulating a disk in memory, called the RAM disk. It is a simulated disk whose real data resides in memory. A RAM disk can be used together with certain memory-based file systems, such as tmpfs or ramfs (see the tmpfs reference link below). With this technique, we can take machine memory and use it as an independent virtual disk for the DataNode.

Analysis of the Memory Storage Process in HDFS

The following is the core of this article: the main process of HDFS memory storage. Do not underestimate it because it is just a single storage policy; the process is not simple, and I will provide process diagrams along the way to aid understanding.

Setting Memory Storage for an HDFS File

To store a file's data in memory, the first thing to do is set the storage policy of that file to the LAZY_PERSIST mentioned above, rather than the default StoragePolicy (DEFAULT), whose storage medium is the DISK type. There are two ways to set a storage policy. The first is on the command line, invoking the following command:

hdfs storagepolicies -setStoragePolicy -path <path> -policy LAZY_PERSIST

This is convenient and quick. The second way is programmatic: call the corresponding method, such as the externally exposed create-file method, passing the CreateFlag.LAZY_PERSIST parameter. For example:

FSDataOutputStream fos =
    fs.create(
        path,
        FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST),
        bufferLength,
        replicationFactor,
        blockSize,
        null);

The method above ultimately calls the DFSClient create method of the same name, as follows:

  /**
   * Call {@link #create(String, FsPermission, EnumSet, boolean, short,
   * long, Progressable, int, ChecksumOpt)} with <code>createParent</code>
   * set to true.
   */
  public DFSOutputStream create(String src, FsPermission permission,
      EnumSet<CreateFlag> flag, short replication, long blockSize,
      Progressable progress, int buffersize, ChecksumOpt checksumOpt)
      throws IOException {
    return create(src, permission, flag, true,
        replication, blockSize, progress, buffersize, checksumOpt, null);
  }

This method is called down through the RPC layer, passing through FSNamesystem and finally reaching the FSDirWriteFileOp startFile method, inside which the policy-setting action takes place:

  static HdfsFileStatus startFile(
      FSNamesystem fsn, FSPermissionChecker pc, String src,
      PermissionStatus permissions, String holder, String clientMachine,
      EnumSet<CreateFlag> flag, boolean createParent,
      short replication, long blockSize,
      EncryptionKeyInfo ezInfo, INode.BlocksMapUpdateInfo toRemoveBlocks,
      boolean logRetryEntry)
      throws IOException {
    assert fsn.hasWriteLock();

    boolean create = flag.contains(CreateFlag.CREATE);
    boolean overwrite = flag.contains(CreateFlag.OVERWRITE);
    // Check whether CreateFlag carries the LAZY_PERSIST identifier,
    // i.e. whether this is the memory storage policy
    boolean isLazyPersist = flag.contains(CreateFlag.LAZY_PERSIST);
    ...
    // Then the policy is set here
    setNewINodeStoragePolicy(fsd.getBlockManager(), newNode, iip, isLazyPersist);
    fsd.getEditLog().logOpenFile(src, newNode, overwrite, logRetryEntry);
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: added " +
          src + " inode " + newNode.getId() + " " + holder);
    }
    return FSDirStatAndListingOp.getFileInfo(fsd, src, false, isRawPath);
  }

The call graph for this part of the procedure is as follows:

OK, the above is the preliminary storage-policy setup process; this part is quite straightforward.

LAZY_PERSIST Memory Storage

Here we jump directly to how the DataNode performs memory storage once a file has been set to the LAZY_PERSIST storage mode. I will introduce it by sub-module and by role.

LAZY_PERSIST Related Structures

As already mentioned, while data is being written there is another batch of data being persisted asynchronously, so several service objects must cooperate closely. The conductor of these service objects is FsDatasetImpl, the steward of all disk read/write data on the DataNode.

In FsDatasetImpl, there are three service objects associated with memory storage.

They are the following:

LazyWriter: a thread service that loops continuously, taking data blocks from the block list and handing them to the asynchronous persistence thread pool, RamDiskAsyncLazyPersistService.

RamDiskAsyncLazyPersistService: the asynchronous persistence thread service, which sets up a corresponding thread pool for each disk volume. Data blocks that need to be persisted to a given disk are submitted to that volume's pool. The maximum number of threads per pool is 1.

RamDiskReplicaLruTracker: a replica tracking class that maintains information about all replica blocks, both persisted and not yet persisted. When a replica's state changes, for example when it is first stored in memory or finishes persisting, the queues it belongs to are updated accordingly. And when node memory runs low, the replica blocks least recently accessed are evicted through this class.

The close cooperation of these three realizes HDFS memory storage. Each role is described in detail below.

RamDiskReplicaLruTracker

Of the three, RamDiskReplicaLruTracker plays the intermediary role, because it internally maintains block information in several relationships, mainly the following three:

public class RamDiskReplicaLruTracker extends RamDiskReplicaTracker {
  ...
  /**
   * Map of blockpool ID to <map of blockId to ReplicaInfo>.
   */
  Map<String, Map<Long, RamDiskReplicaLru>> replicaMaps;

  /**
   * Queue of replicas that need to be written to disk.
   * Stale entries are GC'd by dequeueNextReplicaToPersist.
   */
  Queue<RamDiskReplicaLru> replicasNotPersisted;

  /**
   * Map of persisted replicas ordered by their last use times.
   */
  TreeMultimap<Long, RamDiskReplicaLru> replicasPersisted;
  ...

The queue here, replicasNotPersisted, holds the in-memory blocks awaiting persistence. The relationships among these three variables are shown below:

The majority of the methods in RamDiskReplicaLruTracker add to or remove from these three variables, so the logic is not complicated; we just need to understand what the methods do. I divide them into two categories:

The first category: methods related to the asynchronous persistence operation. Figure:

When the node restarts, or a new file is written with the LAZY_PERSIST policy, a new replica block is stored in memory and also added to the replicasNotPersisted queue. Later, dequeueNextReplicaToPersist takes out the next replica block to be persisted and performs the disk write. recordStartLazyPersist and recordEndLazyPersist are called during persistence to mark the change of persistence state.

The second category: methods not directly related to the asynchronous persistence operation. Figure:

There are three such methods:

discardReplica: when a replica is detected as no longer needed, including when it has been deleted or found corrupt, it can be removed from memory.

touch: the same idea as the Linux touch command; it means a particular replica block was accessed once, and it updates that block's lastUsedTime. lastUsedTime plays a key role in the LRU algorithm discussed below.

getNextCandidateForEviction: called when the DataNode is short of memory and needs to reserve space for new replica blocks. It selects the block to evict according to the configured eviction scheme; the default is the LRU policy.

A term that keeps coming up here is LRU, short for Least Recently Used (see the reference link below). The benefit of getNextCandidateForEviction using this algorithm is that it keeps the replica blocks with an active access level and evicts those least recently accessed. To understand this operation in detail, it helps to first see the LRU idea in isolation; a minimal sketch follows, after which we return to the HDFS code.
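As a standalone illustration of the LRU idea (this is not the HDFS code; it is a minimal sketch built on the JDK's LinkedHashMap in access order):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Minimal LRU cache sketch: entries are kept in access order,
  // and the least recently used entry is evicted once capacity is exceeded.
  public class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruSketch(int maxEntries) {
      super(16, 0.75f, true); // accessOrder = true keeps access ordering
      this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
      return size() > maxEntries; // drop the least recently used entry
    }
  }

In this sketch, a get or put counts as an access and moves the entry to the most recently used position, which mirrors what touch does with lastUsedTime in the HDFS code below.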

First, in the HDFS code, touch updates the replica's last access time:

  synchronized void touch(final String bpid,
                          final long blockId) {
    Map<Long, RamDiskReplicaLru> map = replicaMaps.get(bpid);
    RamDiskReplicaLru ramDiskReplicaLru = map.get(blockId);

    ...

    // Reinsert the replica with its new timestamp:
    // update the last access timestamp and reinsert the entry
    if (replicasPersisted.remove(ramDiskReplicaLru.lastUsedTime, ramDiskReplicaLru)) {
      ramDiskReplicaLru.lastUsedTime = Time.monotonicNow();
      replicasPersisted.put(ramDiskReplicaLru.lastUsedTime, ramDiskReplicaLru);
    }
  }

The second step is to get the candidate block for eviction:

  synchronized RamDiskReplicaLru getNextCandidateForEviction() {
    // Get an iterator over replicasPersisted for traversal
    final Iterator<RamDiskReplicaLru> it = replicasPersisted.values().iterator();
    while (it.hasNext()) {
      // Because replicasPersisted is sorted by time, the current block
      // is the least recently used one
      final RamDiskReplicaLru ramDiskReplicaLru = it.next();
      it.remove();

      Map<Long, RamDiskReplicaLru> replicaMap =
          replicaMaps.get(ramDiskReplicaLru.getBlockPoolId());

      if (replicaMap != null &&
          replicaMap.get(ramDiskReplicaLru.getBlockId()) != null) {
        return ramDiskReplicaLru;
      }

      // The replica no longer exists, look for the next one.
    }
    return null;
  }

What is interesting here is that the candidate is selected by the access times of the persisted blocks, not of the in-memory blocks directly. Finally, the in-memory block belonging to the same replica as the candidate is removed, freeing up memory space:

    /**
     * Attempt to evict one or more transient block replicas until we
     * have at least bytesNeeded bytes free.
     */
    public void evictBlocks(long bytesNeeded) throws IOException {
      int iterations = 0;

      final long cacheCapacity = cacheManager.getCacheCapacity();

      // Loop while the free memory does not satisfy the requested size
      while (iterations++ < MAX_BLOCK_EVICTIONS_PER_ITERATION &&
             (cacheCapacity - cacheManager.getCacheUsed()) < bytesNeeded) {
        // Get the replica to evict
        RamDiskReplica replicaState =
            ramDiskReplicaTracker.getNextCandidateForEviction();

        if (replicaState == null) {
          break;
        }

        if (LOG.isDebugEnabled()) {
          LOG.debug("Evicting block " + replicaState);
        }

        ...

        // Remove the related block from memory and free up space:
        // delete the block+meta files from RAM disk and release locked memory
        removeOldReplica(replicaInfo, newReplicaInfo, blockFile, metaFile,
            blockFileUsed, metaFileUsed, bpid);
      }
    }
LazyWriter

LazyWriter is a thread service: an engine that constantly loops, pulling blocks to be persisted out of the queue and committing them to the asynchronous persistence service. Look directly at the main run method:

    public void run() {
      int numSuccessiveFailures = 0;

      while (fsRunning && shouldRun) {
        try {
          // Take out a new replica block and submit it to the async service;
          // returns whether the submission succeeded
          numSuccessiveFailures = saveNextReplica() ? 0 : (numSuccessiveFailures + 1);

          // Sleep if we have no more work to do or if it looks like we are not
          // making any forward progress. This is to ensure that if all persist
          // operations are failing we don't keep retrying them in a tight loop.
          if (numSuccessiveFailures >= ramDiskReplicaTracker.numReplicasNotPersisted()) {
            Thread.sleep(checkpointerInterval * 1000);
            numSuccessiveFailures = 0;
          }
        } catch (InterruptedException e) {
          LOG.info("LazyWriter was interrupted, exiting");
          break;
        } catch (Exception e) {
          LOG.warn("Ignoring exception in LazyWriter:", e);
        }
      }
    }

Now enter the processing of the saveNextReplica method:

    private boolean saveNextReplica() {
      RamDiskReplica block = null;
      FsVolumeReference targetReference;
      FsVolumeImpl targetVolume;
      ReplicaInfo replicaInfo;

      boolean succeeded = false;
      try {
        // Remove a new block from the queue for persistence
        block = ramDiskReplicaTracker.dequeueNextReplicaToPersist();
        if (block != null) {
          synchronized (FsDatasetImpl.this) {
            ...
            // Submit to the async service
            asyncLazyPersistService.submitLazyPersistTask(
                block.getBlockPoolId(), block.getBlockId(),
                replicaInfo.getGenerationStamp(), block.getCreationTime(),
                replicaInfo.getMetaFile(), replicaInfo.getBlockFile(),
                targetReference);
          }
        }
        succeeded = true;
      } catch (IOException ioe) {
        LOG.warn("Exception saving replica " + block, ioe);
      } finally {
        if (!succeeded && block != null) {
          LOG.warn("Failed to save replica " + block + ". Re-enqueueing it.");
          onFailLazyPersist(block.getBlockPoolId(), block.getBlockId());
        }
      }
      return succeeded;
    }

So the flow of the LazyWriter thread service can be summarized as follows:

Combining LazyWriter with the RamDiskReplicaTracker tracking service, we get the following complete process (ignoring RamDiskAsyncLazyPersistService's internal execution logic):

RamDiskAsyncLazyPersistService

The last part, the asynchronous service itself, is relatively simple, revolving mainly around two things: the disk volume and the executor thread pool. It upholds the following principle:

Each disk volume corresponds to one thread pool, and the maximum number of threads per pool is only 1.

The thread pool map is defined as follows:

class RamDiskAsyncLazyPersistService {
  ...
  private Map<File, ThreadPoolExecutor> executors
      = new HashMap<File, ThreadPoolExecutor>();
  ...

The File key here represents the directory of an individual disk volume; personally I think it could just as well be replaced by a String, which would use less space and be more straightforward. Either way, you can see the relationship is one to one.
When the service starts, new disk directories are added:

  synchronized void addVolume(File volume) {
    if (executors == null) {
      throw new RuntimeException("AsyncLazyPersistService is already shutdown");
    }
    ThreadPoolExecutor executor = executors.get(volume);
    // If a thread pool already exists for this disk directory, throw
    if (executor != null) {
      throw new RuntimeException("Volume " + volume + " is already existed.");
    }
    // Otherwise add one
    addExecutorForVolume(volume);
  }

Enter the addExecutorForVolume method:

  private void addExecutorForVolume(final File volume) {
    ...
    // New thread pool; the maximum number of executing threads is 1
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        CORE_THREADS_PER_VOLUME, MAXIMUM_THREADS_PER_VOLUME,
        THREADS_KEEP_ALIVE_SECONDS, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>(), threadFactory);

    // This reduces the number of idle threads kept alive
    executor.allowCoreThreadTimeOut(true);

    // Add to executors, with the volume as the key
    executors.put(volume, executor);
  }

One more thing to note is the task-submission method, submitLazyPersistTask:

  void submitLazyPersistTask(String bpid, long blockId,
      long genStamp, long creationTime,
      File metaFile, File blockFile,
      FsVolumeReference target) throws IOException {
    if (LOG.isDebugEnabled()) {
      LOG.debug("LazyWriter schedule async task to persist RamDisk block pool id: "
          + bpid + " block id: " + blockId);
    }

    // Get the target disk instance to persist to
    FsVolumeImpl volume = (FsVolumeImpl) target.getVolume();
    File lazyPersistDir = volume.getLazyPersistDir(bpid);
    if (!lazyPersistDir.exists() && !lazyPersistDir.mkdirs()) {
      FsDatasetImpl.LOG.warn("LazyWriter failed to create " + lazyPersistDir);
      throw new IOException("LazyWriter fail to find or create lazy persist dir: "
          + lazyPersistDir.toString());
    }

    // Create the task for this service
    ReplicaLazyPersistTask lazyPersistTask = new ReplicaLazyPersistTask(
        bpid, blockId, genStamp, creationTime, blockFile, metaFile,
        target, lazyPersistDir);
    // Execute in the thread pool of the corresponding volume
    execute(volume.getCurrentDir(), lazyPersistTask);
  }

If a failure occurs during execution, the failure handler is called and the replica block is re-inserted into the replicasNotPersisted queue for the next round of persistence:

  public void onFailLazyPersist(String bpid, long blockId) {
    RamDiskReplica block = null;
    block = ramDiskReplicaTracker.getReplica(bpid, blockId);
    if (block != null) {
      LOG.warn("Failed to save replica " + block + ". Re-enqueueing it.");
      // Re-insert into the queue
      ramDiskReplicaTracker.reenqueueReplicaNotPersisted(block);
    }
  }

Other methods, such as removeVolume, are relatively simple to implement and are not covered here. The following shows the general structure of RamDiskAsyncLazyPersistService:

Based on the three parts above, this article has described: the FIFO (first-in, first-out) order in which in-memory data blocks are persisted under LAZY_PERSIST, the internal running logic of the asynchronous persistence service, and the LRU algorithm used to evict replica blocks and reserve memory space.

Using LAZY_PERSIST Memory Storage

Having covered the principles above, we finish with the specific configuration.
First of all, using the LAZY_PERSIST memory storage policy requires a corresponding storage medium; the medium type for memory storage is RAM_DISK.

So the first step is to configure the machine's prepared RAM disk (virtual memory disk) into the configuration item dfs.datanode.data.dir, tagged with [RAM_DISK].
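If the machine has no RAM disk yet, a common way to create one is to mount tmpfs, as shown in the Apache memory storage documentation linked below (the size value here is only illustrative):

    sudo mount -t tmpfs -o size=32g tmpfs /mnt/dn-tmpfs/

The mount point is then listed in dfs.datanode.data.dir with the [RAM_DISK] tag: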

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0,/grid/1,/grid/2,[RAM_DISK]/mnt/dn-tmpfs</value>
</property>

Note that the tag must be present; otherwise HDFS treats the volume as an ordinary DISK.

The second step is to set the policy type on the specific file, as described above.

Then come two caveats. First, make sure HDFS storage policies are not turned off; they are on by default, controlled by the configuration dfs.storage.policy.enabled. Second, make sure the configuration dfs.datanode.max.locked.memory is given a large enough value; it is the maximum amount of memory the DataNode may use for in-memory replicas. If it is too small, fewer blocks can be held in memory, and blocks beyond the limit are moved out directly. A hypothetical configuration entry is sketched below.
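As an illustration only, here is a hypothetical hdfs-site.xml entry for this setting (the property takes a size in bytes; the value below is an example meaning 32 GB):

    <property>
      <name>dfs.datanode.max.locked.memory</name>
      <!-- illustrative value: 32 GB expressed in bytes -->
      <value>34359738368</value>
    </property>

Note that this value cannot exceed the DataNode user's locked-memory ulimit (ulimit -l), or the DataNode will refuse to start.

Related Links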

1. http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/memorystorage.html
2. Baidu Encyclopedia: tmpfs
3. Baidu Encyclopedia: RAM disk
4. Baidu Encyclopedia: LRU algorithm
