Hadoop 2.4's DataNode multi-directory data copy storage policy


In Hadoop 2.0 there are two policies for selecting the disk on which a DataNode stores a data copy:

The first policy is the round-robin selection over the disk directories carried over from Hadoop 1.0. Implementation class: RoundRobinVolumeChoosingPolicy.java

The second policy selects a disk that has sufficient available space. Implementation class: AvailableSpaceVolumeChoosingPolicy.java

The configuration item that selects the policy is:

  <property>
    <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
    <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
  </property>

If this parameter is not configured, the first policy is used by default: disks are selected for new data copies in round-robin order. Round-robin guarantees that every disk gets used, but the amount of data stored on each disk often ends up unbalanced; some disks fill up completely while others still have plenty of free space. In a Hadoop 2.0 cluster it is therefore best to configure the second policy, which selects the disk for a data copy based on its remaining space. This still uses every disk, and it also keeps disk usage balanced.
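To make the contrast concrete, here is a minimal sketch of round-robin volume selection. It is an illustration only, not Hadoop's RoundRobinVolumeChoosingPolicy: the class name, the String-based volume list and the freeBytes array are invented for the example, while the real policy works on FsVolumeSpi objects.

  import java.io.IOException;
  import java.util.List;

  // Minimal round-robin volume chooser (illustrative only; names are hypothetical).
  class RoundRobinSketch {
    private int curVolume = 0; // index of the volume to try next

    // Return the next volume directory that still has room for a block of blockSize bytes.
    String chooseVolume(List<String> volumeDirs, long[] freeBytes, long blockSize)
        throws IOException {
      final int start = curVolume;
      while (true) {
        int candidate = curVolume;
        curVolume = (curVolume + 1) % volumeDirs.size();
        if (freeBytes[candidate] >= blockSize) {
          // Chosen purely by position, no matter how full it is compared to the others.
          return volumeDirs.get(candidate);
        }
        if (curVolume == start) { // wrapped around: every volume was tried
          throw new IOException("No volume has " + blockSize + " bytes free");
        }
      }
    }
  }

Because the choice ignores how full each disk already is, a small disk and a large disk receive roughly the same number of blocks, which is how the imbalance described above arises.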

When the second policy is used, two additional parameters come into play:

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold

The default value is 10737418240, that is, 10 GB. The default is usually fine. The following is the official explanation of this option:

This setting controls how much DN volumes are allowed to differ in terms of bytes of free disk space before they are considered imbalanced. If the free space of all the volumes are within this range of each other, the volumes will be considered balanced and block assignments will be done on a pure round robin basis.

In other words, two values are computed first: the maximum available space across all disks and the minimum available space across all disks. If the difference between them is smaller than the threshold given by this configuration item, the volumes are considered balanced and the round-robin policy is used to pick the disk for the data copy. The source code is as follows:

  public boolean areAllVolumesWithinFreeSpaceThreshold() {
    long leastAvailable = Long.MAX_VALUE;
    long mostAvailable = 0;
    for (AvailableSpaceVolumePair volume : volumes) {
      leastAvailable = Math.min(leastAvailable, volume.getAvailable());
      mostAvailable = Math.max(mostAvailable, volume.getAvailable());
    }
    return (mostAvailable - leastAvailable) < balancedSpaceThreshold;
  }
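As a quick illustration of this check (the numbers below are made up, not taken from Hadoop): with three volumes holding 95 GB, 90 GB and 88 GB of free space, the largest difference is 7 GB, which is below the default 10 GB threshold, so the policy falls back to plain round-robin for these volumes.

  public class ThresholdCheckExample {
    public static void main(String[] args) {
      final long GB = 1024L * 1024L * 1024L;
      long[] availableBytes = {95 * GB, 90 * GB, 88 * GB}; // hypothetical free space per volume
      long balancedSpaceThreshold = 10737418240L;          // the 10 GB default

      long leastAvailable = Long.MAX_VALUE;
      long mostAvailable = 0;
      for (long available : availableBytes) {
        leastAvailable = Math.min(leastAvailable, available);
        mostAvailable = Math.max(mostAvailable, available);
      }
      // Same comparison as areAllVolumesWithinFreeSpaceThreshold() above; prints "balanced = true".
      System.out.println("balanced = " + ((mostAvailable - leastAvailable) < balancedSpaceThreshold));
    }
  }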


dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction

The default value is 0.75f. The default is usually fine. The following is the official explanation of this option:

This setting controls what percentage of new block allocations will be sent to volumes with more available disk space than others. This setting should be in the range 0.0 - 1.0, though in practice 0.5 - 1.0, since there should be no reason to prefer that volumes with less available disk space receive more block allocations.

That is, it controls the fraction of new data copies that should be placed on the disks with more available space. The value must be between 0.0 and 1.0, and in practice between 0.5 and 1.0. If it is set too small, the disks with plenty of free space are not actually assigned enough data copies, while the disks with little free space have to store more of them, and disk usage becomes unbalanced again.
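The sketch below shows one way such a preference fraction can bias placement. It is a simplification with invented names, not the real AvailableSpaceVolumeChoosingPolicy, which additionally scales the probability by the size of each group and uses round-robin within a group.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  // Simplified illustration of biasing block placement by a preference fraction.
  class PreferenceFractionSketch {
    private final Random random = new Random();
    private final float preferenceFraction; // e.g. the 0.75f default

    PreferenceFractionSketch(float preferenceFraction) {
      this.preferenceFraction = preferenceFraction;
    }

    // Pick the index of the volume that should receive the next block.
    int chooseVolume(long[] freeBytes, long balancedSpaceThreshold) {
      long mostAvailable = 0;
      for (long free : freeBytes) {
        mostAvailable = Math.max(mostAvailable, free);
      }
      // Volumes within the threshold of the emptiest volume count as "high free space".
      List<Integer> highFree = new ArrayList<>();
      List<Integer> lowFree = new ArrayList<>();
      for (int i = 0; i < freeBytes.length; i++) {
        if (mostAvailable - freeBytes[i] < balancedSpaceThreshold) {
          highFree.add(i);
        } else {
          lowFree.add(i);
        }
      }
      // With probability preferenceFraction, send the block to the high-free-space group.
      List<Integer> group =
          (lowFree.isEmpty() || random.nextFloat() < preferenceFraction) ? highFree : lowFree;
      return group.get(random.nextInt(group.size()));
    }
  }

With the 0.75f default, roughly three quarters of new blocks go to the emptier disks, so the fuller disks gradually catch up instead of filling first.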

Refer:

http://www.thebigdata.cn/wap.aspx?nid=11668&cid=8&sp=2

