OpenStack Swift source code analysis: ring fundamentals and the consistent hashing algorithm

Source: Internet
Author: User

1. The basic concept of ring

The ring is the most important component in Swift. It records the mapping between stored objects and their physical locations. Whenever a user operates on an account, container, or object, the corresponding ring file must be consulted (accounts, containers, and objects each have their own ring). The ring maintains this mapping using regions (added in recent versions), zones, devices, partitions, and replicas. For each object, the cluster stores as many copies as the replica count configured when Swift was deployed. The ring files are created at deployment time; in the demo deployment from my previous blog post, for example, they are stored in /etc/swift.

2. Ring fundamentals and the consistent hashing algorithm

2.1 Consistent hashing algorithm

Swift uses a consistent hashing algorithm to build a redundant, scalable distributed object storage cluster. The main purpose of consistent hashing in Swift is that, when the number of devices in the cluster changes, the existing mapping between keys and devices changes as little as possible. This article gives my own understanding of consistent hashing; for a more detailed treatment, see the references at the end.

The detailed implementation of consistent hashing in Swift's code will be described in a later blog post.

2.1.1 The ring space


Figure 1 The ring space

As Figure 1 shows, the hash algorithm maps values into a space ranging from 0 to 2**32-1.

2.1.2 Object mapped to ring

Suppose there are four objects. The hash function computes a key for each object, and the keys are distributed on the ring as shown in Figure 2.

Swift uses the MD5 hash algorithm to hash each object according to its name:

Hash(object1) = key1;

...

Hash(object4) = key4;

Figure 2 The key value of object is mapped to the ring
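
The mapping from an object name to a ring position can be sketched in Python as follows. This is only an illustration, not Swift's code: the function name ring_position is made up for this example, and taking the first 4 bytes of the MD5 digest as the position is an assumption chosen to match the 0 to 2**32-1 ring space described above.

```python
import hashlib

RING_SIZE = 2 ** 32  # the ring covers the values 0 .. 2**32 - 1


def ring_position(name: str) -> int:
    """Hash a name with MD5 and map it to a position on the ring.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Interpret the first 4 bytes (32 bits) of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big")


for name in ["object1", "object2", "object3", "object4"]:
    print(name, ring_position(name))
```

The same name always yields the same position, which is what lets every node in the cluster agree on where an object lives without exchanging any state.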

2.1.3 Storage device mapped to ring

Storage devices are mapped onto the ring with the same hash algorithm. Suppose there are three devices, Device1, Device2, and Device3, whose hash values are dev1, dev2, and dev3. Their distribution on the ring is shown in Figure 3.

Figure 3 Device mapping to ring


2.1.4 Mapping of objects and devices

Devices and objects have now both been mapped onto the ring with the hash algorithm. How, then, is an object mapped to the device that will store it? Each object starts from its own position on the ring and moves clockwise until it meets a device, and the object is stored on that device. In the figure, key1 maps to dev2, key4 maps to dev3, and key3 and key2 map to dev1. Now suppose a new device, Device4, is added. Its position on the ring is shown below.


Figure 4 The ring after the new device

With the new device Device4, the only object whose mapping changes is the one corresponding to key3: searching clockwise, it now finds Device4 instead of Device1. All other data stays where it is, which reduces data migration.
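
The clockwise lookup and the effect of adding a device can be sketched as below. This is a minimal illustration of plain consistent hashing, not Swift's ring implementation; the HashRing class, the device names, and the object names are all hypothetical.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a name to a position in 0 .. 2**32 - 1 (hypothetical helper)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Toy consistent hash ring: one point per device, no virtual nodes."""

    def __init__(self, devices):
        # Sorted list of (position, device) pairs.
        self._points = sorted((ring_position(d), d) for d in devices)

    def add(self, device):
        bisect.insort(self._points, (ring_position(device), device))

    def lookup(self, key):
        # Find the first device at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, (ring_position(key), ""))
        return self._points[idx % len(self._points)][1]


keys = [f"object{i}" for i in range(1000)]
ring = HashRing(["device1", "device2", "device3"])
before = {k: ring.lookup(k) for k in keys}

ring.add("device4")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to device4")
```

Every key that changes owner moves to the newly added device; keys elsewhere on the ring keep their original device, which is the property the text describes.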

2.1.5 Virtual Node

Balance is an important measure of a hash algorithm: stored objects should be spread as evenly as possible across all devices. Suppose most object hashes fall in the arc running clockwise from key4 to key2. Under the object-to-device mapping described above, Device2 would then store very few objects, while the other three devices would bear a relatively heavy load. Such a design clearly does not meet practical needs. Introducing virtual nodes solves this problem well.

A virtual node is a replica of a physical node in the ring space. Each physical node corresponds to several virtual nodes. With virtual nodes, the object-to-device mapping becomes an object-to-virtual-node-to-device mapping, and virtual nodes are placed in the ring space by their hash values.

Suppose there are 20 virtual nodes, evenly distributed on the ring, with 5 virtual nodes per device. Mapping an object to a device then becomes the process shown in Figure 5:

Figure 5 Object-Virtual node-device mapping process
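
The object-to-virtual-node-to-device mapping can be sketched as follows. The sketch assumes that each virtual node is placed by hashing a made-up label such as "device1-vnode0", so the spread is only approximately even; build_ring, lookup, and the naming scheme are hypothetical and are not taken from Swift.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(name):
    """Map a name to a position in 0 .. 2**32 - 1 (hypothetical helper)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")


def build_ring(devices, vnodes_per_device):
    """Return a sorted list of (position, device) pairs, one per virtual node."""
    points = []
    for dev in devices:
        for i in range(vnodes_per_device):
            # Each virtual node gets its own position but points at its device.
            points.append((ring_position(f"{dev}-vnode{i}"), dev))
    points.sort()
    return points


def lookup(points, key):
    """Object -> first virtual node clockwise -> owning device."""
    idx = bisect.bisect_left(points, (ring_position(key), ""))
    return points[idx % len(points)][1]


devices = ["device1", "device2", "device3", "device4"]
ring = build_ring(devices, vnodes_per_device=5)  # 4 devices x 5 = 20 virtual nodes
counts = Counter(lookup(ring, f"object{i}") for i in range(10000))
print(counts)
```

Raising vnodes_per_device generally evens out the per-device counts, at the cost of a larger lookup table.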

2.2 Ring Fundamentals

2.2.1 Replica

If the cluster held only one copy of each piece of data, any failure could cause permanent data loss, so redundant copies are needed to keep data safe. Swift introduces the concept of the replica, with a default value of 3; the exact number can also be specified at deployment time. The theoretical basis is mainly the NWR strategy (also called the quorum protocol).

NWR is a strategy for controlling the consistency level in distributed storage systems. Amazon's Dynamo cloud storage system uses NWR to control consistency.

Here N is the number of replicas of a piece of data, W is the number of replicas that must be updated successfully for a write to succeed, and R is the number of replicas that must be read for a read. The condition W+R>N ensures that a piece of data is not read and written by two different transactions at the same time. The condition W>N/2 ensures that two transactions cannot write the same data concurrently.

In a distributed system, a single point of failure for data is not acceptable. When the number of replicas still online drops to 1, the situation is critical: one more failure causes permanent data loss.

If N were set to 2, the failure of a single storage node would leave a single point of failure, so N must be greater than 2. The higher N is, however, the higher the maintenance cost and overall cost of the system.

Industry typically sets N to 3.
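
The two NWR conditions can be checked mechanically. The helper below is a small illustrative sketch; the function name check_nwr and the result keys are invented for this example.

```python
def check_nwr(n, w, r):
    """Evaluate the two quorum conditions for N replicas, W writes, R reads."""
    return {
        # W + R > N: every read set overlaps every write set.
        "read_write_overlap": w + r > n,
        # W > N / 2: any two write sets overlap (written as 2W > N to stay in integers).
        "write_write_overlap": 2 * w > n,
    }


for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
    print((n, w, r), check_nwr(n, w, r))
```

With N=3, choosing W=2 and R=2 satisfies both conditions, while W=1 and R=1 satisfies neither.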

2.2.2 Zone

Suppose all devices sit in one rack or one machine room. A power outage or network failure would then leave users unable to access their data. A mechanism is therefore needed to isolate machines by physical location to provide partition tolerance (the P in the CAP theorem). For this reason the ring introduces the concept of the zone and assigns the cluster's devices to zones.

In Swift, replicas of the same partition cannot be placed in the same zone, which means the copies of any piece of data are distributed across different zones. Recent versions of Swift introduce the region, a concept larger than the zone.
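
The zone constraint can be expressed as a simple check over a replica assignment table. The sketch below assumes a table with one row per replica and one column per partition, each cell holding a device ID; zone_violations, the table, and the device-to-zone mapping are all hypothetical and only illustrate the rule.

```python
def zone_violations(assignment, dev_zone):
    """Return the partitions that have two or more replicas in the same zone.

    assignment: list of rows, one per replica; row[p] is the device ID for partition p.
    dev_zone:   mapping from device ID to zone ID.
    """
    bad = []
    for part in range(len(assignment[0])):
        zones = [dev_zone[row[part]] for row in assignment]
        if len(set(zones)) < len(zones):
            bad.append(part)
    return bad


dev_zone = {0: 1, 1: 2, 2: 3}          # three devices, one per zone
good = [[0, 1, 2, 0], [1, 2, 0, 1], [2, 0, 1, 2]]
broken = [[0, 1, 2, 0], [0, 2, 0, 1], [2, 0, 1, 2]]  # partition 0 uses device 0 twice

print(zone_violations(good, dev_zone))    # no partition shares a zone
print(zone_violations(broken, dev_zone))  # partition 0 is reported
```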

2.2.3 Weight

Each device in the ring has a weight. Weights matter when devices differ in capacity: if some devices in a cluster hold 1 TB and others 2 TB or 3 TB, we want the larger devices to have a greater chance of being chosen and to store more objects. Devices with more storage capacity are given greater weight and therefore a larger parts_wanted, so more virtual nodes map to them. With more virtual nodes, objects are more likely to land on one of that device's virtual nodes, and the device receives more objects.
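
The effect of weight can be illustrated by dividing the partition replicas among devices in proportion to their weights. This is a simplified sketch of the idea, not the ring builder's actual calculation; desired_parts and the device names are invented for the example.

```python
def desired_parts(weights, part_power, replicas):
    """Split replicas * 2**part_power partition replicas in proportion to weight."""
    total_parts = replicas * 2 ** part_power
    total_weight = sum(weights.values())
    return {dev: total_parts * w / total_weight for dev, w in weights.items()}


# Hypothetical devices of 1 TB, 2 TB and 3 TB, weighted by capacity.
weights = {"dev_1T": 1.0, "dev_2T": 2.0, "dev_3T": 3.0}
shares = desired_parts(weights, part_power=8, replicas=3)
print(shares)
```

With a partition power of 8 and 3 replicas there are 768 partition replicas to place, so the 3 TB device is asked to hold three times as many as the 1 TB device.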

2.2.4 Data Migration Time

A minimum time between migrations of the same data can be set when deploying Swift. In the deployment from my last blog post, for example, it was set to 1 hour (the final argument of the command):

swift-ring-builder object.builder create 18 3 1

3. Application of the consistent hashing algorithm in the ring

The previous two sections described the fundamentals of the ring and the principle of consistent hashing. This section explains how the ring uses consistent hashing.

The ring contains a replica2part2dev table (a replica-to-partition-to-device mapping). replica2part2dev is a list of sub-lists. When replica is a positive integer, there are replica sub-lists, each containing 2**power elements. When replica is a non-integer float greater than 1, there are int(replica)+1 sub-lists, and the last sub-list has length (replica-int(replica))*2**power. For example, in the deployment command in 2.2.4, power is 18 and replica is 3. Each element stores the ID of a device.
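
The shape of the table described above can be computed directly. The helper below is an illustrative sketch; sublist_lengths is a made-up name, and the fractional case simply follows the formula given in the text.

```python
def sublist_lengths(replicas, part_power):
    """Return the length of each sub-list in a replica2part2dev-style table."""
    parts = 2 ** part_power
    whole = int(replicas)
    lengths = [parts] * whole
    fraction = replicas - whole
    if fraction > 0:
        # The last, partial replica covers only a fraction of the partitions.
        lengths.append(int(fraction * parts))
    return lengths


print(sublist_lengths(3, 18))    # three full sub-lists of 2**18 elements
print(sublist_lengths(2.5, 8))   # two full sub-lists plus one half-length sub-list
```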

For example, with power 8, replica 3, and 3 devices in three zones, the replica2part2dev obtained after reassign_parts is shown below. The values 0, 1, and 2 are device IDs.

[[2, 2, 0, 0, 2, 0, 0, 2, 1, 0, 0, 0, 2, 0, 1, 2, 2, 0, 1, 0, 0, 0, 1, 2, 1, 1, 1, 2, 0, 2, 2, 0, 1, 0, 2, 1, 1, 0, 2, 0, 2, 1, 0, 1, 1, 0, 1, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0, 2, 1, 1, 2, 2, 2, 2, 1, 1, 0, 2, 1, 1, 0, 2, 0, 0, 1, 1, 1, 2 , 1, 0, 2, 0, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 0, 2, 1, 2, 0, 1, 1, 2, 2, 0, 2, 1, 2, 2, 0, 1, 1, 0, 0, 2, 1, 0, 2, 0, 2, 2, 0, 1, 0, 2, 0, 1, 1, 1, 0, 2, 2, 1, 1, 0, 0, 1, 1, 0, 1, 2, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 0, 0, 1, 0, 0, 0, 0, 2, 2, 2, 2, 0, 1, 0, 1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 2, 0, 2, 2, 0, 2, 0, 1, 0, 1, 2, 1 , 1, 0, 2, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 0, 1, 1, 0, 1, 2, 2, 2, 0, 2, 2, 0, 1, 2, 1, 2, 0, 0, 2, 0, 0, 1, 1, 0, 2, 2, 2, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1, 2, 1, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 1, 2, 0, 0, 0, 1, 0, 1, 1, 2, 2, 1, 0, 2, 2 , 2, 1, 0, 0, 1, 2, 2, 2, 0, 2, 2, 0, 0, 1, 0, 2, 1, 2, 2, 1, 1, 0, 0, 0, 2, 1, 0, 2, 2, 2, 1, 1, 1, 0, 0, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 0, 0, 0, 2, 2, 1, 0, 0, 0, 1, 1, 0, 1, 2, 0, 2, 1, 1, 1, 0, 0, 0, 0, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2, 0, 0, 2, 2, 1, 0, 0, 2, 1, 1, 2, 0, 2, 0, 1, 2, 2, 1, 2, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 2, 1, 1 , 0, 0, 1, 2, 0, 1, 2, 1, 1, 2, 0, 2, 0, 1, 1, 1, 2, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 1, 1, 0, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 0, 0, 0, 2, 0, 1, 2, 2, 2, 2, 0, 1, 0, 0, 2, 1, 1, 1, 2, 1, 0, 0, 2, 1, 0, 2, 2, 0, 2, 2, 1, 1, 1, 1, 2],
[0, 0, 2, 2, 0, 2, 2, 1, 0, 1, 2, 1, 0, 2, 2, 0, 0, 1, 2, 1, 2, 1, 2, 0, 0, 2, 2, 1, 2, 1, 0, 2, 0, 1, 0, 2, 0, 1 , 0, 2, 2, 2, 2, 0, 0, 1, 1, 1, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 2, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 2, 2, 2, 1, 0, 0, 0, 1, 2, 1, 2, 0, 2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 1, 1, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 2, 2, 2, 1, 1, 2, 2, 0, 1, 0, 1, 2, 0, 2, 2, 0, 2, 1, 2, 0, 1, 1, 2, 1, 2, 2, 0, 0, 0, 1, 1, 0, 0, 2, 2, 0, 2, 2 , 1, 1, 0, 1, 2, 2, 0, 0, 0, 0, 2, 0, 2, 2, 0, 0, 0, 2, 2, 2, 0, 1, 2, 0, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 2, 2, 2, 0, 1, 0, 1, 0, 0, 1, 2, 0, 1, 1, 1, 0, 0, 2, 0, 0, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 0, 0, 0, 2, 0]

Once replica2part2dev is available, storing an object proceeds as follows. First, the hash of the object's name, account/container/object, is computed. The first 4 bytes (32 bits) of the hash are taken and shifted right by 32-power bits; the result is the partition number. The partition number is used to look up the IDs of the corresponding devices, and each device ID is used to fetch the device's other properties, which include its IP address, port, and other information. HTTP connections are then opened to the three devices. Because each of them runs object-server, they receive the requests and write the data to the device. The execution mechanism of the ring is shown in Figure 6.

Figure 6 The execution mechanism of the ring
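
The lookup path described above (name, MD5, first 4 bytes, right shift, device IDs, device properties) can be sketched as follows. This is a simplified illustration and not Swift's code: the function names, the tiny 4-partition table, and the device records with their addresses and ports are all made up, and the sketch ignores any extra salt values that a real deployment may mix into the hash.

```python
import hashlib
import struct


def get_partition(account, container, obj, part_power):
    """Hash /account/container/object and keep the top part_power bits."""
    path = f"/{account}/{container}/{obj}"
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # First 4 bytes as a big-endian unsigned 32-bit integer, shifted right
    # by 32 - part_power bits.
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)


def get_nodes(replica2part2dev, devs, part):
    """Return one device record per replica for the given partition."""
    return [devs[row[part]] for row in replica2part2dev]


# Hypothetical ring with part_power = 2 (4 partitions) and 3 replicas.
part_power = 2
replica2part2dev = [[0, 1, 2, 0], [1, 2, 0, 1], [2, 0, 1, 2]]
devs = [
    {"id": 0, "zone": 1, "ip": "10.0.0.1", "port": 6000, "device": "sdb1"},
    {"id": 1, "zone": 2, "ip": "10.0.0.2", "port": 6000, "device": "sdb1"},
    {"id": 2, "zone": 3, "ip": "10.0.0.3", "port": 6000, "device": "sdb1"},
]

part = get_partition("AUTH_demo", "photos", "cat.jpg", part_power)
nodes = get_nodes(replica2part2dev, devs, part)
for node in nodes:
    print(part, node["ip"], node["port"], node["device"])
```

Because the partition is derived from the top bits of the hash, the table never needs to store object names; it only needs one device ID per replica per partition.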

References:

1. http://www.ibm.com/developerworks/cn/cloud/library/1310_zhanghua_openstackswift/

2. http://blog.csdn.net/sparkliang/article/details/5279393?reload
