The Story of Cloud Storage: The Return of Metadata


The Return of Metadata

Mo Huafeng

Cloud storage services are an important part of cloud computing. Technically, cloud storage is large-scale, distributed, online storage, and it is a special kind of shared storage. As a service that provides storage resources, cloud storage must ensure that the data users store is reliable and never lost. It must also be online at all times, because any downtime causes losses for users. The basic requirements of cloud storage are therefore high reliability and high availability. In addition, cloud storage holds massive amounts of data, so it must be large in scale. And for reasons of cost and cash flow, the cluster must be able to grow along with the volume of user data, so it must be scalable. The architecture, design, and choice of technology in cloud storage all follow from these four basic requirements. Conversely, no matter how elegant or advanced a technology is, if it may compromise these goals it cannot be used in cloud storage.

When I first started working on storage, consistent hashing (and the famous Dynamo) was a very popular technology. Technically, consistent hashing is elegant, concise, and efficient. In practical use, however, it behaves quite differently. This article compares the centralized metadata storage scheme with consistent hashing, in order to show that metadata is better suited to cloud storage.

1. Object Storage, Block Storage

In practice, cloud storage comes in two types: object storage and block storage. Object storage is a data warehouse in the literal sense: it stores only key/value data. A user has a data object to store, gives the object a name (the key), and saves the object together with its name in object storage. Later, the user asks the storage system for the object using that name as the key. The object storage system must return the data on request, unless the user has already deleted it.

Block storage acts as a block device (in plain terms, a disk) under the operating system. It is essentially a storage area network (SAN): it allocates the cluster's storage space to users and mounts it in the operating system as a disk. Because block storage has to emulate disk behavior, it must guarantee low latency.

Although the two types of cloud storage have completely different goals, uses, and characteristics, they face the same problems when it comes to the fundamentals of distributed storage, so the discussion here applies to both. For convenience, we discuss only object storage, but much of the content and many of the conclusions extend to block storage.

2. Storage Basics

The function of cloud storage is simple: store user data. Simple as it is, several key points still have to be handled well. When a user uploads a key-value pair, the storage system must find a suitable server to save the data. Usually the data is saved on several servers to prevent loss. This is called multi-copy storage.

A key question, therefore, is how to choose the servers that store the data. Server selection takes real skill, and several points must be considered. First, data must be balanced across servers; it must not pile up on a few servers so that some are overloaded to death while others starve. Second, the data must be easy and fast to locate when it is read. Third, the choice must satisfy the high reliability, high availability, and large scale required of a cloud storage service. Finally, it should be as simple as possible.

Each object therefore has a mapping from its key to the location where its data is stored: key -> pos. There are many ways to implement this mapping. The most direct is to save the key -> pos pair of every object. This data is usually called "metadata".

There are also cleverer approaches: divide the key space into groups based on some property of the key, and map these groups to different storage nodes. This approach is generally called "sharding". With it, a server can be located directly by a simple rule. One common grouping method divides keys by range; for example, keys starting with a form one group, keys starting with b form another, and so on. A more "modern" method hashes the key and takes the result modulo the number of groups. The hash scheme is really a natural extension of the hash table, with the buckets spread across multiple servers.

The two methods are essentially mappings at different granularities. "Metadata" maps at the granularity of a single object, while sharding maps at the granularity of a group of objects. This difference in granularity gives them completely different characteristics and determines how they behave in practice.
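To make the three ways of mapping a key to a location concrete, here is a minimal Python sketch. The node names, the range boundaries, and the choice of MD5 are illustrative assumptions, not part of any particular system.

```python
import bisect
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical storage nodes

# 1. Object-level mapping ("metadata"): one stored entry per object.
metadata = {}

def metadata_put(key, node):
    metadata[key] = node

def metadata_locate(key):
    return metadata[key]

# 2. Range sharding: the first letter of the key selects the group.
# First letters in [a, i) go to node-0, [i, r) to node-1, the rest to node-2.
RANGE_BOUNDS = ["i", "r"]

def range_locate(key):
    return NODES[bisect.bisect_right(RANGE_BOUNDS, key[:1].lower())]

# 3. Hash sharding: hash the key, then take it modulo the number of nodes.
def hash_locate(key):
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]
```

Only the first variant stores anything per object; the other two compute the location from the key and a small fixed rule.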

3. Metadata and consistent hash

Cloud storage designs therefore fall into two major schools: the metadata model and the sharding model. Among sharding models, consistent hashing is the most popular. Consistent hashing is hard to use directly in a real system, which has led to many derived schemes, including the famous Dynamo. Here, "consistent hash scheme" refers to all designs based on consistent hashing.

The metadata scheme is an object-level key -> pos mapping, that is, a "map" that grows without end. Every additional object adds one more metadata entry. Metadata is usually kept in a group of databases for easy retrieval and querying. There is nothing special about the metadata scheme; its core is the metadata store, whose design shapes the characteristics of the whole system. The design of metadata storage is not the focus of this article, however. This article focuses on comparing the metadata scheme with the consistent hash scheme.

In the standard consistent hash model, keys are hashed and projected onto a circular numeric space. Each node (storage server) is likewise given an identifier, hashed, and projected onto the same hash ring. In theory, as long as the hash algorithm is suitable, the nodes are evenly distributed around the ring. A node owns a range of hash values determined by its position on the ring, for example the interval between that node and the next one. Every key that falls into this interval is stored on that node.

In this model, the mapping from a key to its logical storage location is computed by an algorithm rather than stored. However, the conversion from the logical location (a position on the hash ring) to the physical location (a node) cannot be obtained directly. The textbook approach is to pick any node and then search node by node for the target, or to hop between nodes in a binary-search fashion. In a real storage system this kind of search is intolerable. A practical storage system therefore usually uses a hybrid model (called the "hybrid scheme" below): the system maintains a mapping table from hash range to node. This mapping is essentially a kind of metadata, but coarse-grained metadata, more like a routing table. Because it is coarse and changes rarely, it can even be kept in a text file. The mapping table also needs multiple copies; it is usually stored on the entry servers or distributed to all storage nodes. The key is to keep the copies consistent, which is relatively easy.
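The hash range -> node table described above can be sketched as a sorted list searched with binary search. This is a minimal illustration, not Dynamo's implementation; the class name, the 32-bit ring, and the use of MD5 are assumptions made for the example.

```python
import bisect
import hashlib


def ring_hash(value):
    """Project a string onto a 32-bit circular hash space."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Coarse-grained routing table: sorted ring positions and their nodes."""

    def __init__(self, nodes):
        self._points = sorted((ring_hash(node), node) for node in nodes)
        self._hashes = [point for point, _ in self._points]

    def locate(self, key):
        # A node owns the interval from its own position up to the next node.
        # bisect_right - 1 finds the last node at or before the key's hash.
        # If the key hashes below every node, the index is -1, which selects
        # the last node in the list: its interval wraps around the ring.
        index = bisect.bisect_right(self._hashes, ring_hash(key)) - 1
        return self._points[index][1]
```

A lookup is `HashRing(["node-a", "node-b", "node-c"]).locate("photo.jpg")`; the table is small enough to copy to every entry server.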

Consistent hashing solves a problem of the ordinary hash table: when the table is resized, every key must be rehashed and all data migrated. With consistent hashing, only the data in the hash range taken over by the newly added node has to be migrated. However, as new nodes are added, the distribution of data becomes uneven. To solve this, many models, including Dynamo, adopt "virtual nodes": each node is split into several virtual nodes. When a node joins the system, its virtual nodes are scattered around the hash ring, so the node is added more "evenly".
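The difference between plain modulo hashing and consistent hashing with virtual nodes can be measured with a small experiment. The sketch below is illustrative only: the node names, the 100 virtual nodes per server, and the 5000 sample keys are arbitrary assumptions.

```python
import bisect
import hashlib


def h(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def modulo_owner(key, nodes):
    return nodes[h(key) % len(nodes)]


def build_ring(nodes, vnodes=100):
    """Each physical node is projected onto the ring as `vnodes` virtual nodes."""
    return sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))


def ring_owner(key, ring, hashes):
    # Last virtual node at or before the key's hash; index -1 wraps around.
    return ring[bisect.bisect_right(hashes, h(key)) - 1][1]


keys = [f"object-{i}" for i in range(5000)]
old_nodes = [f"node-{i}" for i in range(4)]
new_nodes = old_nodes + ["node-4"]  # expand the cluster by one server

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
old_hashes = [point for point, _ in old_ring]
new_hashes = [point for point, _ in new_ring]

modulo_moved = sum(
    modulo_owner(k, old_nodes) != modulo_owner(k, new_nodes) for k in keys
) / len(keys)
ring_moves = [
    (ring_owner(k, old_ring, old_hashes), ring_owner(k, new_ring, new_hashes))
    for k in keys
]
ring_moved = sum(before != after for before, after in ring_moves) / len(keys)
```

With modulo hashing, most keys change owner when the fifth node is added; on the ring, only roughly the new node's share moves, and every moved key lands on the new node.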

Consistent hashing and its derived schemes shard data according to some rule and then map the shards to storage servers according to some rule. They only need to maintain an algorithm, or a simple mapping table, to locate data directly. This is simpler, and because no metadata lookup is needed, performance is better.

However, this is only theoretical.

If we consider only the storage functions (get, put, delete), consistent hashing is perfect. In reality, however, the architecture of cloud storage is dominated by non-functional requirements:

1. Large scale and scalability

2. Reliability and consistency

3. Availability and manageability

4. Performance

In a real cloud storage system, non-functional requirements drive the architecture and design. They also play a decisive role in the choice of key -> pos mapping.

4. Scale and Expansion

First, the most obvious feature of cloud storage is its sheer scale. A public cloud storage service places no upper limit on the user data it holds; it cannot refuse to store a user's data on the grounds that "capacity is full". In this sense cloud storage is "unlimited". In other words, the cloud storage system must be able to expand at any time without affecting service.

On the other hand, a cloud storage service grows from small to large. Deploying thousands of servers with tens of PB of capacity at the very start makes no sense: resources would sit idle and be wasted, with a serious negative effect on cost and cash flow. Normally only a small system is deployed to meet the initial capacity requirement, and it is then scaled up as demand grows.

Therefore, the cloud storage system must be highly scalable.

Expansion poses no difficulty for the metadata scheme. When nodes are added, the system can direct more of the new write requests to them. A suitable scheduling and balancing policy keeps the storage space of all nodes evenly used.
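One possible balancing policy is to place each new object on the nodes with the most free space, so that a freshly added node absorbs new writes without any existing data moving. The sketch below assumes this greedy policy; the function names and the in-memory dictionaries are hypothetical, and a real scheduler would also weigh load, racks, and failure domains.

```python
def choose_nodes(free_space, replicas=3):
    """Pick the `replicas` nodes that currently have the most free space."""
    if len(free_space) < replicas:
        raise ValueError("not enough nodes for the requested replica count")
    return sorted(free_space, key=free_space.get, reverse=True)[:replicas]


def put(key, size, free_space, metadata, replicas=3):
    """Place an object and record where its copies live (the metadata entry)."""
    targets = choose_nodes(free_space, replicas)
    for node in targets:
        free_space[node] -= size
    metadata[key] = targets
    return targets
```

For example, with `free_space = {"old-1": 10, "old-2": 12, "old-3": 11, "new-1": 100}`, the newly added `new-1` receives a copy of every new object until its free space drops to the level of the others.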

Consistent hashing and its derived schemes have a much harder time. When an ordinary hash table is resized, it must be rehashed; in a hash-based distributed storage system, this means migrating all the data, which is of course intolerable. Consistent hashing was invented to solve this problem. Because both objects and servers are mapped onto the hash ring, a newly added node is mapped onto the ring as well. It cuts an existing segment in two and takes over the range of hash values it cuts off. To complete the change, the data in that range must be migrated to the new server. Compared with plain hashing, consistent hashing only has to move the portion of data taken over by the new server.

But data still has to be migrated, and data migration is not as simple as copying data from one server to another. Cloud storage is an online system, and migration must not affect service. Yet migration takes time. To keep data accessible throughout, writes are directed to the target server from the start of the migration, and the portion to be migrated on the source node is switched to read-only. Data migration from the source node starts at the same time. A read within the range being migrated has to try both the source and the target node. This single-write, dual-read mode keeps the service unaffected. The migration rate must be throttled, however: if disk throughput is saturated or network bandwidth is exhausted, service will inevitably suffer.
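The single-write, dual-read mode can be sketched as follows. This is a simplified illustration under stated assumptions: the class name is hypothetical, the nodes are plain dictionaries, reads try the target first because new writes land there, and deletes and rate limiting are left out.

```python
class MigratingRange:
    """One hash range while it is being moved from a source to a target node."""

    def __init__(self, source_data):
        self.source = dict(source_data)  # read-only for the whole migration
        self.target = {}

    def put(self, key, value):
        # Single write: new data goes only to the target node.
        self.target[key] = value

    def get(self, key):
        # Dual read: the target may hold a newer write, so try it first,
        # then fall back to the not-yet-migrated copy on the source.
        if key in self.target:
            return self.target[key]
        return self.source[key]

    def migrate_one(self, key):
        # Copy an object across unless a newer write already replaced it.
        self.target.setdefault(key, self.source[key])
```

Once `migrate_one` has been called for every key in the source, the range can be served from the target alone and the source space reclaimed.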

Another problem is that, unless many nodes are added at once, the data becomes unbalanced. Rebalancing requires migrating more data: every existing node has to migrate some data out so that all nodes end up holding similar amounts. Virtual nodes serve this purpose. As a result, the volume of data actually migrated is proportional to the added capacity, with the current space utilization of the storage system as the coefficient.

Therefore, for a 1 PB system with three copies and 70% utilization, expanding capacity by 200 TB requires migrating about 140 TB x 3 = 420 TB of data to rebalance storage. With commonly used storage servers (2 TB x 12 disks), 21 more servers have to be added. If all of them take part in the migration concurrently and the migration rate of each is held to 50 MB/s, the expansion takes 420 TB / (50 MB/s x 21) = 400000 seconds, about 4.6 days. And that is the ideal case; hardware and software failures, user access load, and post-migration verification all lengthen the migration. It may well take ten days to half a month. Until the data migration is complete and the space on the old servers has been reclaimed, the system has not actually been expanded. So if the system is about to run out of space when expansion starts (do not say this cannot happen: a sluggish supplier or a flooded factory can cause exactly this, and a cloud service will run into everything sooner or later), the system may stall because the expansion cannot finish in time. Cloud storage, and public cloud storage in particular, must be able to expand quickly and conveniently.
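The arithmetic in this example can be checked with a few lines of Python. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and reads the per-server rate as 50 MB/s, which is the reading that yields the 400000 seconds quoted in the text.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6

added_capacity_tb = 200      # usable capacity being added
utilization_percent = 70     # current space utilization
copies = 3                   # replicas per object
new_servers = 21             # servers taking part in the migration
rate_mb_per_s = 50           # throttled migration rate per server

# Data to move = added capacity x utilization x number of copies.
data_to_move_tb = added_capacity_tb * utilization_percent * copies // 100

seconds = data_to_move_tb * TB / (rate_mb_per_s * MB * new_servers)
days = seconds / 86400
```

Halving the throttle to 25 MB/s doubles the time, which is why real-world disturbances so easily stretch the migration to ten days or more.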

Things get more complicated still when errors, hardware failures, or other exceptions occur during migration. Because the data distribution is in an intermediate state, the subsequent handling must preserve the safety and consistency of the system's data.

The larger the system, the harder this becomes. When the system reaches the PB scale (the scale many systems claim to handle) and user data is growing rapidly (public cloud storage shows a marked Matthew effect: the larger the scale, the faster the growth), migrating PBs of data across hundreds or thousands of servers is a frightening scenario.

Data migration consumes network bandwidth, loads the disks, disrupts server caches, and so on. It is a major taboo in cloud storage.

The metadata scheme generally needs no migration. Migration happens only when old storage servers are replaced or when leased servers expire and are returned. Because a data object can be placed on any node, the data to be moved off one node can be spread across the other nodes. The data can also be copied from the other replicas in a many-to-many concurrent fashion. The load is spread over the entire cluster, so the impact on service is smaller and the process is faster. In fact, this is the same logic as replica repair.

Clearly, compared with the consistent hash scheme, the metadata scheme provides a dynamic balancing mechanism that needs no data migration. A server takes effect as soon as it joins the cluster, so expansion is calm and unhurried.

5. Reliability and consistency

Reliability (in this article, specifically data reliability, meaning that data is never lost) is without doubt the foundation of cloud storage. Users entrust their data to you and naturally do not want it lost. Maintaining reliability is the hardest part of cloud storage. (Maintaining high availability while keeping high reliability is, of course, harder still.)

In any system, neither hardware nor software can be fully reliable. Chips burn out, circuits short, voltages fluctuate, mice chew through network cables, power fails, software has bugs, and even cosmic rays disturb registers. Hard disks, the core component of storage, are even more fragile. In a standard server, apart from the optical drive and the fans, the hard disk is the only electromechanical component. Because it has moving parts, it is less reliable than solid-state circuitry and more susceptible to outside interference. Hard disks are therefore often regarded as consumables. In a storage cluster with tens of thousands of hard disks, losing several disks a week is nothing unusual. With bad luck, two or three may fail in a single day.

Keeping data on a single disk therefore cannot guarantee its reliability. Normally we store multiple copies of the data on different servers, the so-called "multi-copy" approach. In principle, the more copies, the more reliable. But too many copies raise storage cost, reduce performance, and make consistency harder to maintain. Three copies is the usual choice, a balanced number.

For a fixed number of replicas, however, what really determines reliability is how quickly a lost replica is repaired. For example, in a three-copy storage system, when a disk fails, the data objects it held are left with only two copies. If another disk fails before the first is repaired, and it happens to share data objects with the unrepaired disk, those objects are down to a single copy, which is a very dangerous state. What is more, even when no disk has failed, a three-copy storage system always has some objects in a two-copy state for one reason or another, waiting to be repaired. For example, one copy may fail its checksum after being written and have to be rewritten. If a disk holding such two-copy objects fails at this point, those objects are left with a single copy from that moment on. If the disk holding that last copy also fails before the data from the first failed disk is repaired, the data is lost. The probability is very small, but a cloud storage system runs for years; given large scale and long running time, any low-probability event will eventually happen. Moreover, in a production system many factors shorten the life of a hard disk below its theoretical value, such as rack vibration, unstable power, unexpected power loss, and radiation. And hard disks purchased in the same batch tend to fail within the same period, which greatly raises the probability of several disks failing together.

If data is repaired quickly enough that lost copies are restored before another disk fails, the probability of data loss drops sharply. (Strictly speaking, however short the repair time, there is always some probability of a second disk failing during the repair; but the shorter the repair time, the smaller that probability. The point is to make it small enough to accept.)

In the consistent hash scheme, suppose each disk is bound one-to-one to a hash interval. Then the only way to repair is to replace the failed disk, read the data from the other copies, and write it to the new disk. But the sustained write rate of a disk is typically only 50-60 MB/s. If a hard disk holding well over a terabyte of data fails, recovery takes at least 30000 seconds, nearly 9 hours. Allowing for overall server load and network conditions, the time may approach 12 hours. On this time scale, the probability that a second or third disk fails is quite high. In addition, how quickly the disk gets replaced depends on the responsiveness of data center operations, which is often a weak link.

If instead the scheme binds nodes one-to-one to hash intervals and spreads the data across the disks inside the node, then when a replica has to be restored, the node can place the data objects of the failed disk on its other disks. Repair can then start immediately, data is restored from the other copies to several disks concurrently, and the disk throughput limit is eased. But the problem is not fully solved. First, the network becomes the bottleneck. A gigabit network port delivers at most about 120 MB/s, only about twice the throughput of a disk, and in practice the bandwidth cannot all be used, or service will suffer; the other hard disks are still healthy, after all, and must keep working normally. In principle, throughput can be raised by adding network ports, but that raises the cost of network equipment, and it only buys a threefold or fourfold increase, whereas what we really need is to cut the recovery time by nearly a factor of ten. As for fiber-optic networking, it is beyond what an ordinary public cloud storage service can afford. Even if the network throughput problem is solved, a core problem remains. Because data is spread across the disks inside the node, the node has to maintain a key -> disk mapping, that is, local metadata inside the node. This metadata must be maintained by the server itself. With the limited resources of a single server, maintaining the reliability, availability, and consistency of metadata is very troublesome. Keeping one set of metadata is trouble enough, let alone one set per node. Once this metadata is lost, none of the data on the node can be found. In theory, losing it does not affect the data globally: the consistent hash algorithm can locate the other copies, and the lost data can be restored from them. But then all the data that was on the server has to be transferred, usually 10 or 20 TB, which makes recovery even harder.
A more practical approach is to scan the disks directly and rebuild the metadata from what is found. This is of course a long process, during which the whole node is unavailable, and writes made during that time still have to be repaired afterward (see the section on availability in this article).

The fastest way to recover a copy is to avoid binding the copies of a data object to any one node. As long as a data object can be stored on any node, replicas can be restored many-to-many between nodes. The larger the cluster, the faster the recovery and the better the effect. The metadata scheme can map an object to any node, or even any hard disk, which gives it a unique advantage in replica recovery. Consistent hashing strictly binds data to a particular node or disk, which rules out this kind of concurrency.
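A rough calculation shows why many-to-many recovery matters. The sketch below assumes a failed disk holding 1.5 TB (a hypothetical figure chosen for illustration), a per-disk write rate of 50 MB/s, decimal units, and no network bottleneck.

```python
def recovery_hours(data_tb, per_disk_mb_s, parallel_disks):
    """Time to rewrite lost copies when the writes are spread over several disks."""
    seconds = data_tb * 10**12 / (per_disk_mb_s * 10**6 * parallel_disks)
    return seconds / 3600


# Bound to one replacement disk, as in the one-to-one consistent hash case.
single_disk = recovery_hours(1.5, 50, parallel_disks=1)

# Spread over 100 disks across the cluster, as object-level metadata allows.
spread_out = recovery_hours(1.5, 50, parallel_disks=100)
```

Under these assumptions the single-disk case takes more than eight hours, while spreading the same data over 100 disks brings it down to minutes; the window in which a second failure can hit shrinks accordingly.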

Some derived schemes built on the hybrid scheme can solve consistent hashing's speed problem in replica repair. They divide the hash ring into slots (also called buckets, or similar names), far more numerous than the nodes or disks the cluster could ever have. (Yes, we said the scale is unlimited, but truly infinite is impossible; large enough will do, such as 2^32.) A mapping table maintains the mapping between slots and nodes or disks. The slot distribution is different in each replica cluster. When a disk fails, the system finds the slots it held and spreads them across other nodes and disks. Replicas can then be restored concurrently from other nodes, slot by slot. After the failed disk is replaced, some slots, either the original ones or a random selection, are migrated to the new disk to rebalance the overall data distribution. By then the replicas have already been restored, so the migration is under little time pressure; it can proceed slowly without running into the disk throughput bottleneck.
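The slot table can be sketched as a dictionary from slot number to disk. The example below is a toy illustration: 1024 slots instead of 2^32, a round-robin initial layout, and round-robin reassignment are all assumptions made to keep it short.

```python
import hashlib
from itertools import cycle

NUM_SLOTS = 1024  # far more slots than disks; real systems use many more


def slot_of(key):
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SLOTS


def build_slot_table(disks):
    """Initial slot -> disk mapping, spread round-robin over the disks."""
    return {slot: disks[slot % len(disks)] for slot in range(NUM_SLOTS)}


def fail_disk(slot_table, failed):
    """Reassign every slot of the failed disk to the surviving disks."""
    survivors = sorted(set(slot_table.values()) - {failed})
    targets = cycle(survivors)
    moved = {}
    for slot in sorted(slot_table):
        if slot_table[slot] == failed:
            slot_table[slot] = next(targets)
            moved[slot] = slot_table[slot]
    return moved  # each entry can be repaired concurrently from other replicas
```

Because the failed disk's slots end up on many different disks, the repair writes are spread across the cluster instead of landing on a single replacement disk.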

By contrast, though, the metadata scheme has no data to migrate after a replica is restored. In this respect, object-level metadata makes replica repair much simpler.

Consistency is closely tied to data reliability and can be seen as part of the reliability problem. When the copies of an object hold different versions, the current version of the object is effectively missing a copy. (From the storage point of view, different versions of a data object are in fact different data.) The most practical way to guarantee the consistency of a data object is W + R > N. That is, of the N replicas, a write of a given version must succeed on W of them and a read must succeed on R of them; if W + R > N, a read always sees the version that was successfully written.

Using W + R > N has a catch. All copies must be read concurrently, and the versions (or timestamps) of the objects read must then be compared to decide whether the consistency formula is satisfied. If the objects are tens or hundreds of MB, or even GB, reading all the copies in full only to keep one of them is a waste, and the load on the system is multiplied by N. The solution is to pre-read first: read the version information of all copies, compare for consistency, determine the valid copy, and only then read the data itself.
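The pre-read approach can be sketched in a few lines. This is a simplified model, not a production protocol: replicas are plain dictionaries, `None` stands for an unreachable replica, versions are integers, and failed partial writes are not rolled back.

```python
def quorum_write(replicas, key, version, data, w):
    """Write to every reachable replica; succeed if at least w acknowledge."""
    acks = 0
    for replica in replicas:
        if replica is not None:  # None models an unreachable replica
            replica[key] = (version, data)
            acks += 1
    return acks >= w


def quorum_read(replicas, key, r):
    """Pre-read only the versions, then fetch the data from one newest copy."""
    versions = []
    for index, replica in enumerate(replicas):
        if replica is not None:
            version = replica[key][0] if key in replica else -1
            versions.append((version, index))
    if len(versions) < r:
        raise RuntimeError("fewer than R replicas answered the version pre-read")
    newest_version, holder = max(versions)
    if newest_version < 0:
        raise KeyError(key)
    return replicas[holder][key][1]  # the only full data read
```

With N = 3 and W = R = 2, a version written while one replica was down is still returned by a later read that misses a different replica, because any two replicas overlap with the two that took the write.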

The metadata scheme handles consistency much more simply. Metadata also uses multiple copies for reliability. Because metadata is small, more copies can be kept, five or even seven. With that many copies, reliability is hardly a worry; the focus is on consistency. The W + R > N policy is used here as well, but reading the metadata and establishing consistency happen in a single access. The data storage servers then only have to guarantee the integrity of each copy and version.

Over time, stored data degrades: copies are lost for all sorts of reasons, and consistency degrades likewise. Hot data is accessed often, so errors in it are discovered quickly. Cold data has to be checked by periodic scans. In the metadata scheme, this check compares the metadata with the object list of each disk on each node. The metadata always holds the latest version, so any mismatch can be identified as an error and repaired immediately. In the consistent hash scheme, by contrast, the object lists of the three copies of a hash interval must be cross-compared to determine which copy of an object is the latest, and only then can the data be corrected.
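In the metadata scheme the periodic check reduces to a comparison against a single source of truth. A minimal sketch, assuming hypothetical data shapes (metadata maps a key to its current version and the locations of its copies; each disk reports the keys and versions it holds):

```python
def find_repairs(metadata, disk_listing):
    """Return (key, location) pairs whose copy is missing or out of date.

    metadata:     key -> (current_version, [locations holding a copy])
    disk_listing: location -> {key: version actually found on that disk}
    """
    repairs = []
    for key, (version, locations) in metadata.items():
        for location in locations:
            if disk_listing.get(location, {}).get(key) != version:
                repairs.append((key, location))
    return repairs
```

No cross-comparison between copies is needed: the metadata already says which version is current, so each disk is checked independently.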

When a node goes offline for any reason, writes to it during that period cannot be completed. The metadata scheme handles this easily: pick another suitable node, write the copy there, update the corresponding metadata entry, and the operation is done.

The consistent hash scheme is much more complicated. A copy of a hash interval is fixed on one node; in other words, a group of copies must be stored on a specific node and cannot be placed freely. Relocating it to another node means migrating the whole interval. As a consequence, when the node is offline, the corresponding copy can neither be written there nor be written to an arbitrary other node. There are two ways to handle this afterward. 1. Write the copy, or its key, to a queue. After the node recovers, start a repair that fetches the missing copies from the other replicas. The problem is that the queue must be sufficiently reliable; otherwise the keys awaiting repair are lost, and the corresponding objects stay one copy short for a long time, until the consistency check finds the problem. This adds to the burden of consistency checking and makes an already complex process worse. 2. Write the data to another node chosen by some rule, and migrate it back unchanged after the node recovers. This approach is relatively simple, but the scheduling logic is complicated and involves coordination between data nodes. Moreover, this point-to-point recovery puts pressure on the server that held the data temporarily, which is bad for stable operation. Either way, data migration or copying is unavoidable, and the exception handling and load control involved take a great deal of effort.

When a node fails, the metadata scheme is much simpler to handle than the consistent hash scheme. Node failure is the norm in cloud storage, so the simpler the handling, the better. In a large cluster, it is common for several nodes to be offline at once. Failure management as complex as consistent hashing's is hell for operations and maintenance (O&M).

6. Availability and manageability

Availability and reliability share some solutions. For example, multiple copies eliminate single points and improve availability. But in other respects they conflict. When a copy fails to be written, reliability calls for retrying, or simply telling the user the write failed. Either inevitably means slower responses or lower availability. (A response slowed beyond a certain point is treated as a failure, whether or not it eventually succeeds.) In addition, many measures that protect reliability, such as replica repair and data migration, consume system resources and thereby affect availability.

From the reliability analysis above we can see that, because copies are bound to specific nodes, the consistent hash scheme must give up some availability to achieve the same reliability as the metadata scheme, returning an error when a copy fails to be written. The metadata scheme places no restriction on where data is stored, so most failed copy writes can be corrected by selecting a different server. Consistent hashing has no such convenience: it either gives up some availability or takes a risk with reliability.

Fundamentally, consistent hashing creates a local, implicit single point in copy writing. Even though it is short-lived or temporary, it still affects the system. A well-designed distributed system minimizes any possible single point, because this bears directly on both the sustained and the momentary availability of the system. The metadata scheme ensures that data placement has no single point of failure of any form. For a given object, no node is special. Only this lack of distinction truly eliminates single points of failure.

In a cloud storage system, the R + W > N formula is a core factor in system availability. Availability and consistency are often at odds. Under this formula, a request must be sent to N servers at once, and whether it finally succeeds depends on the servers' responses. In general, the larger N is, the easier it is to maintain availability while ensuring consistency. (It actually depends on X = R + W - N; the larger the better. This value means that once X + 1 copies are offline or lost, the system can no longer guarantee consistency, temporarily or permanently.) However, the smaller N - R and N - W are, the greater the impact on availability. If N = R, the system cannot be read as soon as one server is offline. If N - R is 3, reads still succeed even with three servers offline. The same applies to W. Here consistency and availability are in conflict. When the number of replicas is large enough (N can be large), a high X is easy to obtain. For example, N = 7 and R = W = 5 give X = 3, N - R = 2, and N - W = 2, which means that even with two servers offline, reads and writes still succeed and consistency is preserved.

However, if N = 3, then R = 3 or W = 3 cannot be used under any circumstances. Although this gives a consistency level of X = 3, a single server going offline makes the system unavailable. If R = W = 2, the system remains available with one server offline, but the consistency level drops to X = 1.
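The trade-off in these examples follows directly from the three quantities X = R + W - N, N - R, and N - W. A small helper (the function and key names are hypothetical) makes the comparison explicit:

```python
def quorum_profile(n, r, w):
    """Consistency margin and offline tolerance for a given N, R, W."""
    if r + w <= n:
        raise ValueError("R + W must exceed N for consistent reads")
    return {
        "x": r + w - n,            # consistency margin X
        "read_tolerance": n - r,   # servers that may be offline for reads
        "write_tolerance": n - w,  # servers that may be offline for writes
    }
```

`quorum_profile(7, 5, 5)` gives X = 3 with two servers of tolerance on each side, while `quorum_profile(3, 3, 3)` gives the same X with no tolerance at all, and `quorum_profile(3, 2, 2)` trades X down to 1 for a tolerance of one server.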

In the metadata scheme, the metadata can have a larger number of copies (N), so copies dropping out or being lost through faults and exceptions has little effect on the availability and consistency of the system, and handling such problems is less urgent. In the consistent hash scheme, by contrast, consistency depends on the data storage nodes. Since data objects range from small to very large, many copies cannot be afforded; three is usual. With N this small, both consistency and availability are fragile.

The last and most important aspect of availability is operations and maintenance (O&M). The quality of O&M directly determines availability. As discussed above, consistent hashing involves many more steps than the metadata scheme at several key points of system maintenance. At the same level and intensity of O&M, consistent hashing is more prone to availability problems. It is hard to quantify this effect on availability precisely. But O&M is at the core of ensuring high availability in a cloud storage system, and the complexity of O&M usually determines the final number of nines.

Now let's look at the O&M characteristics of the metadata scheme. Compared with the standard consistent hashing scheme, it has an additional metadata storage system. Metadata usually has more copies, and the more copies there are, the harder it is to keep them consistent. The usual approach relies on an asynchronous version synchronization mechanism to bring the copies in sync as soon as possible. To guard against synchronization failures, all metadata is also scanned and compared periodically to ensure that it does not degrade. This process involves a large amount of data throughput and computation, but it runs offline with no strict time requirement. Even if it fails, system operation is barely affected, so the time margin is relatively large. The amount of metadata is not very large either. With a suitable algorithm, a consistency comparison usually takes no more than 2 or 3 hours, and the number of servers used for it can be increased or decreased as needed.
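The periodic comparison can be pictured with a small Python sketch. It assumes, purely for illustration, that each metadata replica can be dumped as a mapping from object key to version number; the function name `find_divergent_keys` is hypothetical, and a real system would compare far larger data sets in batches.

```python
def find_divergent_keys(replicas):
    """Return {key: [version on each replica, or None if missing]} for keys
    whose replicas do not all hold the same version."""
    all_keys = set()
    for replica in replicas:
        all_keys.update(replica)

    divergent = {}
    for key in sorted(all_keys):
        versions = [replica.get(key) for replica in replicas]
        if len(set(versions)) > 1:
            divergent[key] = versions
    return divergent


# Three replica dumps: "b" has diverged and is missing on the third replica.
replicas = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 3},
    {"a": 1},
]
print(find_divergent_keys(replicas))
```

Because the scan only reports differences, it can run offline and be retried without affecting requests in flight.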

The O&M burden of consistent hashing lies at the other end. The consistent hashing model and its variants have to perform large-scale data migration during expansion, because data must stay balanced. To avoid affecting the service, the migration speed must be limited, which makes the migration take a long time. Yet migration is sometimes time-critical. Until the migration is complete, the space occupied by the data being migrated cannot be reclaimed at the source, and the newly added space is temporarily unavailable. Reclaiming the space also takes time. When available space is tight, the pressure on migration speed is very high. If an exception occurs during the migration, things get worse. Over a migration that lasts several days, the probability of a disk failure is very high. The repair operation has higher priority and forces the migration to slow down or even stop. Migration involves many servers working concurrently, and coordinating and controlling them is very complicated. Different causes of migration also call for different migration policies. Add the handling of all kinds of exceptions, and the amount of O&M work becomes very painful. As the scale grows, it may eventually exceed the limit of O&M capability, and storage availability can no longer be maintained.
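To get a feel for the migration triggered by expansion, the following Python sketch builds a toy hash ring and counts how many objects change owner when one node is added. The `HashRing` class, the node names, the use of SHA-256 and the figure of 100 virtual nodes per server are all assumptions chosen for the illustration, not a description of any production system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at `vnodes` positions.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # The owner is the first ring position after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._points[idx][1]


def moved_keys(keys, old_nodes, new_nodes):
    """Return the keys whose owning node differs between the two rings."""
    old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
    return [k for k in keys if old_ring.node_for(k) != new_ring.node_for(k)]


keys = [f"object-{i}" for i in range(10_000)]
old_nodes = ["node-a", "node-b", "node-c", "node-d"]
new_nodes = old_nodes + ["node-e"]
moved = moved_keys(keys, old_nodes, new_nodes)
print(f"{len(moved)} of {len(keys)} objects must migrate to the new node")
```

Going from four to five nodes, roughly one fifth of the objects are expected to move. In a real cluster every one of them is a data object that has to be copied over the network at a throttled rate.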

The metadata scheme does not migrate data in most cases. Eliminating such a complex, heavily loaded process relieves much of the system maintenance pressure.

Therefore, although the metadata scheme has an additional metadata storage cluster, it is easier to maintain availability than the consistent hash scheme.

7. Performance

Performance is a concern of most programmers and is often given high priority. In cloud storage, however, performance ranks relatively low. Storage is an I/O-intensive system; no matter how the code is optimized, it is ultimately constrained by the physical devices.

The primary focus of cloud storage performance is concurrency. Handling concurrency well, so that requests do not block one another, goes a long way toward guaranteeing performance. In the metadata scheme, metadata access involves a great deal of concurrency: each user's data access request is turned into N concurrent requests. The system has to manage a large number of concurrent operations and carefully maintain the logical relationships among them. A traditional thread pool struggles to meet these needs, so an asynchronous model has to be adopted, which makes development harder.
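A minimal sketch of such an asynchronous fan-out, written with Python's standard `asyncio` module, is shown here. The `fetch_replica` coroutine is a hypothetical stand-in for a network call to one metadata replica (it only sleeps), and `quorum_read` is an illustrative name; real replica selection, timeouts and version comparison are left out.

```python
import asyncio


async def fetch_replica(replica_id: int, delay: float):
    """Hypothetical stand-in for a network request to one metadata replica."""
    await asyncio.sleep(delay)
    return replica_id, {"version": 1}


async def quorum_read(delays, r: int):
    """Query all replicas concurrently and return as soon as r of them answer."""
    tasks = [asyncio.create_task(fetch_replica(i, d)) for i, d in enumerate(delays)]
    responses = []
    try:
        for finished in asyncio.as_completed(tasks):
            responses.append(await finished)
            if len(responses) >= r:
                return responses
    finally:
        # Cancel the requests that are no longer needed.
        for task in tasks:
            task.cancel()
    raise RuntimeError("fewer than r replicas answered")


# Five replicas with different simulated latencies; wait for the fastest three.
print(asyncio.run(quorum_read([0.3, 0.01, 0.05, 0.03, 0.5], r=3)))
```

One user request becomes N in-flight requests here, and the caller only waits for the first R answers. Keeping track of these partially completed groups is the bookkeeping that makes development harder.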

Consistent hashing has an advantage in this respect. Because N is not large, the amplification of concurrency is much smaller than in the metadata scheme. Of course, as mentioned above, a small N reduces the consistency margin, which hurts availability.

Another important aspect of cloud storage performance optimization is reducing the number of cross-server requests. Every access across the network has latency, so the more round trips there are, the greater the overall latency, and the network load rises as well. The metadata scheme inherently requires a metadata retrieval step, so it is at a disadvantage here. However, consistent hashing fares no better when reading objects. As mentioned above, to apply the R + W > N logic without reading every copy in full, it has to perform a concurrent pre-read first. In this way, the number of cross-server round trips in consistent hashing is level with that of the metadata scheme. It merely involves fewer concurrent requests, and that comes at the cost of reduced fault tolerance.

The primary performance disadvantage of the metadata scheme is metadata access. The system accesses metadata mainly for retrieval, so a database or a storage engine with retrieval capabilities is usually used. These components hit performance limits under heavy load. Because of the large number of replicas, simply adding servers is costly. Fortunately, the optimization of these components, especially databases, is backed by very mature experience and can be done well with some effort.

8. Others

In terms of functionality, consistent hashing also has a disadvantage. Although it can implement the basic functions of object storage, a storage service sometimes needs to provide additional functions, such as listing the object keys saved by a user. If metadata exists, the user information can be saved in the metadata and searched as needed. In the consistent hashing scheme, however, all keys have been hashed and scattered across all servers. To obtain the list of a user's objects, the corresponding retrieval has to be performed on every storage node. Imagine a storage cluster with thousands or even tens of thousands of nodes: what would executing such an operation look like?

For users who store a large number of data objects, listing objects may have little direct value, and users may be able to accept doing without this feature. However, another per-user problem cannot be avoided: billing. Billing means calculating the capacity occupied by all objects of a user and then producing a billing list for the billing system.

The simplest billing method is to accumulate the user's access records: add the sizes in the write records and subtract the sizes in the delete records. But it is not quite that simple. When a key is overwritten, the size of the original object must first be deducted and then the size of the new object added.
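The accumulation rule, including the overwrite case, can be sketched as follows. The record layout `(user, op, key, size)` and the function name `replay_usage` are assumptions made for this illustration only.

```python
def replay_usage(records):
    """Replay access records and return {user: bytes currently stored}.

    Each record is (user, op, key, size) with op either "put" or "delete".
    """
    sizes = {}  # (user, key) -> size of the object currently stored
    usage = {}  # user -> total bytes
    for user, op, key, size in records:
        if op not in ("put", "delete"):
            raise ValueError(f"unknown operation: {op}")
        # Deduct the old object first; this covers both delete and overwrite.
        previous = sizes.pop((user, key), 0)
        usage[user] = usage.get(user, 0) - previous
        if op == "put":
            sizes[(user, key)] = size
            usage[user] += size
    return usage


records = [
    ("alice", "put", "a", 100),
    ("alice", "put", "b", 50),
    ("alice", "put", "a", 30),    # overwrite: deduct 100, add 30
    ("alice", "delete", "b", 0),  # delete: deduct 50
    ("bob", "put", "x", 7),
]
print(replay_usage(records))
```

Note that the replay has to remember the current size of every object in order to handle overwrites, so a lost or corrupted record silently skews every later total for that user.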

More importantly, because access records may be incorrect or lost, this relative calculation accumulates errors over time, so regular calibration is required. Calibration means summing the sizes of all objects of every user to obtain the absolute capacity. Because the object sizes are stored on the data storage nodes, this is a dreadful full-system data scan. At the same time, the consistency of the object attributes must be ensured to avoid billing errors.

Billing involves money, and users are usually very sensitive to it, so errors should be avoided as far as possible, which means calibrating as often as possible. However, billing calibration across the entire cluster is too costly, so it should be done as rarely as possible. Reconciling this contradiction poses a huge challenge for developers and O&M personnel.

In the metadata scheme, the information required for billing can be stored in metadata, and the billing operation is limited to metadata. The metadata volume is small, and data can be directly dumped and calculated on an additional server. This operation is relatively lightweight and can be executed more frequently to improve the accuracy of user bills.
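As a rough illustration of why this is lightweight, the sketch below computes absolute usage from a dump of metadata rows and compares it with incrementally accumulated totals. The row fields `user`, `key` and `size` and both function names are hypothetical; the real metadata schema is not described in this article.

```python
from collections import defaultdict


def usage_from_metadata(metadata_rows):
    """Sum object sizes per user from a metadata dump (absolute capacity)."""
    totals = defaultdict(int)
    for row in metadata_rows:
        totals[row["user"]] += row["size"]
    return dict(totals)


def calibration_drift(incremental, absolute):
    """Return {user: absolute - incremental} for users whose totals disagree."""
    drift = {}
    for user in sorted(set(incremental) | set(absolute)):
        difference = absolute.get(user, 0) - incremental.get(user, 0)
        if difference != 0:
            drift[user] = difference
    return drift


metadata_rows = [
    {"user": "alice", "key": "a", "size": 30},
    {"user": "bob", "key": "x", "size": 7},
    {"user": "bob", "key": "y", "size": 5},
]
incremental_totals = {"alice": 30, "bob": 7}  # a write record for bob was lost
absolute_totals = usage_from_metadata(metadata_rows)
print(absolute_totals)
print(calibration_drift(incremental_totals, absolute_totals))
```

The whole calculation touches only the metadata dump, so it can run on a separate server without scanning the data storage nodes.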

9. Summary

After this general analysis, we can see that consistent hashing has the advantages of simplicity and efficiency when only the functional requirements of cloud storage are taken into account. However, functionality is not the main factor shaping the architecture of a large cloud storage system. Non-functional requirements, including scale, scalability, reliability, availability, and consistency, are the key considerations for cloud storage. Consistent hashing and its variants have many defects in these respects. The defect in each individual respect is not an insurmountable obstacle and can often be resolved through design and O&M measures, but system complexity and O&M difficulty increase as a result. Most importantly, the combined effect of all these defects can easily break down a cloud storage system.

In general, consistent hashing is suitable for scenarios where the data size is small and there is no need for expansion. These characteristics mean that it cannot be applied well to large cloud storage systems. Although the metadata scheme has a more complex architecture, it has the advantage of flexibility, and that flexibility makes it better suited to cloud storage.
