One of the most important aspects of designing and using a distributed storage system is data addressing: determining which machine (or even which disk) holds a copy of a given key's data. There are three commonly used approaches: centralized metadata management, distributed metadata management, and metadata-free design. Drawing on my own experience, this article discusses the characteristics of the three schemes.
1. Centralized metadata management: Using a central node is a simple and clear design for a distributed storage system. The central node usually provides metadata storage and query, cluster node state management, decision making, and task dispatch (a minimal code sketch of this addressing flow appears at the end of this section).
Advantages:
A. Because the metadata is managed in one place, statistical analysis and cluster operation/maintenance tasks are easy to implement.
B. The central node records the state information of the user data (that is, the metadata), so during expansion you can choose not to rebalance (the data migration caused by a rebalance can carry a huge performance cost) and still address data correctly.
Disadvantages and Solutions:
A. A single point of failure is one of the most dreaded problems in distributed system design, and a naive central-node design introduces exactly this problem: how do we achieve HA? Solutions: (1) use a primary/standby pair that synchronizes data incrementally or fully, synchronously or asynchronously (e.g., TFS, MFS, HDFS 2.0); (2) share remote storage between the primary and the standby (e.g., HDFS 2.0; the remote storage itself must be highly available).
B. Performance and capacity have an upper bound. A centralized node is limited by its own hardware's scale-up ceiling, and the query-based addressing mode amplifies the problem: even if clients cache metadata or a cache cluster is deployed, the ceiling cannot be fundamentally removed, and in some scenarios (such as massive numbers of small files) the problem persists. Solutions: (1) optimize and upgrade the hardware, e.g., machines with SSDs and large memory; (2) when the ceiling is actually hit, consider a distributed metadata management scheme.
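To make the addressing flow of this scheme concrete, here is a minimal sketch in Python. It is not the API of any real system; the class names and node addresses are hypothetical. The client asks the central metadata service where a key's replicas live, then reads from those data nodes directly; the synchronous forwarding to a standby is a toy illustration of the primary/standby HA idea from point A above.

```python
class MetadataStandby:
    """Hot standby: passively applies the primary's metadata mutations."""
    def __init__(self):
        self.locations = {}  # key -> list of data-node addresses

    def apply(self, key, nodes):
        self.locations[key] = nodes


class MetadataPrimary:
    """Primary: stores all metadata and replicates every change."""
    def __init__(self, standby):
        self.locations = {}
        self.standby = standby

    def put(self, key, nodes):
        # Synchronous replication: the standby applies the change before
        # the primary acknowledges, so a failover loses no acked metadata.
        self.standby.apply(key, nodes)
        self.locations[key] = nodes

    def lookup(self, key):
        # Query-based addressing: every client lookup hits this one node,
        # which is exactly the scale-up ceiling described in point B.
        return self.locations.get(key)


standby = MetadataStandby()
primary = MetadataPrimary(standby)
primary.put("user/42/avatar.jpg", ["dn3:9866", "dn7:9866"])
print(primary.lookup("user/42/avatar.jpg"))  # ['dn3:9866', 'dn7:9866']
```

After the lookup, the client talks to the returned data nodes directly; the central node is on the addressing path but not the data path.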
2. Distributed metadata management: Similar to the central-node scheme, except the metadata is sharded and stored across a set of distributed nodes (a code sketch follows the list of disadvantages below). This keeps the advantages of the central-node scheme while removing the performance and capacity ceiling; in addition, multiple nodes serve metadata queries concurrently, which improves overall system performance.
Disadvantages: systems of this kind are relatively rare (the ones I know of are some commercial distributed file systems and the open-source HDFS Federation); the architecture is complex and implementation is genuinely difficult:
A. The system contains two relatively independent groups of distributed nodes, metadata nodes and data nodes, and both are stateful. Each distributed subsystem faces the trade-offs of the CAP theorem and must be scalable, and the metadata in particular has high consistency requirements;
B. The metadata nodes must jointly maintain the state of the data nodes and make consistent decisions when that state changes, which poses a great challenge to the design and implementation of the system;
C. In addition, the storage devices required for the large volume of metadata are themselves a non-negligible cost.
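A minimal sketch of how sharded metadata addressing might look, assuming a fixed list of hypothetical metadata nodes: queries spread across several metadata servers instead of hitting a single one. Note this hash-based partitioning is only one option; a real system such as HDFS Federation partitions the namespace by mount points rather than by hashing keys.

```python
import hashlib

# Hypothetical metadata shard servers; each owns a slice of the metadata.
METADATA_NODES = ["meta0:8000", "meta1:8000", "meta2:8000"]

def metadata_node_for(key: str) -> str:
    """Pick the metadata shard responsible for this key."""
    digest = hashlib.md5(key.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") % len(METADATA_NODES)
    return METADATA_NODES[shard]

# Addressing now takes two hops: first find the right metadata shard,
# then ask it for the data-node locations, as in scheme 1.
print(metadata_node_for("user/42/avatar.jpg"))
```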
The two schemes above share a common idea: record and maintain the state of the data (that is, the metadata); to address a piece of data, first query the metadata server, then access the actual data.
3. Metadata-free design (Ceph being the main example): Unlike the two ideas above, this kind of system addresses data by algorithmic computation. One input to the addressing algorithm is some description of the cluster state (data node topology, weights, process status, and so on). Typical algorithms are consistent hashing and the CRUSH algorithm of Ceph's RADOS. Such algorithms usually do not manage user data directly; instead they introduce an intermediate layer of logical shards (for example, segments of the consistent-hashing ring, or Ceph's placement groups). These shards are coarse-grained, limited in number, and relatively fixed; every piece of user data belongs to exactly one shard, so by managing and maintaining the shards the system manages and maintains the user data. Such systems still have a central configuration management node (such as the Ceph RADOS monitor), but it only manages and maintains important state such as the cluster map and the shards; it does not provide metadata storage or query.
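To illustrate the two-level mapping, here is a minimal sketch under a small, static cluster map (all names and weights are hypothetical). Step 1 hashes a key to one of a fixed number of logical shards (placement groups, in Ceph's terms); step 2 deterministically maps the shard to data nodes by computing over the cluster map. I use weighted rendezvous (highest-random-weight) hashing as a simple stand-in for CRUSH, which is far more elaborate; the point is that any client holding the same cluster map computes the same answer, with no per-key metadata stored anywhere.

```python
import hashlib
import math

PG_COUNT = 256  # fixed, limited number of logical shards
CLUSTER_MAP = {"osd0": 1.0, "osd1": 1.0, "osd2": 2.0}  # node -> weight

def pg_of(key: str) -> int:
    """Step 1: every key belongs to exactly one placement group."""
    h = hashlib.md5(key.encode()).digest()
    return int.from_bytes(h[:4], "big") % PG_COUNT

def replicas_of(pg: int, cluster_map: dict, n: int = 2) -> list:
    """Step 2: deterministically map a PG to n data nodes.

    Each node gets a pseudo-random score derived from hash(pg, node),
    scaled by its weight (standard weighted rendezvous hashing);
    the top-n scores win.
    """
    def score(node: str) -> float:
        h = hashlib.md5(f"{pg}:{node}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 0.5) / 2**64  # uniform in (0, 1)
        return -cluster_map[node] / math.log(u)
    return sorted(cluster_map, key=score, reverse=True)[:n]

key = "user/42/avatar.jpg"
pg = pg_of(key)
print(pg, replicas_of(pg, CLUSTER_MAP))  # same inputs -> same answer, on any client
```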
Advantages:
A. As noted above, the system only manages and maintains logical shards and cluster state, and does not store metadata about individual user data; scalability is therefore greatly enhanced, which is especially evident in scenarios with massive amounts of metadata.
B. The parameters the addressing algorithm needs are small and relatively fixed, so clients can cache them and compute addresses locally and in parallel, avoiding an addressing performance bottleneck;
Disadvantages:
A. When the cluster expands (or even when weights change), a rebalance is needed; for clusters with large data volumes (petabytes and above) this triggers massive data migration that keeps the cluster under sustained high load, so the latency, IOPS, and other performance indicators of normal service requests degrade (the experiment after this list quantifies the effect). Some scenarios want to avoid rebalancing on expansion, for example when cluster capacity runs short. The common strategy is to assess each cluster's performance and capacity in advance and, when expansion is needed, simply build a new cluster (see Yahoo's public sharing); if a single cluster must rebalance, reduce the cluster load through manual intervention. As for the root cause of rebalance: expansion changes the cluster state, which changes the addressing algorithm's output, so the final data distribution has to change.
B. The locations of data replicas are computed by the addressing algorithm and are relatively fixed; they can hardly be adjusted manually, although the overall data distribution can usually be shifted by changing weights;
C. The central configuration management node only manages shard information and knows nothing about individual user data, so statistical-analysis requirements must be met by periodically collecting information from the data nodes and storing and maintaining it.
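The following small experiment reuses the toy two-level mapping sketched in section 3 (names and weights are hypothetical) to illustrate points A and B: changing the cluster map changes the algorithm's output for some fraction of the placement groups, and exactly those PGs must migrate; changing a weight shifts the distribution the same way.

```python
import hashlib
import math

PG_COUNT = 1024

def replicas_of(pg: int, cluster_map: dict, n: int = 2) -> frozenset:
    """Same weighted rendezvous mapping as the sketch in section 3."""
    def score(node: str) -> float:
        h = hashlib.md5(f"{pg}:{node}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 0.5) / 2**64
        return -cluster_map[node] / math.log(u)
    return frozenset(sorted(cluster_map, key=score, reverse=True)[:n])

def moved_fraction(old_map: dict, new_map: dict) -> float:
    """Fraction of PGs whose replica set changes between two cluster maps."""
    moved = sum(replicas_of(pg, old_map) != replicas_of(pg, new_map)
                for pg in range(PG_COUNT))
    return moved / PG_COUNT

old = {"osd0": 1.0, "osd1": 1.0, "osd2": 1.0}
grown = dict(old, osd3=1.0)        # expansion: add one node
reweighted = dict(old, osd2=2.0)   # weight change: shift load toward osd2

print(f"add one node:      {moved_fraction(old, grown):.0%} of PGs must migrate")
print(f"double one weight: {moved_fraction(old, reweighted):.0%} of PGs must migrate")
```

Every PG counted as "moved" corresponds to real data migration in a live cluster, which is why large deployments either plan capacity per cluster up front or throttle rebalance traffic manually.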
Summary: As the comparison above shows, each addressing strategy gives its system corresponding strengths and weaknesses. None of the three is perfect, but each has scenarios and workloads it suits; system design and selection call for comprehensive consideration.
Reference articles:
http://blog.csdn.net/tiankai517/article/details/44599291
http://blog.csdn.net/liuaigui/article/details/6749188
http://shitouer.cn/2012/12/hdfs-federation-introduction/
http://www.infoq.com/cn/interviews/interview-with-yangjintao-talk-open-source-storage-scheme
------------------------------------
Personal original work; please credit the source when reposting.
Thoughts on Metadata Management in Distributed Storage Systems