Distributed storage is defined in contrast to single-machine storage. Storage became distributed because the data explosion of the internet era left a single machine unable to meet the storage needs of large-scale applications.
Storage system concerns
Regarding storage systems, we generally focus on the following areas:
- Data distribution and load balancing
- Reliability and consistency of data storage
- Data access performance
- System fault tolerance
- System scalability
Stand-alone storage systems have redundant array of independent disks (RAID) technology,
a method of storing the same data in different places on multiple hard disks. With data placed on multiple drives, input and output operations can overlap in a balanced manner, improving performance.
This technique basically addresses the first three points above: the disk array controller distributes data evenly across multiple hard disks
to achieve load balancing, and it ensures reliability through redundancy. Because all the disks are attached to a single machine, the redundant copies in a disk array are easy to keep consistent and to maintain.
The access performance of a storage system is largely constrained by disk performance, and it is improved by spreading accesses across multiple disks.
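As a rough illustration of the two ideas behind a disk array, the sketch below models disks as in-memory Python lists. It shows striping (chunks spread round-robin so accesses can overlap) and mirroring (a full copy on every disk for redundancy). The function names and the 4-byte `CHUNK_SIZE` are assumptions made for readability, not the interface of any real RAID controller.

```python
from typing import List, Set

CHUNK_SIZE = 4  # hypothetical stripe unit in bytes, kept tiny for readability


def stripe_write(data: bytes, num_disks: int) -> List[List[bytes]]:
    """Split data into chunks and place them round-robin across the disks."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for index, chunk in enumerate(chunks):
        disks[index % num_disks].append(chunk)
    return disks


def stripe_read(disks: List[List[bytes]]) -> bytes:
    """Reassemble the data by visiting the chunks in round-robin order."""
    total_chunks = sum(len(disk) for disk in disks)
    parts = []
    for index in range(total_chunks):
        # Chunk `index` lives on disk `index % n` at position `index // n`.
        parts.append(disks[index % len(disks)][index // len(disks)])
    return b"".join(parts)


def mirror_write(data: bytes, num_disks: int) -> List[bytes]:
    """Keep one full copy of the data on every disk."""
    return [data for _ in range(num_disks)]


def mirror_read(disks: List[bytes], failed: Set[int]) -> bytes:
    """Return the data from the first disk that has not failed."""
    for index, copy in enumerate(disks):
        if index not in failed:
            return copy
    raise IOError("all mirrors have failed")


striped = stripe_write(b"abcdefghijkl", num_disks=2)
mirrored = mirror_write(b"abcdefghijkl", num_disks=2)
```

Here `stripe_read(striped)` rebuilds the original bytes, and `mirror_read(mirrored, failed={0})` still returns them after the first disk is lost. Both schemes live inside one machine, so neither helps when that machine itself fails.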
The real problems are the last two points:
A disk array removes the fragility of a single disk, but it does not improve the overall availability, that is the fault tolerance, of the storage system, since the machine hosting the array remains a single point of failure.
Similarly, scalability is constrained by the number of physical expansion slots in the disk array.
Definition and classification of distributed storage
This is where distributed storage comes in. As a storage system, it has to face the same concerns listed above.
First, its definition:
A distributed storage system consists of a large number of ordinary PC servers interconnected over a network that together provide storage services to the outside as a single whole.
As the definition suggests, distributed storage is most often used as a service that meets a variety of data storage needs.
By data storage model, distributed storage services can be further classified as:
- File model: distributed file systems such as GFS and HDFS
- Relational model: distributed database systems such as Google Spanner and Taobao OceanBase
- Key-value model: used by many NoSQL systems, such as Redis
Gains and losses of distributed storage
To address the two difficulties of stand-alone storage, a distributed storage system scales out to clusters of hundreds or even thousands of machines, which solves the scalability problem.
Handling the hardware failures of individual servers at the software level can greatly improve the fault tolerance of the whole cluster.
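One common way to get both properties in software is to place each piece of data on several servers chosen by hashing. The sketch below uses a consistent hash ring, which is a swapped-in illustrative technique and not something this article prescribes. The class name `HashRing`, the node names, and the choice of 64 virtual points per node are all assumptions.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}   # ring position -> node name
        self._points: List[int] = []      # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Scale out: insert the node's virtual points into the ring."""
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._ring[point] = node
            bisect.insort(self._points, point)

    def nodes_for(self, key: str, replicas: int = 3) -> List[str]:
        """Walk clockwise from the key, collecting distinct nodes."""
        wanted = min(replicas, len(set(self._ring.values())))
        start = bisect.bisect(self._points, _hash(key))
        chosen: List[str] = []
        step = 0
        while len(chosen) < wanted:
            point = self._points[(start + step) % len(self._points)]
            node = self._ring[point]
            if node not in chosen:
                chosen.append(node)
            step += 1
        return chosen


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.nodes_for("user:42")  # three distinct servers hold this key
```

Because each key is held by three different servers, losing one server leaves two copies reachable. Calling `ring.add_node("node-e")` grows the cluster, and the only keys whose primary owner changes are the ones that move to the new node.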
These benefits naturally come at a cost: every gain brings a loss.
Any discussion of storage has to mention the transaction properties of a stand-alone database: A (atomicity), C (consistency), I (isolation), D (durability).
When extended to distributed storage, the system is constrained by the CAP theorem of distributed systems, C (consistency), A (availability), P (partition tolerance), and it is almost impossible to satisfy the full set of transaction properties.
Distributed storage implementations each trade away some of the transaction properties of single-machine storage to meet the requirements of specific service scenarios.
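To make the trade-off concrete, here is a minimal sketch of quorum replication, one familiar way of tuning between consistency and availability. It is an illustrative model under stated assumptions, not the design of any system named in this article: the class `QuorumStore`, its parameters `n`, `w`, `r`, and the explicit `reachable` lists that stand in for a network partition are all hypothetical.

```python
from typing import List, Optional, Tuple

Versioned = Tuple[int, str]  # (version number, value)


class QuorumStore:
    """Hypothetical store with n replicas, write quorum w, read quorum r."""

    def __init__(self, n: int, w: int, r: int) -> None:
        self.n, self.w, self.r = n, w, r
        self.replicas: List[Optional[Versioned]] = [None] * n
        self.version = 0

    def write(self, value: str, reachable: List[int]) -> bool:
        """Apply the write only if at least w replicas can be reached."""
        if len(reachable) < self.w:
            return False  # reject the write: availability is given up
        self.version += 1
        for index in reachable:
            self.replicas[index] = (self.version, value)
        return True

    def read(self, reachable: List[int]) -> Optional[str]:
        """Ask r reachable replicas and return the newest value seen."""
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not reachable")
        answers = [self.replicas[i] for i in reachable[: self.r]]
        seen = [answer for answer in answers if answer is not None]
        return max(seen)[1] if seen else None


# r + w > n: every read quorum overlaps every successful write quorum.
strict = QuorumStore(n=3, w=2, r=2)
strict.write("v1", reachable=[0, 1, 2])
strict.write("v2", reachable=[0, 1])      # replica 2 is partitioned away
strict_read = strict.read(reachable=[1, 2])

# r + w <= n: writes and reads stay available but may return stale data.
loose = QuorumStore(n=3, w=1, r=1)
loose.write("v1", reachable=[0, 1, 2])
loose.write("v2", reachable=[0])
loose_read = loose.read(reachable=[2])
```

With `w=2, r=2` the read still sees `v2` even though one of the replicas it asks is stale, while with `w=1, r=1` the read from replica 2 returns the old `v1`. Raising `w` to 3 would instead make the second write fail during the partition, which is the availability side of the same trade-off.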
In addition, a distributed storage system is built on a network, so on top of the basic disk access overhead it also incurs network overhead.
The average seek time of a mechanical hard drive is typically about 10 ms, while network access within a data center generally costs less than 0.5 ms, so the performance loss is relatively small.
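The arithmetic behind that claim can be worked out directly. The snippet below uses only the two figures quoted above. Treating one request as exactly one seek plus one network access is a simplifying assumption, and the function name is hypothetical.

```python
DISK_SEEK_MS = 10.0    # average mechanical disk seek time quoted above
NETWORK_MS = 0.5       # upper bound on intranet access cost quoted above


def relative_network_overhead(disk_ms: float, network_ms: float) -> float:
    """Return the network cost as a fraction of the disk cost."""
    return network_ms / disk_ms


overhead = relative_network_overhead(DISK_SEEK_MS, NETWORK_MS)
total_ms = DISK_SEEK_MS + NETWORK_MS
print(f"network adds {overhead:.0%} on top of a seek, {total_ms} ms in total")
```

Under these assumptions the network adds at most about 5% to a disk-bound request, which is why the loss is described as relatively small.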
Finally, whereas a disk array controller handles data distribution, load balancing, and consistency across the disks of one machine,
in distributed storage the software has to handle these at the level of the whole cluster, which greatly increases complexity.
Summary
We plan to work through the domain knowledge of back-end distributed architecture design and organize it into a complete body of knowledge.
This article gave an overview of distributed storage services, their classification, and the concerns that shape their architecture.
Follow-up articles will go into the architecture design essentials and implementation details of specific types of distributed storage services.
Reference
[1] Yang Shunhui. Large-scale distributed storage system. Mechanical Industry Press (2013-09), pp. 7-52
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
Back-end Distributed Series: Distributed Storage Overview