Learning Distributed Databases: Distributed Database Design (Distributed Storage)

Source: Internet
Author: User

Distributed computing is a complex subject, and it requires readers to have a solid grounding in computer science fundamentals. With that theoretical background, we can read and learn the current mainstream distributed systems and frameworks more effectively. The author recently began studying distributed databases out of interest. The author's main research direction is HDFS, that is, distributed storage, so understanding this material should help the author contribute more to HDFS in the future. Back to the point: the author has recently been studying the design of distributed databases, in other words, which factors have to be considered when a distributed database is first designed. With these factors in mind, we get a general picture of what a distributed database is.

The origin of distributed databases

Distributed databases arose from the growing scale of data and the increasing complexity of business scenarios. Faced with massive data volumes, the traditional single, centrally managed database gradually exposed many limitations and eventually reaches a bottleneck, which led to the concept of the distributed database. A distributed database, as its name suggests, stores its data in a decentralized way across multiple nodes. As the data volume grows, it can be scaled out flexibly. A major problem of a distributed database, however, is managing the data in a unified way. Because the data is scattered across nodes (and cannot be managed as easily as in a centralized database with a single manager), we face many challenges, such as data consistency, fault tolerance, and metadata management.

Design of distributed databases

Below, the author discusses some of the problems that need to be considered in the initial design of a distributed database.

Distributed directory management

In a sense, distributed directory management can be understood as metadata management in a distributed setting. Because data is stored in a decentralized environment, each node holds only part of the global data, so the most direct view a node has is its own local data directory. The problem is how to make each node aware of the data held elsewhere, including its location, size, and replicas. Only with this information can a query issued on one node reach data on a non-local node. One simple approach is to replicate the global metadata on every node. This does solve the problem, but it is a simple, brute-force solution. Its overhead is large: every node has to maintain the full directory tree, which becomes especially costly when the metadata grows rapidly. HDFS instead adopts a centralized management strategy, which is the role of the NameNode. To avoid the NameNode becoming a single-point bottleneck, HDFS introduced the NameNode Federation mechanism. Interested readers can refer to the author's earlier article on the HDFS Federation mechanism; it is not covered further here.
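The centralized strategy can be illustrated with a minimal sketch: storage nodes report which blocks they hold to one directory service, and any node asks that service where a block lives. The class and method names below (`CentralDirectory`, `report`, `locate`) are hypothetical and chosen for illustration only; this is not the HDFS NameNode API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    """Metadata for one block: its size and the nodes holding a replica."""
    size: int
    replicas: set[str] = field(default_factory=set)


class CentralDirectory:
    """Hypothetical central metadata service (illustrative, not HDFS code)."""

    def __init__(self) -> None:
        self._blocks: dict[str, BlockInfo] = {}

    def report(self, node: str, block_id: str, size: int) -> None:
        """A storage node reports that it holds a replica of block_id."""
        info = self._blocks.setdefault(block_id, BlockInfo(size=size))
        info.replicas.add(node)

    def locate(self, block_id: str) -> list[str]:
        """Return the nodes holding block_id, sorted for stable output."""
        info = self._blocks.get(block_id)
        return sorted(info.replicas) if info else []


if __name__ == "__main__":
    directory = CentralDirectory()
    directory.report("node-a", "blk_1", 128)
    directory.report("node-b", "blk_1", 128)
    directory.report("node-b", "blk_2", 64)
    # A query on any node can now find non-local data through the directory.
    print(directory.locate("blk_1"))
```

Each storage node keeps only its local view, and the global view lives in one place, which is why that one place can become a bottleneck as metadata grows.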

Concurrency control in distributed databases

A running distributed database typically handles a large number of concurrent transactions. If they are managed improperly, operations can execute in the wrong order and corrupt data. This is where concurrency control comes in, and the techniques used for multithreaded programs apply here as well. There are two mainstream approaches to concurrency control: 1. The pessimistic approach (locking). 2. The optimistic approach (no locking). The pessimistic approach relies on locks; it is widely used in everyday work, so most readers will already be familiar with it. The optimistic approach avoids locks; a common practice is to attach a timestamp to each transaction and execute transactions in timestamp order. Concurrency control can also lead to deadlock, particularly when locks are used, so we need to consider whether deadlock can occur when implementing the program.
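As a rough illustration of the timestamp idea, the sketch below applies the textbook basic timestamp-ordering rule to a single data item: an operation is rejected if a transaction with a larger timestamp has already touched the item in a conflicting way. The names `TimestampOrderedItem` and `TransactionAborted` are hypothetical, and a real system would also need restart logic and handling for uncommitted writes. Because no locks are held, conflicts are resolved by aborting a transaction rather than by waiting.

```python
class TransactionAborted(Exception):
    """Raised when an operation arrives too late in timestamp order."""


class TimestampOrderedItem:
    """One data item under basic timestamp ordering (illustrative only)."""

    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0   # largest timestamp that has read the item
        self.write_ts = 0  # largest timestamp that has written the item

    def read(self, ts):
        # A read is too late if a younger transaction already wrote the item.
        if ts < self.write_ts:
            raise TransactionAborted(
                f"read at ts={ts} is older than write_ts={self.write_ts}"
            )
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        # A write is too late if a younger transaction already read or wrote.
        if ts < self.read_ts or ts < self.write_ts:
            raise TransactionAborted(
                f"write at ts={ts} conflicts with read_ts={self.read_ts}, "
                f"write_ts={self.write_ts}"
            )
        self.value = value
        self.write_ts = ts


if __name__ == "__main__":
    item = TimestampOrderedItem(value=100)
    item.write(ts=2, value=200)      # transaction 2 writes first
    try:
        item.read(ts=1)              # older transaction 1 arrives late
    except TransactionAborted as exc:
        print("aborted:", exc)
```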

Reliability guarantees in distributed databases

With large-scale distributed storage, some of the data in a distributed database is quite likely to become unavailable because of machine aging or other causes outside the software. Once data becomes unavailable, users can no longer read the data they want. This is why the concept of the replica is introduced. Obviously, we should not place all replicas on the same node, since that would defeat the purpose of keeping copies. Replicas are used in many mature distributed storage systems, the classic example being the three-replica strategy of HDFS. When one copy of the data is damaged, the system can recover quickly from the other available replicas. Every coin has two sides, though: replicas improve availability, but they also bring problems. One is how to guarantee data consistency between replicas; another that cannot be overlooked is that storage usage multiplies with the number of replicas.
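The two points above, keeping replicas on separate nodes and paying for them in storage, can be sketched in a few lines. The functions `place_replicas` and `raw_storage_needed` are hypothetical names, and the simple ring-walk placement is an assumption made for illustration; it is not the rack-aware placement policy that HDFS actually uses.

```python
def place_replicas(block_id: str, nodes: list, replication: int = 3) -> list:
    """Pick `replication` distinct nodes for a block (illustrative policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes to keep every replica on its own node")
    # Derive a deterministic starting position from the block id,
    # then walk the node list as a ring so no node is chosen twice.
    start = sum(block_id.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def raw_storage_needed(logical_bytes: int, replication: int = 3) -> int:
    """Raw capacity consumed when every byte is stored `replication` times."""
    return logical_bytes * replication


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4", "n5"]
    print(place_replicas("blk_1", cluster))
    # 1 GiB of user data with three replicas occupies 3 GiB of raw storage.
    print(raw_storage_needed(1024 ** 3) / 1024 ** 3)
```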

These are a few of the author's simple thoughts. This article is the author's own summary, and some of the views come from Chapter 1 of "Distributed Database System Principles" [1]. The relevant structure diagram from that chapter followed here in the original (image not available).


[1] Distributed Database System Principles, Chapter 1.
