Advanced Distributed Databases (a look back at the advanced database course I completed)


Advanced Database Technology

(i) Introduction

Database systems began to develop in the 1960s, moving from IBM's hierarchical model (IMS) through the network model and the relational model to today's coexistence of multiple models (homogeneous, heterogeneous, and mixed forms). Three Turing Award winners in the database field, Charles Bachman, Edgar F. Codd, and James Gray, contributed greatly to this development. In particular, Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks" laid the foundation for relational databases. As science and technology advanced, the requirements of various industries drove databases forward, combining them with distributed technology, parallel computing, artificial intelligence, and so on, and producing many new database technologies. Distributed database technology originated in the mid-1970s. Two reasons drove its development: first, application demand; second, the evolution of the hardware environment. Businesses and organizations spanning multiple geographic locations need not only local management but also global scheduling and control, which a centralized database cannot provide. At the same time, the Internet and computer hardware were developing rapidly. For these two reasons, people wanted a system that could be distributed across different physical sites, meet real-world needs, and still behave like a single database.

Research and development of distributed database systems then began. The earliest systems included SDD-1 from CCA in the United States, IBM's R* distributed database system, and Berkeley's distributed Ingres. As the technology matured, commercial databases such as Oracle, Sybase, and DB2 also began to introduce distributed database features. They all support loosely coupled transaction management based on components and middleware, offering high flexibility and extensibility, and this approach is replacing the traditional tightly coupled transaction management mechanism.

In recent years, with the growth of the Internet and the Web, Web-based distributed systems have become mainstream, and with new technologies such as cloud computing and the Internet of Things, the position of distributed database technology has become even more prominent. Distributed data processing is an essential part of such systems; it involves distributed storage management, distributed query processing and optimization, distributed transaction management, failure recovery, and concurrency control.

This paper first introduces the concept, functions, and characteristics of distributed database systems; this is the first part. The second part describes the architecture of distributed database systems. The third part describes the design methodology for distributed database systems, including the design of fragments and allocations (covered in the second article). The fourth part explains distributed query optimization, from query decomposition to query localization, then global query optimization, and finally local query optimization. The fifth part covers distributed transaction management together with distributed recovery and concurrency control: transaction management includes the concept of a transaction, the transaction implementation model, the execution management model, and so on; distributed recovery covers failure types, recovery methods, and reliability protocols, building on the corresponding material for centralized databases; concurrency control includes its basic concepts and methods, locking, deadlock management, and so on.

(ii) Relational databases

Codd's paper "A Relational Model of Data for Large Shared Data Banks", published in 1970, laid the foundation for relational databases. Although the paper attracted little attention inside IBM at the time, it resonated at international conferences, and the relational database was established as a field in its own right. Oracle was arguably the first to seize the commercial opportunity.

A data model consists of three parts: data structures, operations, and integrity constraints. The main data structures include the hierarchical model, the network model, the relational model, and so on. The relational model can be described as a two-dimensional table structure, or as a model of relational entities; these views are equivalent. Today's databases often form heterogeneous networks in which multiple models coexist.

For data operations there are relational algebra, relational calculus, the SQL language, and so on. Relational algebra is the internal language of the database system; SQL is a structured query language and serves as the external language of the database, i.e. the query statements users write. Relational calculus is a declarative formalism based on predicate logic.
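To make the external/internal distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table `student` and its rows are hypothetical, introduced only for this example: the user writes SQL, while internally the engine evaluates the equivalent relational-algebra expression π_name(σ_{dept='CS'}(student)).

```python
import sqlite3

# Hypothetical in-memory table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Li", "CS"), (2, "Wang", "Math"), (3, "Zhao", "CS")])

# External language (SQL); internally evaluated as the relational-algebra
# expression: project_name( select_{dept = 'CS'} ( student ) ).
rows = conn.execute(
    "SELECT name FROM student WHERE dept = 'CS' ORDER BY name").fetchall()
print(rows)  # [('Li',), ('Zhao',)]
```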

For data integrity constraints there are entity integrity, referential integrity, and user-defined integrity. Entity integrity means an entity's primary key cannot be NULL. Referential integrity means a foreign key in a referencing table must match a primary key value in the referenced table, so the two tables stay consistent. User-defined integrity covers constraints the user defines, such as default attribute values. Database design also has to consider normal forms: 1NF, 2NF, 3NF, BCNF, 5NF, and so on. 1NF requires atomic attribute values that cannot be divided further; 2NF eliminates partial dependencies on the key; 3NF eliminates transitive dependencies. In practice, designs generally satisfy 3NF, but the actual situation must also be considered; as the saying goes, water has no constant shape, and some appropriate redundancy can be acceptable. With improvements in hardware and technology, future designs may aim for 4NF or 5NF. This is the theory of normalization.
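As a hedged illustration of the 3NF rule above, the following sqlite3 sketch uses a hypothetical `emp_flat` table (not from the article) containing the transitive dependency emp_id → dept → dept_head, decomposes it into two 3NF tables, and checks that a join recovers the original rows losslessly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical unnormalized table: dept_head depends on dept,
    -- not directly on the key emp_id (a transitive dependency).
    CREATE TABLE emp_flat (emp_id INTEGER PRIMARY KEY, name TEXT,
                           dept TEXT, dept_head TEXT);
    INSERT INTO emp_flat VALUES (1, 'Li', 'CS', 'Chen'), (2, 'Wu', 'CS', 'Chen');

    -- 3NF decomposition: move the dept -> dept_head dependency to its own table.
    CREATE TABLE dept (dept TEXT PRIMARY KEY, dept_head TEXT);
    CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, name TEXT,
                       dept TEXT REFERENCES dept(dept));
    INSERT INTO dept SELECT DISTINCT dept, dept_head FROM emp_flat;
    INSERT INTO emp  SELECT emp_id, name, dept FROM emp_flat;
""")

# The decomposition is lossless: a natural join recovers the original rows.
rows = conn.execute("""
    SELECT e.emp_id, e.name, e.dept, d.dept_head
    FROM emp e JOIN dept d ON e.dept = d.dept ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(1, 'Li', 'CS', 'Chen'), (2, 'Wu', 'CS', 'Chen')]
```

Note that the decomposition also removes the update anomaly: changing the CS department head now touches one row in `dept` instead of every CS employee.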

The structure of a centralized relational database differs from the distributed database schema: it is a three-level schema with two levels of mapping. As Figure 1 shows, the top level consists of multiple external schemas (subschemas), which correspond to the global external schemas of a distributed system; an external/conceptual mapping connects them to the conceptual schema, and a conceptual/internal mapping then connects the conceptual schema to the internal schema. There is no global/local distinction here, because a centralized database resides at a single physical site; when it evolves into a distributed database, this becomes a four-level schema with three levels of mapping.


Database design generally proceeds from requirements analysis, to conceptual design (producing an ER model), to logical design (building the data model), to physical design (generating the corresponding tables in a specific database). Throughout this process, design experience and normalization theory should be applied flexibly and with a clear purpose.

A centralized database also involves query optimization, failure recovery, concurrency control, and so on. This centralized material corresponds to the local DB of a distributed database, i.e. processing at a single site; the aspects unique to distribution are discussed in the distributed sections.

(iii) Distributed database systems

Distributed database systems developed alongside application requirements and the hardware environment, and they differ greatly from centralized database systems. Conceptually, a distributed database is physically distributed across different sites but logically managed as a single system connected by a computer network; it hides the physical distribution (distribution transparency), so that users do not feel the difference between local and remote data and experience a seamless connection. A distributed database system combines centralized management with decentralized local control: local data is stored and maintained locally, while access to remote data is coordinated by global management (a coordinator). From this definition, a distributed database system should have the following characteristics:

(1) Physical distribution: data is dispersed across multiple sites, which is the biggest difference from a centralized database;

(2) Logical integrity: this requires the physical distribution to be transparent, which is the major difference from a merely decentralized collection of databases; there is a global database (GDB) and local databases (LDBs), managed respectively by a GDBMS and LDBMSs;

(3) Site autonomy: each site is an autonomous system with its own intelligence, which distinguishes a distributed database from a multiprocessor system under centralized scheduling.

From physical distribution and logical integrity, we can also derive some finer characteristics of distributed databases:

(1) The data at each site is transparent, including fragmentation transparency, replication transparency, and location transparency;

(2) A strategy combining centralized control with site autonomy;

(3) Appropriate redundancy: although it can easily lead to inconsistent data and is unfavorable to updates, it speeds up distributed queries and retrieval and improves system performance, reliability, and availability;

(4) Distributed transaction management.

Distributed databases can be classified, by the data models of their local database management systems, into homogeneous and heterogeneous types, and homogeneous systems can be subdivided further. This variety results from the different data models of centralized databases and the different products developed by vendors. Classified by the type of global control, there are globally centralized control, globally decentralized control, and variable global control (with master-slave variants); this classification reflects how global control is distributed.

The key technologies in distributed database systems are database design (involving fragmentation and allocation), query optimization, transaction and concurrency control (with failure recovery), security, and so on. These matter because of the unique advantages of distribution:

(1) It suits distributed management and effectively improves system performance;

(2) It offers good economy and flexibility;

(3) The system has high reliability and availability.

There are also some drawbacks to distributed systems:

(1) System design is complex: fragmentation and allocation, system performance, response time, and availability all affect one another, and distributed transaction management, failure recovery, and concurrency control remain complicated;

(2) System operation and maintenance are complex, especially maintaining data consistency, which requires transaction and failure handling;

(3) Database security and confidentiality are difficult to control, because distribution and site autonomy allow different sites to handle data differently.

(iv) Distributed database system architecture

4.1 Architecture description methods

An architecture is a guide to standardizing a system. A system's architecture can usually be described from three different angles: hierarchy-based, component-based, and data-schema-based descriptions. A hierarchy-based description presents the system in terms of its levels; a component-based description presents the system's components and the relationships between them; a data-schema-based description defines the different data type structures and their relationships, along with the views available to the corresponding components. The data-schema-based description is especially well suited to database systems.

A hierarchy-based description can use the client/server structure, possibly with "middleware" between client and server; Figure 2 shows the client/server form of the description.



An AP (application processor) is the software module that handles the client's user queries and distributed data processing. A DP (data processor) is the software module that handles data management. A CM (communications manager) is responsible for transmitting commands and data between APs and DPs across sites. A C/S structure can thus be divided into single-AP/single-DP, multi-AP/multi-DP, single-AP/multi-DP, and so on, according to how servers and clients are arranged at this level. Middleware improves reuse, keeps coupling low, and increases the system's utility.

The data schema of a distributed database differs markedly from that of a centralized database: it is a four-level schema with three levels of mapping. China's draft standard for distributed database systems gives the abstract four-level schema shown in Figure 3.

The outermost level is the global external level. The global external schemas (GES) — external schema 1, external schema 2, ..., external schema n — are the global user views, typically defined as views over the global conceptual schema. A global external schema is the highest abstraction a global user sees in a distributed database system; the user works with a global user view without needing to care about the underlying implementation, which is again transparency.

The second level is the global conceptual level, called the global conceptual view: the overall abstraction of the distributed database, including all data characteristics and the logical structure. The global view is mapped into local schemas through the fragmentation schema and the allocation schema. Fragmentation is the logical division of global data: it defines the fragments and the mapping between global relations and fragments. Allocation, based on the chosen data distribution strategy, defines the physical site(s) where each fragment resides, the type of fragment mapping, and the degree of redundancy, finally establishing the one-to-one or one-to-many mapping between fragments and sites.

The local conceptual level (LCS), also called the local conceptual view, is a subset of the global conceptual schema; it describes the logical structure of the local data at a site. The local conceptual schema at a site is the collection of the physical images, at that site, of all global relations stored there.

Finally, the local internal level (LIS) contains the local internal schemas, similar to the internal level of a centralized database; it describes not only local data access at the site but also the representation of global data at that site.

This is the data schema structure of the four-level schema with three levels of mapping.
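The fragmentation and allocation steps described above can be sketched in plain Python. Everything here is hypothetical (the relation, the predicates, and the site names are invented for illustration): a global relation is horizontally fragmented by selection predicates, one fragment per site, and the usual correctness conditions — completeness, disjointness, and reconstructability by union — are checked.

```python
# Hypothetical global relation (a list of rows) for illustration only.
global_relation = [
    {"id": 1, "city": "Beijing",  "balance": 100},
    {"id": 2, "city": "Shanghai", "balance": 250},
    {"id": 3, "city": "Beijing",  "balance": 80},
]

# Fragmentation schema: one selection predicate per (hypothetical) site.
predicates = {
    "site_beijing":  lambda r: r["city"] == "Beijing",
    "site_shanghai": lambda r: r["city"] == "Shanghai",
}

# Allocation: each fragment is placed at its site (no redundancy here;
# a replicated allocation would copy fragments to several sites).
fragments = {site: [r for r in global_relation if pred(r)]
             for site, pred in predicates.items()}

# Completeness and disjointness: every row lands in exactly one fragment.
all_rows = [r for frag in fragments.values() for r in frag]
assert sorted(r["id"] for r in all_rows) == [1, 2, 3]

# Reconstructability: the union of the fragments rebuilds the global relation.
reconstructed = sorted(all_rows, key=lambda r: r["id"])
assert reconstructed == global_relation
print({site: len(frag) for site, frag in fragments.items()})
```

A user query such as "balances in Beijing" could then be localized to `site_beijing` alone, which is exactly what fragmentation transparency hides from the global user.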


The component structure of a DDBS includes the application processor, data processor, local scheduling manager, local recovery manager, storage manager, and so on. The application processor (AP) provides the user interface, semantic data control, distributed query processing, distributed transaction management, the global dictionary, and so on. It has two important functions: as a user-command translator, it translates user commands in the data manipulation language into canonical commands; in the other direction, it translates the data returned by the data processor into data the user understands. The data processor reads data from the database, translates canonical commands into physical commands, is responsible for choosing the best access path to the physical data structures, and transforms physical data back into canonical data. Figure 4 shows the function diagram of the component structure: the user issues a command to the application processor, whose command translator turns it into a canonical command after checking constraints; the canonical command is passed to the data processor's canonical-command translator, which turns it into a physical command; the database is accessed with the processor's support to obtain physical data; the results are converted back into canonical data and then formatted by the application processor into results the user understands, which are finally returned to the user. The external, conceptual, and internal models correspond to the content of each stage, as shown in Figure 4.


(That is all for now; the analysis of distributed database design will follow in the next article.)
