Discussion of the Principles and Design of Horizontal Database Partitioning: Sharding into Databases and Tables, Clustering, and Load Balancing


Reproduced from: http://www.cnblogs.com/olartan/archive/2009/12/02/1615131.html

Chapter 1: Introduction

When the amount of data is very large, the tables are first split across different databases, and the rows are then distributed to those databases according to a key column. Once the data has been spread across databases, queries, I/O, and other operations can be handled by a group of machines working together. This article discusses the principles of splitting a massive database into multiple databases and tables and balancing the load across them, and proposes a solution.

With the widespread adoption of Internet applications, the storage and retrieval of massive data has become the bottleneck of system design. For a large-scale Internet application, billions of page views per day place a very high load on the database, creating serious problems for the stability and scalability of the system. Using data partitioning to improve site performance and scaling out the data layer has therefore become the architects' preferred approach. Splitting the database horizontally reduces the load on any single machine and minimizes the damage caused by an outage; a load balancing strategy spreads access across machines, further reducing the chance that any one of them goes down; a clustering scheme solves the problem of the database becoming unreachable when a single-point database server fails; and a read/write separation strategy maximizes the speed and concurrency of read operations in the application.

At present, large Internet companies in China (Taobao, Alibaba, Tencent, and others) make heavy use of such data partitioning schemes, and most of them have implemented their own distributed data access layer (DDAL). By implementation level, these roughly divide into two approaches (taking Java applications as an example): encapsulation at the JDBC layer, and implementation at the ORM framework layer. As for direct encapsulation of the JDBC layer, a relatively well-developed domestic project is "Amoeba", developed by a research institute of the Alibaba Group; it is still in the beta phase, and its runtime efficiency and production readiness remain to be seen. As for implementation at the ORM framework layer, Taobao's distributed data access layer based on iBATIS and Spring, for example, has been in use for years, and its efficiency and production effectiveness have been affirmed by developers and users alike. This article discusses a distributed data access layer built at the ORM framework layer. The difficulty of the subject lies in formulating and choosing the routing rules, and in later extensibility, for example how to expand database capacity (add machine nodes) with the least possible data migration. The core of the discussion revolves around the routing rules and load balancing policies used when splitting databases and tables.

Chapter 2: Basic Principles and Concepts

2.1 Fundamentals:

The process by which we come to understand any problem is always the same: what → why → how. This article will discuss and examine these three questions in turn:

2.1.1 What is data slicing

The word "Shard" means "fragments" in English, and the technical terminology associated with the database seems to be the earliest seen in massively multiplayer online role-playing games. "Sharding" is called "Shard". Sharding is not a new technology, but a relatively simple concept of software. It is well known that after MySQL 5, the data table partitioning feature, before this, many of the potential users of MySQL has a concern for MySQL scalability, and whether the partitioning function is a measure of the scalability of a database is a key indicator (not the only indicator). Database extensibility is an eternal topic, and MySQL advocates often ask: how do you handle processing of application data on a single database that needs to be partitioned and so on?  The answer is: sharding. Sharding is not a feature attached to a particular database software, but an abstraction on top of specific technical details, a solution for horizontal scaling (scale out, or scale-out, and scale-out), with the main purpose of exceeding the I/O capability limits of a single-node database server. Resolve database extensibility issues.

Data is distributed horizontally across different databases or tables according to a set of sharding rules; when querying, the specific database or table to hit is found through the corresponding DB routing or table routing rule. "Sharding" here usually refers to "horizontal partitioning", which is the focus of this article. What do the concrete partitioning and routing methods look like? At this point the reader will inevitably have questions, so here is a simple example taken from a blog application: consider the blog-post (article) table, whose columns include, among others, a user_id identifying the blog's owner (the full column list is omitted here).

Faced with such a table, how do we partition it? How do we distribute its rows into tables in different databases? Analyzing the blog application, it is not hard to reach this conclusion: its users fall into two categories, visitors and blog owners. A visitor browsing a blog is always browsing under one specific user's blog, and an owner managing posts is likewise operating within his own space, i.e. under one specific user. "Specific user", expressed as a database column, is user_id, and user_id is exactly the basis for the partitioning rule we need. We can proceed like this: put all article rows with user_id in 1~10000 into the article table in DB1, all article rows with user_id in 10001~20000 into the article table in DB2, and so on. The article data is thus naturally divided among the databases, and the goal of data partitioning is achieved. The next problem to solve is how to find the specific database later. The answer is actually simple and obvious: since we used user_id as the distinguishing column when splitting, the database routing process naturally cannot do without user_id either. Think back to the blog application: whether I am visiting someone else's blog or managing my own, I always know whose blog it is, that is, I know its user_id. Using this user_id and applying the partitioning rule in reverse, I can locate the specific database: if user_id is 234, the rule maps it to DB1; if user_id is 12343, the rule maps it to DB2; and so on. Using the partitioning rule to route back to a concrete DB is the process we call "DB routing".
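To make the idea concrete, here is a minimal sketch of such a range-based DB router in Java. The class and method names (RangeDbRouter, routeByUserId) and the fixed range size of 10,000 are illustrative assumptions taken from the example above, not part of any particular framework.

/**
 * A minimal sketch of range-based DB routing: user_id 1~10000 -> DB1,
 * 10001~20000 -> DB2, and so on. Names and range size are illustrative.
 */
public class RangeDbRouter {

    private static final int RANGE_SIZE = 10_000; // users per database, as in the example

    /** Returns a logical database name such as "DB1" for the given user_id. */
    public String routeByUserId(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive: " + userId);
        }
        // user_id 1..10000 -> index 1, 10001..20000 -> index 2, ...
        long dbIndex = (userId - 1) / RANGE_SIZE + 1;
        return "DB" + dbIndex;
    }

    public static void main(String[] args) {
        RangeDbRouter router = new RangeDbRouter();
        System.out.println(router.routeByUserId(234));    // DB1
        System.out.println(router.routeByUserId(12343));  // DB2
    }
}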

Of course, such a DB design for data partitioning is necessarily an unconventional, unorthodox design. So what does an orthodox DB design look like?

It is simply what we use every day. Normally we consciously design our schema according to the normal forms, and under load we may use the standard replication mechanism to improve read/write throughput and performance. That already satisfies many requirements, but the drawbacks of replication itself are fairly obvious (they are discussed below). Note the phrase "consciously design according to the normal forms": because a partitioned design necessarily violates the usual rules and constraints, in order to partition we have to keep redundant columns in the tables, used purely as distinguishing or marker columns for the split, such as user_id in the article example above. (Admittedly that example does not show the redundancy of user_id very well, because user_id would appear in the table even without partitioning, so there we got it for free.) Of course, redundant columns do not appear only in partitioning scenarios; in many large applications redundancy is also necessary, which touches on efficient DB design in general and will not be elaborated here.

2.1.2 Why data splitting

The previous section briefly described what data partitioning is, and the reader may wonder why it is needed at all. Isn't a mature, stable database such as Oracle enough to support the storage and querying of massive data? Why do we still need data partitioning? Indeed, Oracle is mature and stable, but its high licensing cost and the high-end hardware it requires are not something every company can afford. Imagine yearly license fees in the tens of millions plus minicomputers costing tens of millions more as the hardware platform: can an ordinary company pay for that? And even if it can, why not choose a cheaper solution with better horizontal scalability if one exists?

However, things are never entirely satisfactory. Normally we consciously design the schema according to the normal forms and, under load, use the replication mechanism to improve read/write throughput and performance, which already satisfies many requirements; but the drawbacks of this mechanism are fairly obvious. First of all, its effectiveness depends on the proportion of read operations. The master tends to become the bottleneck: write operations queue up to be executed, under overload the master is the first thing that cannot keep up, the slaves' replication lag may grow large, and a great deal of CPU is consumed because every write executed on the master must be replayed on every slave. At that point replication is of little further help. If replication will not do, why does sharding work? The reason is simple: it scales well. We know that every machine, however well configured, has its physical limits, so when the application reaches or far exceeds what a single machine can handle, we can only seek help from more machines or keep upgrading the hardware; the common approach is scale-out, sharing the pressure by adding more machines. We must also ask: as the business keeps growing, can capacity grow linearly by adding machines to meet demand? Sharding distributes computation, storage, and I/O in parallel across many machines, making full use of their combined processing power while avoiding single points of failure, improving system availability, and providing good fault isolation.

Taking all these factors together, data partitioning is necessary, and the partitioning discussed here again takes MySQL as the background. For cost reasons many companies choose free, open-source MySQL. Developers familiar with MySQL know that table partitioning only became available in MySQL 5; before that, many potential users were concerned about MySQL's scalability, and whether partitioning exists was treated as one key metric (though not the only one) of a database's scalability. Database scalability is an eternal topic, and MySQL advocates are often asked how to handle application data that, on a single database, would need to be partitioned. The answer, again, is sharding, which is exactly the data partitioning scheme discussed here.

Using free MySQL on cheap servers, or even a cluster of PCs, to achieve the effect of a minicomputer running a large commercial database, reducing capital investment and operating costs: why not? So we choose sharding and embrace sharding.

2.1.3 How to do data segmentation

Having covered what and why, we now discuss the methods and forms of data partitioning in detail.

Data partitioning can be physical: data is distributed across different DB servers according to a set of partitioning rules, and routing rules direct each access to a specific database. Each access therefore no longer faces a single server but N servers, which reduces the load on any single machine.

Data partitioning can also take place within a database: according to a set of rules, data is distributed across different tables of the same database. For example, the article table can be split into sub-tables such as article_001, article_002, and so on; joined together horizontally, these sub-tables logically make up the complete article table. The purpose of doing this is simple. Suppose the article table currently holds 50 million rows and we need to insert a new row; after the insert completes, the database must update the index for this table, and the overhead of maintaining an index over 50 million rows is not negligible. If instead we split the table into 100 sub-tables, article_001 through article_100, the 50 million rows are spread so that each sub-table holds only about 500,000; the cost of index maintenance for an insert into a table of 500,000 rows drops by an order of magnitude, which greatly improves the DB's runtime efficiency and increases its effective concurrency. The benefits of table splitting do not stop there: there are other significant gains, for example reduced contention from the locks taken by write operations.
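As a minimal illustration, the sketch below resolves the sub-table name for a row. It assumes the split key is a numeric id (for example user_id) taken modulo 100 and that the sub-tables are named article_001 through article_100; both assumptions are illustrative, since the article above does not fix the exact mapping.

/**
 * Minimal sketch: resolve which of article_001..article_100 a row belongs to.
 * Assumes a numeric split key modulo 100; the mapping is illustrative only.
 */
public class ArticleTableRouter {

    private static final int TABLE_COUNT = 100;

    /** Returns a sub-table name such as "article_035" for the given split key. */
    public String resolveTable(long splitKey) {
        // Map remainders 0..99 onto suffixes 001..100 so every table gets used.
        long suffix = (splitKey % TABLE_COUNT) + 1;
        return String.format("article_%03d", suffix);
    }

    public static void main(String[] args) {
        ArticleTableRouter router = new ArticleTableRouter();
        System.out.println(router.resolveTable(234));   // article_035
        System.out.println(router.resolveTable(12343)); // article_044
    }
}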

In short, splitting across databases reduces the load on a single machine, while splitting into sub-tables improves the efficiency of data operations, especially writes. We still have not answered how exactly to split; the partitioning rules are elaborated next.

As mentioned above, to partition data horizontally every table must carry a redundant column that serves as the split key and marker. In typical applications we choose user_id as this distinguishing column, which leads to the following three partitioning methods and rules (others are of course possible; a code sketch of rules 2 and 3 follows the list):

(1) Split by number range:

Using user_id as the distinguishing column, user_id 1~1000 maps to DB1, 1001~2000 maps to DB2, and so on;

Pros: partial migration of data is possible

Cons: Uneven data distribution

(2) Hash modulo:

Hash the user_id (or use its value directly if user_id is numeric), then take the result modulo a specific number. For example, if the application needs to split one database into four, we take the hash of user_id modulo 4, i.e. user_id % 4, which gives four possible outcomes: a result of 1 maps to DB1, 2 maps to DB2, 3 maps to DB3, and 0 maps to DB4. In this way the data is distributed across the four DBs very evenly.

Pros: uniform data distribution

Cons: data migration is cumbersome, and load cannot be apportioned according to each machine's performance

(3) Store the user-to-database mapping in a dedicated routing database:

That is, set up a separate DB whose only job is to store the mapping from user_id to the specific DB. Every access first queries this routing database to find which DB holds the data, then performs the actual query there.

Pros: very flexible; a one-to-one mapping between user and database

Cons: every query is preceded by an extra lookup, so performance is noticeably compromised
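The following sketch illustrates rules (2) and (3) as two interchangeable routing strategies (rule (1) was sketched earlier). The interface name, the four-database setup, and the in-memory map standing in for the routing database are all illustrative assumptions.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative routing strategy: given a user_id, return a logical DB name. */
interface DbRouteRule {
    String route(long userId);
}

/** Rule (2): hash/modulo routing across four databases (user_id % 4). */
class HashModRouteRule implements DbRouteRule {
    private static final int DB_COUNT = 4;

    @Override
    public String route(long userId) {
        long mod = userId % DB_COUNT;
        // Result 1 -> DB1, 2 -> DB2, 3 -> DB3, 0 -> DB4, as described above.
        return mod == 0 ? "DB4" : "DB" + mod;
    }
}

/** Rule (3): look up the mapping in a dedicated routing store (a map stands in for it here). */
class LookupRouteRule implements DbRouteRule {
    private final Map<Long, String> userToDb = new ConcurrentHashMap<>();

    public void register(long userId, String dbName) {
        userToDb.put(userId, dbName);
    }

    @Override
    public String route(long userId) {
        String db = userToDb.get(userId);
        if (db == null) {
            throw new IllegalStateException("No DB mapping registered for user_id " + userId);
        }
        return db;
    }
}

In a real system the LookupRouteRule would of course query the routing database (and usually cache the result) rather than hold the whole mapping in memory.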

These are the three approaches we normally choose from in development; some more complex projects may mix all three. The description above should give a basic understanding of the partitioning rules. There are certainly better and more complete ways to split, waiting to be explored and discovered.

Chapter 3: Outline of the Research

In the text above, following the usual pattern of understanding things (what, why, how), we summarized some concepts of database partitioning, its significance, and some common partitioning rules. The distributed data layer discussed in this topic is not limited to that, however; it is a complete data-layer solution. What exactly does it consist of? The rest of this article lays out the complete idea and the way this research topic is realized.

The distributed data scheme provides the following functions:

(1) It provides the partitioning rules and routing rules (RouteRule, abbreviated RR); the three partitioning rules described above are embedded directly in the system, and the concrete way they are embedded is described and discussed in detail later;

(2) It introduces the concept of a cluster (Group) to guarantee high availability of the data;

(3) It introduces a load balancing policy (LoadBalancePolicy, abbreviated LB);

(4) It introduces an availability detection mechanism for cluster nodes, so that the availability of each machine is known, the LB policy is applied correctly, and the system remains highly stable;

(5) It introduces read/write separation to improve query speed.

A data-layer design with only partitioning is not yet complete. What happens when the DB server on some node goes down? Because we partitioned the database, the complete DB now consists of N machines; if one of them fails, only 1/N of the data becomes inaccessible. That is acceptable, at least much better than before partitioning, when the whole DB would become unreachable. In ordinary applications, data being unavailable because of such a machine failure can be tolerated; but what if our system is a high-concurrency e-commerce site? The economic loss caused by a single node's outage would be very serious. In other words, our scheme still has a problem: its fault tolerance does not stand up to scrutiny. Of course, every problem has a solution. We introduce the concept of a cluster, here called a Group: for each partition (each sub-library node) we deploy several machines holding identical data; in normal operation these machines share the load, and when one of them goes down the load balancer distributes the load across the remaining machines. This solves the fault-tolerance problem, so we introduce the cluster concept and embed it in the framework as one of its parts.

As the diagram in the original post shows, the entire data layer consists of three clusters, Group1, Group2, and Group3; these three clusters are the result of partitioning the data horizontally, and together they make up a DB containing the complete data set. Each Group consists of one master (there can of course be more than one) and N slaves, and the data on the master and the slaves is kept consistent. If, say, one slave in Group1 goes down, two slaves are still usable; such a model never leaves part of the data inaccessible unless every machine in the whole Group fails at once, and the probability of that is very small (barring something like a power outage).
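A minimal data-structure sketch of such a Group might look as follows; the class and field names are illustrative, and javax.sql.DataSource is used simply as a stand-in for a handle to one physical database.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.sql.DataSource;

/**
 * Illustrative sketch of a cluster node (Group): one master plus N slaves,
 * all holding the same slice of the data.
 */
public class Group {
    private final String name;              // e.g. "Group1"
    private final DataSource master;        // all writes go here
    private final List<DataSource> slaves;  // reads are spread over these

    public Group(String name, DataSource master, List<DataSource> slaves) {
        this.name = name;
        this.master = master;
        this.slaves = new ArrayList<>(slaves);
    }

    public String getName()             { return name; }
    public DataSource getMaster()       { return master; }
    public List<DataSource> getSlaves() { return Collections.unmodifiableList(slaves); }
}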

Before the cluster was introduced, a query roughly proceeded like this: the request reaches the data layer and carries the distinguishing column needed for partitioning (typically user_id); the data layer routes on that column to a specific DB, and the data operation is performed inside that determined DB. What changes once the cluster is introduced? The rules and policies on the router can now only route to a specific Group, that is, to a virtual group rather than a concrete physical database server. The next step is to find the concrete physical DB server on which to perform the operation, and for that step we introduce the load balancer (LB). The load balancer's responsibility is to locate a concrete DB server. Its rules are as follows: the load balancer inspects the read/write character of the current SQL; if it is a write, or an operation that requires strong real-time consistency, the load is sent directly to the master; if it is a read, a slave is chosen according to the load balancing policy.

The main focus of our load balancer is the distribution policy. Typical choices are random load balancing and weighted load balancing. Random balancing is easy to understand: one slave is picked at random out of N. Such random balancing ignores machine performance; it implicitly assumes every machine performs the same, and if that really is the case it is perfectly reasonable. But what if it is not? If the physical performance and configuration of the slaves differ, random balancing regardless of performance is quite unscientific: it places unnecessary load on weaker machines, possibly even risking their outage, while the high-performance servers never reach their potential. For this reason we introduce weighted load balancing: through an interface of the system, each DB server can be assigned a weight, and at run time the LB distributes load within the cluster in proportion to those weights. Introducing this concept of course increases the system's complexity and maintenance burden; something is always lost, and there is no way around it.
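Below is a minimal sketch of weighted random selection among slaves, one common way to implement such a weighted LB policy; the class names and the use of javax.sql.DataSource as the handle to a physical database are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import javax.sql.DataSource;

/**
 * Illustrative weighted random selection: a slave is picked with probability
 * proportional to its assigned weight.
 */
public class WeightedRandomLoadBalancer {

    /** A slave database together with its configured weight. */
    public static final class WeightedSlave {
        final DataSource dataSource;
        final int weight; // higher weight = larger share of read load

        public WeightedSlave(DataSource dataSource, int weight) {
            if (weight <= 0) {
                throw new IllegalArgumentException("weight must be positive");
            }
            this.dataSource = dataSource;
            this.weight = weight;
        }
    }

    private final List<WeightedSlave> slaves = new ArrayList<>();
    private final Random random = new Random();
    private int totalWeight;

    public synchronized void addSlave(DataSource ds, int weight) {
        slaves.add(new WeightedSlave(ds, weight));
        totalWeight += weight;
    }

    /** Picks one slave; each slave's chance is weight / totalWeight. */
    public synchronized DataSource selectSlave() {
        if (slaves.isEmpty()) {
            throw new IllegalStateException("no slaves registered");
        }
        int point = random.nextInt(totalWeight);
        for (WeightedSlave slave : slaves) {
            point -= slave.weight;
            if (point < 0) {
                return slave.dataSource;
            }
        }
        return slaves.get(slaves.size() - 1).dataSource; // defensive fallback
    }
}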

With partitioned databases, clusters, and a load balancer, is everything settled? Things are far less simple than that. With these pieces we can basically guarantee that the data layer withstands heavy pressure, but this design still cannot completely avoid the harm of a database outage. If slave2 in Group1 goes down and the system's LB does not know it, the situation is actually quite dangerous: being unaware, the LB still considers slave2 usable and keeps assigning load to it, and the client naturally hits errors or exceptions when its data operations fail. That is very unfriendly! How do we solve this problem? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability push mechanism. What is the difference between the two? First, the detection mechanism: as the name implies, the data-layer client itself periodically probes each database in the cluster, for example by attempting a connection or probing the database port; with JDBC this can be done by attempting a connection and using Java's exception mechanism to judge availability (more on this below). What about the push mechanism? That one is best discussed in a real operational setting: when a production database goes down, the DBA certainly knows about it, and at that moment the DBA pushes the database's current state, via a program, to the client, i.e. to the application side of the distributed data layer, which then updates its local list of DB states and tells the LB that this node is unusable and should receive no load. One is an active monitoring mechanism, the other a passive notification mechanism; each has its strengths, but both achieve the same effect. With either in place, the hypothetical problem above will not occur, or if it does, its probability is minimized.
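A minimal sketch of the active detection approach, using a plain JDBC connection attempt and the exception mechanism mentioned above; the timeout values and method names are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Illustrative availability probe: try to open a JDBC connection and treat
 * any SQLException as "node unavailable".
 */
public class DbAvailabilityChecker {

    static {
        DriverManager.setLoginTimeout(3); // seconds; illustrative timeout
    }

    /** Returns true if a connection to the given JDBC URL can be opened. */
    public boolean isAvailable(String jdbcUrl, String user, String password) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            return conn.isValid(2); // extra liveness check, 2-second timeout
        } catch (SQLException e) {
            // The node is treated as down; the LB should stop routing load to it.
            return false;
        }
    }
}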

Master and slave were mentioned above without much explanation. As the diagram shows, a Group consists of one master and N slaves. Why? The master carries the write load, that is, everything written goes through the master, while read operations are distributed over the slaves. This greatly improves read efficiency. In typical Internet applications, surveys have concluded that the read/write ratio is roughly 10:1, i.e. the bulk of data operations are reads, which is exactly why we run multiple slaves. But why separate reads from writes at all? Developers familiar with databases know that write operations involve locking, whether row locks, table locks, or block locks, all of which reduce system efficiency. Our separation concentrates writes on one node while the other N nodes serve reads, which from another angle effectively improves read efficiency and keeps the system highly available. Read/write separation also brings new problems, for example how the data on the master is kept synchronized and consistent with the other slave machines in the cluster. This is not something we need to worry about much here: MySQL's proxy mechanism can do it for us, and since that mechanism is not very relevant to this topic it is not introduced in detail here.
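As a minimal sketch of the read/write split described above, the snippet below decides whether a statement goes to the master or to an LB-chosen slave; classifying SQL by its leading keyword is a simplification, and the WeightedRandomLoadBalancer type reuses the illustrative class sketched earlier.

import java.util.Locale;
import javax.sql.DataSource;

/**
 * Illustrative read/write routing inside one Group: writes (and statements that
 * need strong real-time consistency) go to the master, reads go to a slave
 * chosen by the load balancer.
 */
public class ReadWriteRouter {

    private final DataSource master;
    private final WeightedRandomLoadBalancer slaveBalancer;

    public ReadWriteRouter(DataSource master, WeightedRandomLoadBalancer slaveBalancer) {
        this.master = master;
        this.slaveBalancer = slaveBalancer;
    }

    /** Picks the physical DataSource for the given SQL statement. */
    public DataSource choose(String sql, boolean forceMaster) {
        if (forceMaster || !isReadOnly(sql)) {
            return master;
        }
        return slaveBalancer.selectSlave();
    }

    /** Very rough classification: only plain SELECTs are treated as reads. */
    private boolean isReadOnly(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        return head.startsWith("select");
    }
}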

To sum up, the above is the general functionality of the distributed data layer studied in this topic, and these are the basic principles it rests on. Next we move to the system design level for deeper analysis and research.

Chapter 4: System Design

4.1 Choice of system implementation level

As mentioned in the introduction, there are two levels at which the system could be implemented: at the JDBC level, or on top of an existing data persistence framework such as Hibernate or iBATIS. Each level has its strengths and its shortcomings. Implementing at the JDBC level would greatly raise the difficulty of developing and later using the system, and would greatly increase its development and maintenance cost. The direction chosen for this topic is to build an upper-layer wrapper on top of the mature iBATIS persistence framework, rather than modifying the iBATIS source code directly, so that the system is not too intrusive to the existing framework and remains flexible to use. iBATIS was chosen for the following reasons:

(1) The learning cost of iBATIS is very low; a competent Java programmer can use it proficiently within a very short time;

(2) iBATIS is a lightweight ORM. It simply performs the RO/OR mapping, and its query statements are written as plain native SQL configured in sql-map.xml files; it does not introduce anything like Hibernate's HQL. This keeps the SQL fully under control, so a good DBA can optimize it at the SQL level, making the data layer of the application highly controllable. Hibernate is powerful, but because it is a heavyweight OR wrapper and introduces the concept of HQL, it is harder for the DBA team to tune and control the performance of the SQL statements.

For these two reasons, this topic chooses the easy-to-use, lightweight persistence framework iBATIS as its ORM product. The discussion that follows is therefore specific to iBATIS.
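As a hedged sketch of how such a wrapper might sit on top of iBATIS (2.x), the snippet below keeps one SqlMapClient per physical database and lets a routing rule pick which one to use. The statement id "Article.getById", the map of clients, and the DbRouteRule interface from the earlier sketch are all illustrative assumptions, not the actual design of the topic's framework.

import java.util.Map;
import com.ibatis.sqlmap.client.SqlMapClient;

/**
 * Illustrative wrapper over iBATIS: one SqlMapClient per physical DB,
 * chosen by a routing rule before the real query is delegated.
 */
public class RoutingArticleDao {

    private final Map<String, SqlMapClient> clientsByDb; // e.g. "DB1" -> its SqlMapClient
    private final DbRouteRule routeRule;                  // e.g. the HashModRouteRule sketched earlier

    public RoutingArticleDao(Map<String, SqlMapClient> clientsByDb, DbRouteRule routeRule) {
        this.clientsByDb = clientsByDb;
        this.routeRule = routeRule;
    }

    /** Routes on user_id, then runs the statement on the chosen database. */
    public Object getArticleById(long userId, long articleId) throws java.sql.SQLException {
        String dbName = routeRule.route(userId);
        SqlMapClient client = clientsByDb.get(dbName);
        // "Article.getById" is an illustrative statement id defined in sql-map.xml.
        return client.queryForObject("Article.getById", articleId);
    }
}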

4.2 Selection of other open source frameworks

In large Java applications we typically adopt an open-source framework such as Spring, in particular its IoC (DI) container, which effectively helps developers manage object dependencies and layering and reduces the coupling between the layers of the system. The advantages and usefulness of Spring are well known to developers and need not be repeated here. The data layer of this topic also adopts Spring as its IoC (DI) framework.
