Chapter 1: Introduction
With the widespread adoption of Internet applications, the storage and access of massive data has become a bottleneck in system design. For a large-scale Internet application, billions of page views (PV) per day place an extremely high load on the database and create serious problems for system stability and scalability. Splitting the data to improve site performance and scaling out the data layer has become the preferred approach for architects.
- Horizontal database segmentation: reduces the load on a single machine and minimizes the damage caused by downtime;
- Load balancing strategy: reduces the access load on a single machine and lowers the chance of downtime;
- Cluster solution: solves the problem of the database becoming inaccessible when a single-point database server goes down;
- Read/write separation strategy: maximizes read speed and read concurrency in the application;
Chapter 2: Fundamentals and Concepts
What is data segmentation
The word "Shard" means "fragment" in English. As a database-related technical term, it seems to have appeared first in massively multiplayer online role-playing games, where each partition of the game world was called a "shard". Sharding is not a feature attached to a particular database product, but an abstraction built on top of specific technical details: a solution for horizontal scaling (scale-out), whose main purpose is to go beyond the I/O capacity limits of a single-node database server and solve database scalability problems. Data is distributed horizontally across different DBs or tables according to a set of sharding rules, and each query is directed to the specific DB or table by the corresponding DB routing or table routing rule. "Sharding" here usually refers to "horizontal partitioning", which is the focus of this paper. Here is a simple example: consider the article table in a blog application, which stores log entries and contains, among other fields, a user_id column identifying the author.
Facing such a table, how do we split it? How do we distribute its data across tables in different databases? We can put all article rows whose user_id falls in 1~10000 into the article table in DB1, all article rows whose user_id falls in 10001~20000 into the article table in DB2, and so on. In this way the article data is naturally partitioned across databases, achieving the goal of data segmentation.
The next problem is how to find the specific database for a query. The answer is simple and obvious: since we used user_id as the partitioning field when splitting the databases, database routing naturally uses user_id as well. Knowing the user_id of a blog entry, we apply the same rule that was used for splitting in order to locate the specific database. For example, if user_id is 234, the rule above routes us to DB1; if user_id is 12343, the rule routes us to DB2. In the same way, using the partitioning rule in reverse to locate a specific DB is a process we call "DB routing".
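To make the routing rule concrete, here is a minimal sketch in Java, assuming a range size of 10,000 user_ids per database and logical names DB1, DB2, ...; the class name RangeDbRouter is purely illustrative and not part of any particular framework.

```java
// Minimal sketch of range-based DB routing: user_id 1~10000 -> DB1, 10001~20000 -> DB2, ...
public class RangeDbRouter {

    private static final long RANGE_SIZE = 10_000;  // assumed range size per database
    private final int dbCount;

    public RangeDbRouter(int dbCount) {
        this.dbCount = dbCount;
    }

    /** Returns a logical DB name such as "DB1" for the given user_id. */
    public String route(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive: " + userId);
        }
        long index = (userId - 1) / RANGE_SIZE + 1;  // 234 -> 1, 12343 -> 2
        if (index > dbCount) {
            throw new IllegalStateException("user_id " + userId + " is outside the configured ranges");
        }
        return "DB" + index;
    }

    public static void main(String[] args) {
        RangeDbRouter router = new RangeDbRouter(4);
        System.out.println(router.route(234));    // DB1
        System.out.println(router.route(12343));  // DB2
    }
}
```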
Normally we design our databases according to normalization rules, but the DB design required for data segmentation may violate some of those usual rules and constraints. In order to partition, we have to keep redundant fields in the database tables to serve as the partitioning key or marker field, such as the user_id field in the article example above (admittedly the example does not illustrate redundancy very well, because user_id would have to appear in that table even without partitioning; it is simply convenient to reuse it). Of course, redundant fields do not appear only in partitioning scenarios; in many large applications redundancy is also necessary. That touches on efficient DB design, which this article will not go into.
Why data segmentation
The previous section gave a brief description of what data segmentation is, and the reader may wonder why it is needed at all. Isn't a mature and stable database like Oracle enough to support the storage and querying of massive amounts of data? Why do we need data segmentation?
Indeed, Oracle is mature and stable, but its high cost of use and its need for high-end hardware are not something every company can afford. Imagine annual license fees in the tens of millions plus minicomputers costing tens of millions of dollars as the hardware platform: can an ordinary company pay for that? Even if it could, why would we not choose a better solution, one that is cheaper and scales out with better performance?
We know that every machine, no matter how well configured, has its physical limits. When our application reaches or goes far beyond the limits of a single machine, we can only seek help from more machines or keep upgrading our hardware; the common approach is to scale out, sharing the load by adding more machines. We also have to ask: as our business logic grows, can our capacity keep up through linear growth? Sharding makes it easy to distribute computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing power while avoiding single points of failure, improving system availability and providing good fault isolation.
Considering all of the above factors, data segmentation is clearly necessary. Using free MySQL on cheap servers, or even a PC cluster, we can achieve the effect of a minicomputer plus a commercial enterprise DB, reducing capital investment and operating costs. Why not? So we choose sharding, and embrace sharding.
How to do data segmentation
Data segmentation can be physical: data is distributed to different DB servers according to a set of segmentation rules, and routing rules direct each access to a specific database, so that every access is served not by a single server but by one of n servers. This reduces the load on any single machine.
Data segmentation can also be done within a database: data is distributed, according to a set of segmentation rules, across different tables in the same database; for example, the article table can be split into sub-tables such as article_001 and article_002, and together these horizontal sub-tables logically make up the complete article table. The motivation is simple. Suppose the article table currently holds 50 million rows and we need to insert a new row. After the insert, the database must update the index on the table, and the overhead of maintaining an index over 50 million rows is not negligible. If instead we split the table into 100 sub-tables, from article_001 to article_100, the 50 million rows are spread so that each sub-table holds only about 500,000 rows. Maintaining the index on insert in a table of 500,000 rows then costs an order of magnitude less, greatly improving the DB's run-time efficiency and increasing its concurrency. Table splitting has further benefits that are less obvious, such as reduced contention from the locking that write operations require.
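As a small illustration of sub-table routing, the following sketch maps an article id to one of 100 sub-table names; using the article id modulo 100 as the routing key is an assumption made for this example, not the only possible rule, and the helper name is hypothetical.

```java
// Minimal sketch of routing an article row to one of 100 sub-tables (article_001 ... article_100).
public class ArticleTableRouter {

    private static final int TABLE_COUNT = 100;

    /** Maps an article id to a physical table name such as "article_046". */
    public static String tableFor(long articleId) {
        long suffix = (articleId % TABLE_COUNT) + 1;  // 1 .. 100
        return String.format("article_%03d", suffix);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(12345L));  // article_046
        System.out.println(tableFor(99L));     // article_100
    }
}
```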
In summary, database partitioning reduces the load on a single machine, and table partitioning improves the efficiency of data operations, especially writes. So far we have not yet dealt with how exactly to split the data; next we describe and explain the segmentation rules.
As mentioned above, in order to achieve horizontal segmentation, every table must contain a redundant field that serves as the partitioning key and marker field. In typical applications we choose user_id as this field, and on that basis there are the following three database partitioning methods and rules (other approaches are of course possible):
(1) Range-based partitioning
user_id values 1~1000 map to DB1, 1001~2000 map to DB2, and so on;
Pros: data can be migrated partially, one range at a time
Cons: Uneven data distribution
(2) Hash modulo partitioning
Hash the user_id (or use the user_id value directly if it is numeric), then take it modulo a specific number. For example, if the application needs to split one database into 4, we take the hash of user_id modulo 4, i.e. user_id % 4, which gives four possible results: a result of 1 maps to DB1, 2 maps to DB2, 3 maps to DB3, and 0 maps to DB4. This distributes the data very evenly across the 4 DBs (a sketch of this approach follows the list below).
Pros: uniform data distribution
Cons: data migration is cumbersome, and load cannot be apportioned according to machine performance
(3) Storing the database mapping in a separate lookup library
That is, we set up a separate DB that stores only the mapping from user_id to DB. Every database access first queries this lookup DB to obtain the specific target DB, and then performs the required operation there (see the sketch after this list).
Pros: strong flexibility, one-to-one mapping
Cons: one extra query before every data access, so performance suffers considerably
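To make rules (2) and (3) concrete, here is a hedged sketch of both: a modulo router following the 1→DB1, 2→DB2, 3→DB3, 0→DB4 mapping described above, and a directory lookup in which an in-memory map stands in for the separate lookup DB. All class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketches of partitioning rules (2) and (3).
public class PartitionRules {

    /** Rule (2): hash/modulo partitioning, user_id % 4 -> DB1..DB4 (a remainder of 0 maps to DB4). */
    public static String moduloRoute(long userId) {
        long r = userId % 4;
        return r == 0 ? "DB4" : "DB" + r;
    }

    /** Rule (3): directory lookup; this map stands in for the separate lookup DB. */
    private final Map<Long, String> userToDb = new HashMap<>();

    public void register(long userId, String db) {
        userToDb.put(userId, db);
    }

    public String lookupRoute(long userId) {
        String db = userToDb.get(userId);
        if (db == null) {
            throw new IllegalStateException("no mapping for user_id " + userId);
        }
        return db;
    }

    public static void main(String[] args) {
        System.out.println(moduloRoute(234));        // 234 % 4 = 2 -> DB2
        PartitionRules rules = new PartitionRules();
        rules.register(234L, "DB7");
        System.out.println(rules.lookupRoute(234L)); // DB7
    }
}
```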
These are the three approaches we normally choose from in development; some more complex projects may mix all three. Through the above description we have gained a basic understanding of the partitioning rules. There are, of course, better and more complete ways to partition, which await our continued exploration and discovery.
Chapter 3: Basic outline of this research
The distributed data layer scheme provides the following functions:
(1) It provides database partitioning rules and routing rules (RouteRule, abbreviated RR);
(2) It introduces the concept of a cluster (Group) to ensure high availability of data;
(3) It introduces a load balancing policy (LoadBalancePolicy, abbreviated LB);
(4) It introduces an availability detection mechanism for cluster nodes, which checks whether individual machines are reachable so that the LB policy is applied correctly and the system remains highly stable;
(5) It introduces read/write separation to improve data query speed; a rough sketch of these components as interfaces follows this list.
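As an illustration of how these components might look in code, the following interfaces use the names suggested by the abbreviations above (RouteRule/RR, LoadBalancePolicy/LB, plus an availability detector); the exact signatures are assumptions made for illustration, not an existing API.

```java
import java.util.List;

// Hedged sketch of the data-layer components as interfaces.
interface RouteRule {
    /** Maps a partitioning key (e.g. user_id) to a logical group name. */
    String routeToGroup(long partitionKey);
}

interface LoadBalancePolicy {
    /** Picks one physical node from a group's available slaves for a read. */
    String selectNode(List<String> availableSlaves);
}

interface AvailabilityDetector {
    /** Returns true if the given node currently accepts connections. */
    boolean isAlive(String node);
}
```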
A data layer design consisting only of database and table partitioning is still not complete. When we partition, n machines together make up one complete DB. If one machine goes down, only one n-th of the DB's data becomes inaccessible. That is acceptable, or at least much better than before partitioning, when the whole DB would have become inaccessible.
In ordinary applications, data becoming unavailable because of such machine failures may be acceptable, but what if our system is a high-concurrency e-commerce site? The economic loss caused by a single node going down can be very serious. In other words, our current solution still has problems, and its fault tolerance does not stand up to scrutiny. Of course, problems always have solutions. We introduce the concept of a cluster, which here I call a Group: for each partition (library) node we introduce multiple machines, each holding the same data. Under normal circumstances the load is shared among these machines, and when one of them goes down the load balancer distributes its load to the remaining machines. This solves the fault-tolerance problem.
As shown in the figure, the entire data layer consists of three clusters, Group1, Group2, and Group3; these three clusters are the result of horizontally partitioning the data, and together they form one complete DB. Each Group consists of 1 master (there can of course be more than one master) and N slaves, whose data are kept consistent with the master. For example, if one slave in Group1 goes down, two other slaves are still available; such a model will never make part of the data inaccessible, unless every machine in the entire Group goes down at once, which is very unlikely (barring something like a power outage).
Before the cluster was introduced, a query proceeded roughly as follows: the application requests the data layer and passes the necessary partitioning field (typically user_id); the data layer routes to the specific DB according to this field and performs the data operation within that DB.
That was the flow without the cluster; what does it look like once the cluster is introduced? The rules and policies on our router can now only route to a specific Group, that is, to a virtual group rather than to a particular physical server. The next step is to find a specific physical DB server on which to perform the data operation.
Based on the requirements of this step, we introduce the concept of a load balancer (LB). The load balancer's duty is to locate a specific DB server. The rule is as follows: the load balancer analyzes whether the current SQL statement is a read or a write; if it is a write, or an operation that requires strong real-time consistency, the query load is dispatched directly to the master; if it is a read, a slave is chosen according to the load balancing policy.
The main focus of our load balancer is the load distribution policy; load balancing here typically includes random load balancing and weighted load balancing. Random load balancing is easy to understand: one slave is chosen at random from the N slaves. Such random balancing ignores machine performance and implicitly assumes every machine performs the same. If that really is the case, it is reasonable. But what if it is not? When the physical capability and configuration of each slave differ, random, performance-agnostic load balancing is unscientific: it puts unnecessary high load on weaker machines, possibly even risking downtime, while high-performance database servers never use their full capacity. With this in mind, we introduce weighted load balancing: through some interface in our system, each DB server can be assigned a weight, and at run time the LB distributes load to the DB servers in the cluster in proportion to their weights. Of course, introducing this concept inevitably increases the complexity and maintenance burden of the system; something is always lost, and there is no escaping that. A sketch of such a dispatcher appears below.
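The following sketch, assuming hypothetical class and node names, combines the two rules described above: writes always go to the master, and reads pick a slave by weighted random selection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of a group dispatcher: writes -> master, reads -> weighted random slave.
public class GroupDispatcher {

    static final class Slave {
        final String name;
        final int weight;  // higher weight = stronger machine = larger share of reads
        Slave(String name, int weight) { this.name = name; this.weight = weight; }
    }

    private final String master;
    private final List<Slave> slaves = new ArrayList<>();
    private final Random random = new Random();

    public GroupDispatcher(String master) { this.master = master; }

    public void addSlave(String name, int weight) { slaves.add(new Slave(name, weight)); }

    /** Returns the node that should execute the statement. */
    public String dispatch(boolean isWrite) {
        if (isWrite || slaves.isEmpty()) {
            return master;                   // writes (and reads with no slaves) hit the master
        }
        int total = slaves.stream().mapToInt(s -> s.weight).sum();
        int ticket = random.nextInt(total);  // weighted random pick
        for (Slave s : slaves) {
            ticket -= s.weight;
            if (ticket < 0) {
                return s.name;
            }
        }
        return master;                       // defensive fallback, not normally reached
    }

    public static void main(String[] args) {
        GroupDispatcher group1 = new GroupDispatcher("group1-master");
        group1.addSlave("group1-slave1", 1);
        group1.addSlave("group1-slave2", 3); // slave2 receives about 3x the read load
        System.out.println(group1.dispatch(true));   // group1-master
        System.out.println(group1.dispatch(false));  // one of the slaves
    }
}
```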
With database partitioning, with clusters, and with a load balancer, is everything fine? Things are far less simple than we might think. With these pieces we can basically guarantee that our data layer withstands heavy pressure, but such a design does not completely avoid the harm of a database going down. If slave2 in Group1 goes down, the system's LB does not know it; this is actually very dangerous, because the LB, unaware, still considers slave2 usable and keeps assigning load to it. As a result, problems appear, and the client naturally sees errors or exceptions from failed data operations.
This is very unfriendly! How do we solve the problem? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability push mechanism. What is the difference between the two? First, the detection mechanism: as the name implies, the data-layer client itself probes, at intervals, whether each database in the cluster is available; the implementation can be an attempt to open a connection or an attempt to reach the database port, either of which works.
What about the push mechanism? This is best discussed in a real application scenario. When a database in production goes down, the DBA is normally sure to know; at that point the DBA manually pushes the current state of the database, through a program, to the client, that is, to the application using the distributed data layer, which then updates its local list of DB states and tells the LB that this database node is unavailable and should not be assigned load. One mechanism is active monitoring, the other is passive notification. Each has its strengths, but both achieve the same effect: the hypothetical problem above either does not happen, or its probability is minimized. A minimal sketch of the active detection approach follows.
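This sketch assumes the check is a plain TCP connection attempt to the node's database port; the host name, port, and timeout are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch of the active detection mechanism: try the database port, mark the node down on failure.
public class NodeAvailabilityChecker {

    private static final int CONNECT_TIMEOUT_MS = 2_000;

    /** Returns true if the database port on the node accepts a TCP connection. */
    public static boolean isAlive(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
            return true;
        } catch (IOException e) {
            return false;  // the LB should stop assigning load to this node
        }
    }

    public static void main(String[] args) {
        // e.g. probe a MySQL node on its default port (hypothetical host name)
        System.out.println(isAlive("group1-slave2.example.internal", 3306));
    }
}
```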
The text above mentions master and slave without much explanation. A Group consists of 1 master and N slaves. Why? The master carries the write load, meaning all writes are done on the master, while reads are spread across the slaves. This greatly improves read efficiency. In typical Internet applications, surveys have concluded that the read/write ratio is around 10:1, that is, the vast majority of data operations are reads, which is why we have multiple slaves.
But why separate reads from writes? Developers familiar with DBs know that write operations involve locking, whether row locks, table locks, or block locks, all of which reduce system efficiency. Our separation concentrates the writes on one node while the other N nodes handle reads, which effectively improves read efficiency from another angle and ensures the high availability of the system.
Original:
http://zhengdl126.iteye.com/blog/419850
Analysis of the implementation principles of horizontal database partitioning: sub-databases, sub-tables, master-slave, clusters, and load balancing (repost)