Chapter 1 Introduction
When the data volume is huge, tables are first split across multiple databases, and rows are then distributed among those databases according to key columns. Once the data is distributed, queries, I/O, and other operations can be served by groups of machines working together. This article focuses on the principles of database sharding, table sharding, and load balancing for massive data, and proposes solutions.
With the popularization of Internet applications, the storage and retrieval of massive data has become the bottleneck of system design. For a large Internet application, billions of page views per day place a considerable load on the database, which poses serious problems for system stability and scalability. Data splitting is used to improve website performance, and horizontal scaling of the data layer has become the preferred approach for architects. Horizontally split databases reduce the load on any single machine and minimize the losses caused by downtime: a load balancing policy effectively lowers the access load on a single machine and so reduces the chance of downtime; a cluster solution removes the single point of failure caused by one database going down; and a read/write splitting policy maximizes the speed and concurrency of read operations in the application. Today many Internet companies in China, such as Taobao, Alibaba, and Tencent, use this kind of data splitting solution, and most of them have implemented their own distributed data access layer (DDAL). By implementation level (taking a Java application as an example), these fall into two layers: encapsulation at the JDBC layer, and implementation at the ORM framework layer. For direct encapsulation of the JDBC layer, a relatively well-developed project in China is Amoeba, developed by Alibaba's research institute; it is still in the testing stage (beta), and its runtime efficiency and production readiness remain to be examined. As for implementation at the ORM framework layer, Taobao's distributed data access layer based on iBatis and Spring has been in use for many years, and its efficiency and production effectiveness have been affirmed by developers and users. This article discusses a distributed data access layer based on the ORM framework layer.
The difficulty of this project lies in formulating and selecting routing rules after database sharding, and in future scalability, for example how to expand database capacity (add machine nodes) while migrating the least possible data. The core issues are the routing rules and load balancing policies for database and table sharding.
Chapter 2 Basic principles and concepts
2.1 Basic Principles:
The process of human cognition of a problem always follows the pattern: What? Why? How?
Next, this article will discuss and study these three questions:
2.1.1 what is data splitting
The English word "shard" means "fragment." As a database-related technical term, it seems to have first appeared in large multiplayer online role-playing games; we may translate "sharding" as "fragmentation." Sharding is not a new technology but a relatively simple software concept. As is well known, MySQL gained its table partitioning function only in version 5; before that, many potential MySQL users worried about MySQL's scalability, and whether partitioning was available was a key indicator (though of course not the only one) for measuring a database's scalability. Database scalability is an eternal topic, and MySQL promoters were often asked: without partitioning, how can an application's data be handled in a single database? The answer is sharding. Sharding is not a feature attached to a particular database product but an abstraction built on top of specific technical details. It is a horizontal scale-out solution whose main purpose is to break through the I/O limits of a single-node database server and solve the database scalability problem.
Data is horizontally distributed to different databases or tables through a set of sharding rules, and the specific DB or table to query is found through the corresponding DB routing or table routing rule before the query is executed. The "sharding" mentioned here usually refers to "horizontal splitting," which is also the focus of this article. What do the concrete splitting and routing methods look like? The reader may be wondering at this point, so let us give a simple example using the articles in a blog application. Suppose the blog's article table includes, among its other columns, a user_id field.
How can we split such a table, and distribute its rows into tables in different databases? Analyzing the blog application, it is not hard to see that it has two kinds of users: visitors and blog owners. A visitor browses the blog of a specific user, and a blog owner manages articles in his or her own space; either way, every operation happens under a specific user's blog. The field that identifies that user is user_id, and this user_id is the basis of our sharding rule. We can proceed as follows: put all article rows with user_id 1~10000 into the article table in DB1, all rows with user_id 10001~20000 into the article table in DB2, and so on up to DBn. The article data is thus naturally divided among the databases, achieving the goal of data splitting. The next problem to solve is how to find the specific database again. This is in fact simple and obvious: since we used the user_id field to split the databases, the routing process naturally also relies on user_id. Whether a request is visiting someone else's blog or managing one's own, it always knows which blog user is involved, that is, the user_id; with that user_id and the sharding rule we can locate the concrete database. For example, if user_id is 234, the rule locates DB1; if user_id is 12343, the rule locates DB2. Using the sharding rule in reverse to route to a specific database in this way is what we call "DB routing."
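The range-based "DB routing" just described can be sketched in a few lines of Java. This is a minimal illustration; the class and method names are invented for this sketch and are not part of any actual system described in the article.

```java
// Hypothetical sketch of range-based DB routing:
// user_id 1~10000 -> DB1, 10001~20000 -> DB2, and so on.
public class RangeDbRouter {
    private static final int RANGE_SIZE = 10000; // ids per database shard

    /** Returns the 1-based index of the target database (DB1, DB2, ...) for a user_id. */
    public static int routeDb(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive");
        }
        // (userId - 1) / RANGE_SIZE gives 0 for 1~10000, 1 for 10001~20000, ...
        return (int) ((userId - 1) / RANGE_SIZE) + 1;
    }

    public static void main(String[] args) {
        System.out.println("user_id 234   -> DB" + routeDb(234));    // DB1
        System.out.println("user_id 12343 -> DB" + routeDb(12343));  // DB2
    }
}
```

Given user_id 234 this yields DB1, and given 12343 it yields DB2, matching the example in the text.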
Of course, a database designed around data splitting is unconventional rather than "orthodox." So what kind of DB design is an orthodox one?
Normally we consciously design our databases according to normal forms, and under high load we may rely on replication mechanisms to improve read/write throughput and performance. This already satisfies many requirements, but the defects of that mechanism are obvious (as discussed below). The phrase "consciously follow normal-form design" is the point: a database designed for data splitting violates this rule and constraint. In order to split, each table has to carry a redundant column that serves as the distinguishing or marker field for sharding, such as the user_id field in the article example above (admittedly, that example does not really show user_id as redundant, since user_id would appear in the article table even without sharding; there we simply got it for free). Redundant fields do not appear only with sharding: in many large applications redundancy is also necessary. That touches on efficient DB design in general and will not be repeated here.
2.1.2 why data splitting
Having briefly described and explained what data splitting is, the reader may wonder why data splitting is required at all. Isn't a mature, stable database such as Oracle enough to support the storage and querying of massive data? Why slice the data? Indeed, Oracle is mature and stable, but not every company can afford its high usage cost and the high-end hardware it requires. Imagine annual license fees running into the tens of millions, plus minicomputers as the hardware platform: can an ordinary company pay for that? And even if it could, if there is a cheaper solution with better horizontal scalability, why not choose it?
However, things are never that satisfying. As noted, we usually design our databases according to normal forms and, under high load, use replication mechanisms to improve read/write throughput and performance, which already meets many requirements; but the defects of this mechanism are quite obvious. First, its effectiveness depends on the proportion of read operations. The master often becomes the bottleneck, since write operations must be executed in a sequential queue; once the master is overloaded it cannot keep up, the replication latency of the slaves may grow large, and writes consume considerable CPU, because every write executed on the master must be executed again on each slave. At that point replication becomes of little further value. So if replication falls short, why does sharding work? The principle is very simple: it scales well. We know that every machine, no matter how well configured, has its own physical ceiling, so when our application reaches or far exceeds the ceiling of a single machine, we can only seek help from other machines or upgrade our hardware. The common solution is horizontal scaling: add more machines and share the pressure among them, asking whether machine capacity can keep growing roughly linearly with the business. Sharding makes it easy to distribute computing, storage, and I/O across multiple machines in parallel, taking full advantage of their combined processing power while avoiding a single point of failure, improving system availability, and providing fault isolation.
For the above reasons, data splitting is necessary, and the data splitting discussed here uses MySQL as its background: for cost reasons, many companies choose the free and open-source MySQL. Developers with some knowledge of MySQL will recall that the table partitioning function arrived only with MySQL 5; before that, many potential MySQL users worried about its scalability, partitioning being a key (though not the only) indicator of a database's scalability. Database scalability is an eternal topic, and when asked how to handle an application's data in a single database without partitioning, MySQL promoters answered: sharding, which is exactly the data splitting solution we are discussing.
With free MySQL and cheap servers, even PCs, assembled into clusters, we can achieve the effect of minicomputers and large commercial databases while cutting capital investment and reducing operating costs. Why not? So we choose sharding, and embrace sharding.
2.1.3 how to split data
Having covered what data splitting is and why it is needed, let us now explain in detail the methods and forms of data splitting.
Data splitting can be physical: data is distributed to different DB servers through a set of splitting rules, and a specific database is reached through routing rules, so that each access faces not a single server but one of N servers, reducing the load pressure on any single machine.
Data splitting can also happen inside one database, distributing data among the tables of a single DB through splitting rules. For example, the article table can be divided into sub-tables such as article_001 and article_002, whose horizontal union forms a logically complete article table. The reasoning is simple: suppose the article table holds 50 million rows and we insert a new row; after the insert completes, the database must update the index for this table, and the overhead of maintaining an index over 50 million rows cannot be ignored. If instead we divide the table into 100 sub-tables, article_001 through article_100, with the 50 million rows spread evenly, each sub-table holds only 500,000 rows, and the index maintenance cost after inserting into a table of 500,000 rows drops by orders of magnitude. This greatly improves the database's runtime efficiency and concurrency. The benefits of table sharding do not stop there: write operations involve locking (row locks, table locks, block locks), and splitting tables brings many obvious benefits in that respect as well.
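The article_001 through article_100 naming above implies a small routing step from an article's key to a sub-table name. A minimal sketch, assuming a hypothetical routeTable helper and a simple modulo spread (the article does not prescribe this exact rule):

```java
// Hypothetical sketch: map an article id onto 100 sub-tables
// named article_001 .. article_100.
public class SubTableRouter {
    private static final int TABLE_COUNT = 100;

    /** Returns the sub-table name for a given article id. */
    public static String routeTable(long articleId) {
        int mod = (int) (articleId % TABLE_COUNT);
        // Remainder 0 maps to the last table so suffixes run 001..100.
        int slot = (mod == 0) ? TABLE_COUNT : mod;
        return String.format("article_%03d", slot);
    }

    public static void main(String[] args) {
        System.out.println(routeTable(1));   // article_001
        System.out.println(routeTable(100)); // article_100
        System.out.println(routeTable(101)); // article_001
    }
}
```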
In summary, database sharding reduces the load on a single machine, while table sharding improves the efficiency of data operations, especially writes. So far, though, we have not yet addressed how exactly to split. Next, we describe the sharding rules in detail.
As mentioned above, to achieve horizontal data splitting, every table must carry a redundant column as the basis for splitting, the marker field, which in common applications is user_id. Based on it, there are three sharding methods and rules (other methods are of course possible):
(1) By number segment: split by user_id ranges, e.g. 1~1000 corresponds to DB1, 1001~2000 corresponds to DB2, and so on;
Advantage: data can be migrated partially, range by range
Disadvantage: uneven data distribution
(2) Hash modulo:
Hash the user_id (or use the value directly if user_id is numeric), then take it modulo a specific number. For example, if an application splits into four databases, we compute user_id % 4 on the hashed value, giving four possible results: 1 corresponds to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. Data is thus spread evenly over the four databases.
Advantage: Even Data Distribution
Disadvantage: data migration is troublesome (changing the number of databases forces rehashing), and load cannot be apportioned according to machine capacity.
(3) Save the database configuration in an authentication database
That is, create a separate database that stores only the mapping between user_id and the target DB. Every access first queries this database to obtain the concrete DB, and then performs the desired operation there.
Advantage: strong flexibility, one-to-one relationship
Disadvantage: an extra lookup is required before each query, which hurts performance.
The preceding three methods are the usual choices in development, and in some complex projects they may be used in combination. Through this description, we now have a basic understanding of the sharding rules. Better sharding methods surely exist, and we need to keep exploring and discovering them.
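Methods (2) and (3) above can be sketched in a few lines. The 4-database modulo mapping follows the example in the text (1 to DB1, 2 to DB2, 3 to DB3, 0 to DB4); the in-memory map is only a stand-in for the separate authentication database, and all class and method names are invented for this sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class ShardingRules {
    // (2) Hash modulo across 4 databases: user_id % 4, with remainder 0
    // mapped to DB4 as described in the text.
    public static int hashModRoute(long userId) {
        int mod = (int) (userId % 4);
        return mod == 0 ? 4 : mod;
    }

    // (3) Lookup routing: a HashMap stands in here for the separate
    // authentication database that stores the user_id -> DB mapping.
    private static final Map<Long, Integer> USER_TO_DB = new HashMap<>();

    public static void register(long userId, int db) {
        USER_TO_DB.put(userId, db);
    }

    public static int lookupRoute(long userId) {
        Integer db = USER_TO_DB.get(userId);
        if (db == null) {
            throw new IllegalStateException("no mapping for user " + userId);
        }
        return db;
    }
}
```

The lookup variant is the most flexible (any user can live in any DB) at the cost of one extra query per access, exactly the trade-off listed above.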
Chapter 3 Basic outline of this study
The text above followed the pattern of human cognition: what, why, how. It described the concepts and significance of database splitting and briefly introduced some common splitting rules. But the distributed data layer discussed in this topic is more than that: it is a complete data layer solution. What does it look like? Next, I will elaborate on the complete design ideas and implementation of this research topic.
The distributed data solution provides the following functions:
(1) it provides sharding rules and routing rules (RR), embedding the rules described above into the system; the concrete embedding method is described and discussed in detail later;
(2) it introduces the concept of a cluster to guarantee high data availability;
(3) it introduces a load balancing policy (LB);
(4) it introduces an availability detection mechanism for cluster nodes, periodically probing single machines so that the LB policy works correctly and the system stays highly stable;
(5) it introduces read/write splitting to improve query speed.
A data layer with only database and table sharding is not perfect. What happens when the DB server on one node goes down? With the splitting scheme, N machines together constitute the complete data; if one machine goes down, only one N-th of the data becomes inaccessible. That is tolerable, and at least far better than before splitting, when the whole DB would become unreachable. In ordinary applications such an outage may be acceptable; but what if our system is a high-concurrency e-commerce site? The economic loss from a single node's downtime would be severe. In other words, the scheme still has a problem: its fault tolerance does not stand the test. There is always a solution, of course: we introduce the concept of a cluster, here called a Group. Each sharded node is given multiple machines, each storing the same data; normally these machines share the load, and when one goes down, the load balancer distributes its load to the surviving machines. In this way the fault tolerance problem is solved, so we introduce the cluster concept and embed it into our framework as one of its parts.
As the figure shows, the entire data layer consists of three clusters, Group1, Group2, and Group3, which are the result of horizontal data splitting; together the three clusters form a DB containing the complete data. Each Group contains one Master (there can also be several) and N Slaves, whose data are kept consistent. If, say, one slave in Group1 goes down, two slaves remain usable. Such a model never renders part of the data inaccessible unless every machine in a Group goes down at once, and the probability of that is very small (barring a power outage, it is unlikely to happen).
Before the cluster is introduced, a query proceeds roughly as follows: request the data layer, passing the necessary sharding field (usually user_id); the data layer routes to a specific DB according to that field; the data operation is performed inside that DB. That is the picture without clusters. What does it look like with clusters? Our routing rules and policies can now route only to a specific Group, that is, to a virtual Group rather than a concrete physical server; the next step is to find a concrete physical DB server for the actual data operation. For this step we introduce the load balancer (LB). The load balancer is responsible for locating a concrete DB server, with the following rule: it analyzes the read/write character of the current SQL statement; a write, or an operation demanding high real-time consistency, is sent directly to the Master, while a read is dispatched to a slave through a load balancing policy. Our load balancer mainly studies the load distribution policy, which usually comes in two kinds: random load balancing and weighted load balancing. Random balancing is easy to understand: pick one slave at random out of the N slaves. It does not consider machine capacity, assuming by default that every machine performs the same. If that assumption holds, fine; but what if it does not? When the slave machines differ in physical capability and configuration, ignoring performance and balancing randomly is quite unscientific: it imposes unnecessary load on the weaker machines, even risking their downtime, while the high-performance database servers cannot exercise their full physical capacity. With this in mind we introduce weighted load balancing: through interfaces inside our system, each DB server is assigned a weight, and at run time the LB allocates load to servers in proportion to their weights within the cluster. Of course, introducing this concept undoubtedly increases the system's complexity and maintenance burden; there is no escaping that trade-off.
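One possible shape for the weighted policy just described is the classic weighted random selection below. This is a sketch with invented names, not the article's actual implementation: a slave registered with weight 2w is chosen roughly twice as often as one with weight w.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class WeightedSlaveBalancer {
    private final List<String> slaves = new ArrayList<>();
    private final List<Integer> weights = new ArrayList<>();
    private int totalWeight = 0;
    private final Random random = new Random();

    /** Registers a slave with a weight proportional to its hardware capacity. */
    public void addSlave(String name, int weight) {
        slaves.add(name);
        weights.add(weight);
        totalWeight += weight;
    }

    /** Picks a slave at random, biased by weight. */
    public String pick() {
        int r = random.nextInt(totalWeight); // 0 .. totalWeight-1
        for (int i = 0; i < slaves.size(); i++) {
            r -= weights.get(i);
            if (r < 0) {
                return slaves.get(i); // landed inside slave i's weight band
            }
        }
        return slaves.get(slaves.size() - 1); // unreachable for positive weights
    }
}
```

Plain random balancing is the special case where every weight is equal.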
With sharded databases, clusters, and load balancing, is everything fine? Things are far less simple than we imagine. These mechanisms basically ensure our data layer can withstand great pressure, but the design still cannot fully avoid the harm of a database going down. If slave2 in Group1 crashes, the system's LB does not know it. This is actually very dangerous: unaware that slave2 is unavailable, the LB keeps assigning it load, and the client naturally encounters failed data operations or exceptions. That is unfriendly! How do we solve it? We introduce an availability detection mechanism for cluster nodes, or an availability push mechanism. What is the difference between the two? First, the detection mechanism: as the name suggests, the data layer client probes, from time to time, the availability of each database in the cluster. The implementation can retry the connection, or attempt to reach the database port; one can also retry a JDBC connection and use Java's exception mechanism to judge availability, with details given later in the text. And the push mechanism? This is best discussed in a real scenario: when a database goes down, the DBA certainly knows, and at that moment the DBA pushes the database's current state, through a program, to the client, that is, the application side of the distributed data layer. The client updates a local list of DB statuses and tells the LB that this database node is unusable and must receive no load. One is an active monitoring mechanism, the other a passive notification mechanism; each has its strengths, but both achieve the same effect. With them in place, the hypothetical problem will not occur, and even if it does, its probability is minimized.
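The probe-and-status-table idea might be sketched as follows. The probe is pluggable here so the sketch stays self-contained; in production it might be a JDBC getConnection() attempt wrapped in try/catch. All names are illustrative, and the push() method models the DBA-driven notification variant.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

public class AvailabilityMonitor {
    private final Map<String, Boolean> status = new ConcurrentHashMap<>();
    private final Predicate<String> probe; // returns true if the node answers

    public AvailabilityMonitor(Predicate<String> probe) {
        this.probe = probe;
    }

    /** One detection round; in practice this would run on a timer thread. */
    public void checkAll(Iterable<String> nodes) {
        for (String node : nodes) {
            status.put(node, probe.test(node));
        }
    }

    /** The LB consults this before assigning load to a node. */
    public boolean isAvailable(String node) {
        return status.getOrDefault(node, false);
    }

    /** Push variant: the status is set from outside (e.g. by the DBA's tool). */
    public void push(String node, boolean up) {
        status.put(node, up);
    }
}
```

Whether the status table is filled by active probing or passive pushes, the load balancer's job is the same: skip any node whose status is down.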
The Master and Slave mentioned above have not yet been explained in depth. A Group consists of one Master and N Slaves. Why? The Master carries the write load: all write operations are performed on the Master, while read operations are distributed among the Slaves, which greatly improves read efficiency. In typical Internet applications, surveys of data usage conclude that the read/write ratio is heavily skewed toward reads, that is, the great majority of data operations are reads; this is why we keep multiple Slaves. But why separate reads from writes at all? Developers familiar with databases know that write operations involve locks, whether row locks, table locks, or block locks, all of which reduce execution efficiency. Concentrating the writes on one node while N nodes serve the reads effectively improves read efficiency and thereby the system's availability. Read/write splitting also introduces a new problem: how is the data on the Master synchronized and kept consistent with the other Slave machines in the cluster? This is something we need not pay too much attention to, since MySQL's replication mechanism can do it for us; and because it is not closely related to this topic, we will not detail it here.
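The dispatch rule described above (writes and freshness-critical reads go to the Master, other reads go to a Slave via the LB) could be sketched like this. The keyword check is a deliberate simplification of real SQL analysis, and the names are invented for the sketch:

```java
import java.util.Locale;

public class ReadWriteSplitter {
    /** Returns "master" or "slave" for a given SQL statement. */
    public static String target(String sql, boolean requiresFreshData) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        boolean isRead = head.startsWith("select");
        if (!isRead || requiresFreshData) {
            return "master"; // all writes, plus reads needing strong freshness
        }
        return "slave";      // ordinary reads are spread over the N slaves
    }
}
```

A real implementation would also have to consider statements like stored-procedure calls and SELECT ... FOR UPDATE, which this sketch ignores.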
To sum up, the above constitutes the general functionality of the distributed data layer studied in this topic, along with some discussion and explanation of its basic principles. Next, we carry out in-depth analysis and research at the system design level.
Chapter 4 System Design
4.1 system implementation-Level Selection
As mentioned in the introduction, the system can be implemented at either of two levels: on top of the JDBC layer, or on top of an existing data persistence framework such as Hibernate or iBatis. Each level has its own strengths and weaknesses. Building at the JDBC level would greatly increase the difficulty of development and later use, raising both development and maintenance costs. This topic instead positions itself as an upper-layer encapsulation over the mature iBatis persistence framework, rather than a direct modification of the iBatis source code; this keeps the system non-invasive toward the existing framework and also increases flexibility of use. The reasons for choosing iBatis are as follows:
(1) iBatis has a low learning cost: a skilled Java programmer can use it proficiently in a very short time;
(2) iBatis is a lightweight ORM that completes object-relational mapping in a simple way; its query statements are plain native SQL configured in sql-map.xml files. In other words, it introduces no extra query language such as Hibernate's HQL, which strengthens control over the SQL: a capable DBA can optimize at the SQL level, making the data-layer application highly controllable. Hibernate is powerful, but it encapsulates the OR mapping heavily and introduces the HQL concept, which makes it hard for a DBA team to control and tune the SQL statements.
Based on these two reasons, this topic selects the easy-to-learn, lightweight persistence framework iBatis as its ORM. The following discussion is likewise based on iBatis.
4.2 selection of other open-source frameworks
In large Java applications, we usually use open-source frameworks such as Spring, especially its IoC (DI) container, which effectively helps developers manage object dependencies and layering and reduces the physical coupling between the system's levels. Spring's advantages and usefulness are well known to developers and need no elaboration here. The data layer of this project also uses Spring as its IoC (DI) framework.