A MySQL Storage and Access Solution for Massive Data


Chapter 1 Introduction

With the widespread adoption of Internet applications, storing and accessing massive amounts of data has become the bottleneck of system design. For a large Internet application, billions of page views per day put a very high load on the database, which creates serious problems for the stability and scalability of the system. Improving site performance through data sharding and scaling the data layer horizontally has become the architects' preferred approach. Horizontally partitioning the database reduces the load on each machine and minimizes the damage caused by an outage; a load-balancing strategy effectively reduces the access load on each machine and the likelihood of it going down; a cluster scheme solves the problem of the database becoming unreachable when a single-point database fails; and a read/write separation strategy maximizes the speed and concurrency of reads in the application. At present, large Internet applications in China, such as Taobao, Alibaba, and Tencent, make heavy use of such data sharding schemes, and most of them have implemented their own distributed data access layer (DDAL). Classified by the layer at which they are implemented, there are roughly two levels (taking Java applications as an example): encapsulation at the JDBC layer, and implementation at the ORM framework layer. For direct encapsulation of the JDBC layer, a relatively well-developed domestic project is "Amoeba", developed by a research institute of the Alibaba Group; it is still in beta, and its runtime efficiency and readiness for production remain to be examined. For implementation at the ORM framework layer, Taobao's distributed data access layer, built on iBatis and Spring, has been in use for years, and its runtime efficiency and production effectiveness have been affirmed by developers and users. This paper discusses a distributed data access layer built at the ORM framework layer. The difficulty of this topic lies in formulating and choosing the routing rules and in later extensibility, for example how to expand database capacity (add machine nodes) with the least possible data migration. The core issues revolve around the routing rules for database and table sharding and the load-balancing policies.

Chapter 2 Basic Principles and Concepts

2.1 Fundamentals

The process by which people come to understand a problem is always the same: what, why, and how. This article discusses and studies the following three questions in that order:

2.1.1 What Is Data Sharding

The word "Shard" means "fragments" in English, and the technical terminology associated with the database seems to be the earliest seen in massively multiplayer online role-playing games. "Sharding" is called "Shard". Sharding is not a new technology, but a relatively simple concept of software. It is well known that after MySQL 5, the data table partitioning feature, before this, many of the potential users of MySQL has a concern for MySQL scalability, and whether the partitioning function is a measure of the scalability of a database is a key indicator (not the only indicator). Database extensibility is an eternal topic, and MySQL advocates often ask: how do you handle processing of application data on a single database that needs to be partitioned and so on?  The answer is: sharding. Sharding is not a feature attached to a particular database software, but an abstraction on top of specific technical details, a solution for horizontal scaling (scale out, or scale-out, and scale-out), with the main purpose of exceeding the I/O capability limits of a single-node database server. Resolve database extensibility issues.

Data is distributed horizontally across different databases or tables according to a set of sharding rules, and a query locates the specific database or table it needs through the corresponding DB routing or table routing rule. "Sharding" here usually refers to "horizontal partitioning", which is the focus of this paper. What do concrete sharding and routing schemes look like? At this point the reader will inevitably have questions, so here is a simple example. Take the log (article) table of a blog application, which has the following fields:


article_id (int), title (varchar), content (varchar), user_id (int)

Faced with such a table, how do we shard it? How do we distribute this data across tables in different databases? Analyzing the blog application, it is not hard to conclude that it has two kinds of users: visitors and blog owners. A visitor browses a particular user's blog, and an owner manages his own blog; in both cases the operations happen within a specific user's space. That "specific user" is represented in the database by the field user_id, and user_id is exactly the basis for our sharding rule. We can put all article rows with user_id in 1~10000 into the article table of DB1, all rows with user_id in 10001~20000 into DB2, and so on. In this way the article data is naturally split across databases, achieving the goal of data sharding. The next problem is how to find the specific database. Since we used user_id as the distinguishing field when sharding, it is natural that the database routing process also starts from user_id. Consider the blog application again: whether a request visits someone else's blog or manages one's own, we always know whose blog it is, that is, we know its user_id. Using this user_id and the sharding rule we can locate the specific database: if user_id is 234, the rule points to DB1; if user_id is 12343, it points to DB2; and so on. This process of using the sharding rule to route back to a specific DB is what we call "DB routing".
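A minimal sketch of this range-based DB routing, under the assumption of 10,000 user_ids per database (the class and method names are illustrative, not part of any existing framework):

```java
/**
 * Minimal sketch of range-based DB routing for the blog example.
 * Assumes each database holds a block of 10,000 consecutive user_ids:
 * user_id 1~10000 -> DB1, 10001~20000 -> DB2, and so on.
 */
public class RangeDbRouter {

    private static final int USERS_PER_DB = 10_000;

    /** Returns the 1-based index of the database that stores this user's articles. */
    public int routeToDb(long userId) {
        if (userId < 1) {
            throw new IllegalArgumentException("user_id must be positive");
        }
        return (int) ((userId - 1) / USERS_PER_DB) + 1;
    }

    public static void main(String[] args) {
        RangeDbRouter router = new RangeDbRouter();
        System.out.println(router.routeToDb(234));    // 1 -> DB1
        System.out.println(router.routeToDb(12343));  // 2 -> DB2
    }
}
```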

Of course, a sharded DB design is necessarily an unconventional, unorthodox design. So what does an orthodox DB design look like?

It is basically what we use every day: we consciously design the schema according to normal forms, and under heavy load we may use replication to improve read/write throughput and performance. This already satisfies many requirements, although the shortcomings of that mechanism are fairly obvious (discussed below). I said "consciously design according to normal forms" because a sharded design inevitably violates the usual rules and constraints: in order to shard, we must keep a redundant field in every table to act as the distinguishing field or sharding marker, such as user_id in the article example above (admittedly that example does not illustrate the redundancy of user_id very well, because user_id would have to appear even without sharding, so we got it for free). Redundant fields do not appear only in sharding scenarios, of course; in many large applications redundancy is also necessary, but that touches on efficient DB design in general and is beyond the scope of this paper.

2.1.2 Why Shard Data

The previous section briefly described what data sharding is; the reader may wonder why it is needed. Isn't a mature and stable database like Oracle enough to store and query massive amounts of data? Why do we need data sharding at all? Indeed, Oracle is mature and stable, but the high licensing cost and the high-end hardware it requires are not something every company can afford. Imagine spending tens of millions per year on licenses and tens of millions more on minicomputers as the hardware platform; how many ordinary companies can pay that? And even if they could, why not choose a cheaper solution with better horizontal scalability if one exists?

Things are rarely that simple, however. Usually we consciously design the schema according to normal forms and, under load, use replication to improve read/write throughput; this already satisfies many requirements, but the shortcomings of that mechanism are obvious. First, its effectiveness depends on the proportion of read operations: the master often becomes the bottleneck, write operations queue up on it, and once the master is overloaded the slaves' replication lag can become large; replication also consumes considerable CPU, because every write executed on the master must be re-executed on every slave. At that point replication becomes of little value. If replication does not hold up, why can sharding work? The reason is simple: it scales well. Every machine, no matter how well configured, has its physical limits; when our application reaches or exceeds the capacity of a single machine, we can only turn to more machines or keep upgrading the hardware, and the common choice is to scale out by adding machines to share the load. We must also ask: as the business grows, can a linearly growing number of machines keep up with demand? Sharding distributes computation, storage, and I/O across multiple machines in parallel, fully exploiting their combined processing power while avoiding a single point of failure, improving availability, and providing good fault isolation.

Taking these factors together, data sharding is necessary, and the data sharding discussed here uses MySQL as the background; for cost reasons many companies also choose the free and open-source MySQL. Developers familiar with MySQL know that table partitioning only became available in MySQL 5.1; before that, many potential users were concerned about MySQL's scalability, and the availability of partitioning is one key indicator (though of course not the only one) when judging a database's scalability. Database scalability is an eternal topic, and MySQL advocates are often asked: how do you handle application data on a single database that needs to be partitioned? The answer, again, is sharding, the data splitting scheme we have been talking about.

With free MySQL and cheap servers, or even a cluster of PCs, we can achieve the effect of a minicomputer plus a commercial enterprise database, reducing capital investment and operating costs. Why not? So we choose sharding and embrace sharding.

2.1.3 How to Shard Data

Now let us discuss the methods and forms of data sharding in more detail.

Data sharding can be physical: data is distributed to different DB servers according to a set of sharding rules, and routing rules direct each access to a specific database, so every access faces not a single server but one of n servers, which reduces the load on each machine.

Data sharding can also happen inside one database: sharding rules distribute the data across different tables of the same database, for example splitting article into sub-tables such as article_001 and article_002, which laid side by side make up the complete logical article table. This is actually very simple to do. Suppose the article table now has 50 million rows and we need to insert a new row; after the insert the database has to update the index on this table, and the overhead of maintaining an index over 50 million rows is not negligible. If instead we split the table into 100 sub-tables, article_001 through article_100, each sub-table holds only 500,000 rows, and the cost of maintaining the index after each insert drops by an order of magnitude, greatly improving the DB's runtime efficiency and concurrency. And those are not the only benefits of splitting tables; there are many other obvious gains, for example around the locking caused by write operations.
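A minimal sketch of such table-level routing, assuming the sub-table is chosen by taking the sharding key (user_id, by assumption) modulo the number of sub-tables; the class name is illustrative only:

```java
/**
 * Minimal sketch of in-database table sharding: the logical "article" table
 * is split into article_001 .. article_100, and the sub-table for a row is
 * chosen by a modulo on the sharding key (assumed here to be user_id).
 */
public class ArticleTableRouter {

    private static final int TABLE_COUNT = 100;

    /** Returns the physical sub-table name, e.g. "article_044". */
    public String routeToTable(long userId) {
        int index = (int) (userId % TABLE_COUNT) + 1;   // 1 .. 100
        return String.format("article_%03d", index);
    }

    public static void main(String[] args) {
        ArticleTableRouter router = new ArticleTableRouter();
        // The generated SQL would then target the resolved table, e.g.
        // INSERT INTO article_044 (article_id, title, content, user_id) VALUES (...)
        System.out.println(router.routeToTable(12343)); // article_044
    }
}
```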

In short, splitting databases reduces the load on any single machine, and splitting tables improves the efficiency of data operations, especially writes. But we have not yet answered how exactly to split. Next, we describe and explain the sharding rules in detail.

As mentioned above, to shard data horizontally, every table must contain a redundant field that serves as the sharding basis and marker; in typical applications we choose user_id as this distinguishing field. Based on it, there are the following three sharding methods and rules (other approaches are of course possible; see the code sketch after the list):

(1) By number range: user_id is the distinguishing field; user_ids 1~1000 map to DB1, 1001~2000 map to DB2, and so on;

Advantages: data can be migrated range by range

Disadvantages: the data may be unevenly distributed

(2) By hash and modulo: hash the user_id (or, if user_id is numeric, use its value directly), then take it modulo a specific number. For example, if the application needs to split one database into four, we take user_id modulo 4 (user_id % 4), which gives four possible results: a result of 1 maps to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. This distributes the data across the four DBs very evenly.

Advantages: the data is distributed evenly

Disadvantages: data migration is cumbersome, and load cannot be apportioned according to machine performance

(3) Storing the mapping in a dedicated lookup database: a separate DB is set up that only stores the mapping from user_id to DB. Each database access first queries this DB to obtain the target DB, and then performs the actual query there.

Advantages: very flexible; a one-to-one mapping

Disadvantages: every query requires one extra lookup beforehand, so performance suffers considerably
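The following sketch illustrates rules (2) and (3); the class names and the in-memory map are assumptions for illustration only, and in a real system the mapping of rule (3) would live in the dedicated lookup database rather than in memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal sketches of sharding rules (2) and (3); names are illustrative only. */
public class ShardRouters {

    /** Rule (2): hash/modulo routing over a fixed number of databases. */
    public static int hashModuloRoute(long userId, int dbCount) {
        // For a numeric user_id the value itself can be used; otherwise hash it first.
        int remainder = (int) (userId % dbCount);        // 0 .. dbCount-1
        return remainder == 0 ? dbCount : remainder;     // map 0 to the last DB, as in the text
    }

    /** Rule (3): routing through a mapping kept in a dedicated lookup DB. */
    public static class LookupRouter {
        // Stand-in for the lookup database: user_id -> DB index.
        private final Map<Long, Integer> userToDb = new ConcurrentHashMap<>();

        public void register(long userId, int dbIndex) {
            userToDb.put(userId, dbIndex);
        }

        public Integer route(long userId) {
            return userToDb.get(userId);  // null means the user is not yet mapped
        }
    }

    public static void main(String[] args) {
        System.out.println(hashModuloRoute(234, 4));   // 234 % 4 = 2 -> DB2
        LookupRouter lookup = new LookupRouter();
        lookup.register(234L, 1);
        System.out.println(lookup.route(234L));        // DB1 according to the stored mapping
    }
}
```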

These are the three approaches usually chosen in development; some more complex projects may mix all three. Through the description above we now have a basic understanding of sharding rules. There are certainly better and more complete ways to shard, which remain to be explored and discovered.

Chapter 3 Outline of This Research Topic

In the text above, following the way people come to understand things (what, why, how), we summarized some concepts behind database sharding, its significance, and some common sharding rules. The distributed data layer discussed in this topic is more than that, however; it is a complete data-layer solution. What exactly is it? In the following text I will describe the overall idea and implementation of this research topic.

The distributed data scheme provides the following features:

(1) It provides sharding rules and routing rules (RouteRule, RR for short); the three sharding rules described above are embedded directly in the system, and the specific way they are embedded is described and discussed in detail later;

(2) It introduces the concept of a cluster (group) to guarantee high availability of the data;

(3) It introduces a load-balancing policy (LoadBalancePolicy, LB for short);

(4) It introduces an availability detection mechanism for cluster nodes, so that single machines can be checked for availability; this ensures the LB policy is applied correctly and guarantees high stability of the system;

(5) It introduces read/write separation to improve query speed.

A data layer that only shards databases is not good enough. What happens when the DB server on some node goes down? With a sharded design, a complete DB is made up of n machines, so if one machine goes down only one nth of the data becomes inaccessible; that is acceptable, and certainly much better than before sharding, when the whole DB would become unreachable. In an ordinary application, data being unavailable because of such a machine failure may be tolerable, but what if our system is a high-concurrency e-commerce site? The economic loss caused by a single-node outage would be very serious. In other words, our scheme still has a problem: its fault tolerance does not stand the test. Of course, a problem always has a solution. We introduce the concept of a cluster, which here I call a group: for each sharded node we deploy several machines, each holding the same data; under normal conditions these machines share the load, and when one of them goes down the load balancer distributes the load to the machines that are still up. This solves the fault-tolerance problem, so we introduce the cluster concept and embed it in our framework as part of the framework.


In this design the entire data layer consists of three clusters, Group1, Group2, and Group3; the three clusters are the result of horizontal data sharding and together make up one complete DB. Each group consists of one master (there can of course be more than one) and N slaves whose data is kept consistent with the master's. If, say, one slave in Group1 goes down, two slaves are still available; with such a model no part of the data ever becomes inaccessible unless every machine in a whole group goes down at once, and the probability of that (short of a power outage) is very small.

Before the cluster was introduced, one query roughly went like this: the application requests the data layer and passes the distinguishing sharding field (typically user_id); the data layer routes to a specific DB according to that field and performs the data operation inside that determined DB. That is the picture without a cluster; what does it look like once a cluster is introduced? Our rule is as follows: the load balancer analyzes whether the current SQL is a read or a write. If it is a write, or an operation that requires strong real-time consistency, the load is sent directly to the master; if it is a read, a slave is chosen through the load-balancing policy. The main concern of our load balancer is the distribution policy; typical options are random load balancing and weighted load balancing. Random load balancing is easy to understand: one slave is picked at random out of n. Such random balancing ignores machine performance and implicitly assumes every machine performs the same. If that is really the case, it is fine; but what if it is not? If the physical performance and configuration of each slave differ, random balancing that ignores performance is quite unscientific: it imposes unnecessarily high load on weaker machines, which may even risk going down, while high-performance database servers cannot use their full capacity. With this in mind we introduce weighted load balancing: through an interface of our system, each DB server can be assigned a weight, and at runtime the LB allocates load to each DB server in the cluster in proportion to its weight. Of course, introducing this concept undoubtedly increases the complexity and maintenance cost of the system; something is always lost, and there is no way around that.
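Under this rule, a minimal sketch of weighted random selection among slaves might look as follows; the class and method names are illustrative assumptions, not the API of any existing framework:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal sketch of a weighted random load-balancing policy:
 * each slave is registered with a weight, and the probability of
 * being selected is proportional to that weight.
 */
public class WeightedLoadBalancer {

    /** A slave database node with an administrator-assigned weight. */
    public static class SlaveNode {
        private final String name;
        private final int weight;

        public SlaveNode(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        public String getName() { return name; }
        public int getWeight() { return weight; }
    }

    private final List<SlaveNode> slaves = new ArrayList<>();
    private int totalWeight = 0;

    public void addSlave(String name, int weight) {
        slaves.add(new SlaveNode(name, weight));
        totalWeight += weight;
    }

    /** Picks a slave at random, with probability proportional to its weight. */
    public SlaveNode select() {
        if (slaves.isEmpty()) {
            throw new IllegalStateException("no slaves registered");
        }
        int ticket = ThreadLocalRandom.current().nextInt(totalWeight);
        for (SlaveNode node : slaves) {
            ticket -= node.getWeight();
            if (ticket < 0) {
                return node;
            }
        }
        return slaves.get(slaves.size() - 1); // not reached in practice
    }

    public static void main(String[] args) {
        WeightedLoadBalancer lb = new WeightedLoadBalancer();
        lb.addSlave("slave1", 1); // weaker machine, smaller share of the reads
        lb.addSlave("slave2", 3); // stronger machine, roughly three times the share
        System.out.println(lb.select().getName());
    }
}
```

A purely random policy is then simply the special case in which every node is given the same weight.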

With sharding, clusters, and a load balancer, is everything fine? Things are far less simple than we would like. With these pieces we can basically ensure that our data layer withstands heavy pressure, but such a design still cannot completely avoid the harm of a database outage. If slave2 in Group1 goes down and the system's LB does not know, the situation is actually quite dangerous: because the LB is unaware, it still treats slave2 as available and keeps assigning load to it. The result is that the client naturally gets errors or exceptions and the data operations fail, which is very unfriendly. How do we solve this? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability push mechanism. What is the difference between the two? The detection mechanism means that the data-layer client itself, at intervals, probes each database in the cluster: it can try to open a connection or reach the database port, or it can attempt a JDBC connection and use Java's exception mechanism to judge availability, as discussed further below. What about the push mechanism? Consider it in a real application scenario: when a production database goes down, the DBA certainly knows; at that moment the DBA pushes the current state of the database, through a program, to the client side, that is, to the application side of the distributed data layer, which then updates a local list of DB states and tells the LB that this database node cannot be used and should receive no load. One is an active monitoring mechanism, the other a passive notification mechanism; each has its strengths, but both achieve the same effect, so the hypothetical problem above either will not occur or will occur with minimal probability.
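A minimal sketch of the detection side, using plain JDBC; the URL, credentials, and timeout below are placeholders, and a real data layer would run this probe on a schedule and feed the result into the LB's list of usable nodes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Minimal sketch of node availability detection via JDBC:
 * periodically try to open a connection and validate it; a
 * SQLException (or failed validation) marks the node as down.
 */
public class NodeProbe {

    /** Returns true if a connection can be opened and validated within 3 seconds. */
    public static boolean isAvailable(String jdbcUrl, String user, String password) {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            return conn.isValid(3);
        } catch (SQLException e) {
            // Could not connect, or the connection is unusable: treat the node as down.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL and credentials for illustration only.
        boolean up = isAvailable("jdbc:mysql://slave2.example:3306/blog", "app", "secret");
        System.out.println("slave2 available: " + up);
    }
}
```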

The text above mentioned master and slave without much explanation. As described, a group consists of one master and N slaves. Why do it this way? The master carries the write load, meaning all writes are performed on the master, while read operations are distributed across the slaves. This greatly improves read efficiency. In a typical Internet application, surveys suggest the read-to-write ratio is roughly 10:1, meaning the bulk of data operations are reads, which is why we have multiple slaves. But why separate reads from writes at all? Developers familiar with databases know that writes involve locking, whether row locks, table locks, or block locks, all of which reduce system efficiency. Our separation concentrates writes on one node while the other n nodes serve reads, which further improves read efficiency and helps keep the system highly available. Read/write separation also brings new problems, for example: how is the data on the master kept synchronized and consistent with the other slave machines in the cluster? That is not something we need to pay much attention to here; MySQL's proxy/replication mechanism can do it for us, and since that mechanism is not closely related to this topic we will not elaborate on it.
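Putting the pieces together, a minimal sketch of read/write dispatch within one group; the class names are the illustrative ones used in this article, not a real framework, and the slave selector is assumed to come from a load balancer such as the weighted one sketched earlier:

```java
import java.util.function.Supplier;

/**
 * Minimal sketch of read/write separation within one group:
 * all writes go to the master, reads go to a slave chosen by
 * the load-balancing policy.
 */
public class GroupDispatcher {

    private final String master;
    private final Supplier<String> slaveSelector; // e.g. () -> weightedLb.select().getName()

    public GroupDispatcher(String master, Supplier<String> slaveSelector) {
        this.master = master;
        this.slaveSelector = slaveSelector;
    }

    /** Decides which node in the group should execute this statement. */
    public String targetFor(String sql) {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        // Writes, and any operation needing strong real-time consistency,
        // are sent to the master; plain reads go to a slave.
        return isRead ? slaveSelector.get() : master;
    }

    public static void main(String[] args) {
        GroupDispatcher group1 = new GroupDispatcher("master1", () -> "slave2");
        System.out.println(group1.targetFor("SELECT * FROM article_044 WHERE user_id = 12343")); // slave2
        System.out.println(group1.targetFor("INSERT INTO article_044 (title) VALUES ('x')"));    // master1
    }
}
```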
