MySQL Massive Data Storage and Access Solutions


Chapter 1: Introduction

With the wide adoption of Internet applications, the storage and access of massive data has become the bottleneck of system design. For a large Internet application, billions of page views (PV) per day place a considerable load on the database and pose serious problems for the stability and extensibility of the system. Improving site performance through data segmentation, and thereby scaling the data layer horizontally, has become the preferred approach for architects. Horizontally partitioning the database reduces the load on any single machine and minimizes the damage caused by an outage; a load balancing strategy further reduces the access load on each machine and lowers the chance of downtime; a cluster scheme removes the single point of failure, so that a downed database server no longer makes the data inaccessible; and a read/write separation strategy maximizes the speed and concurrency of read operations. Large Internet companies in China, such as Taobao, Alibaba, and Tencent, make heavy use of such data segmentation schemes, and most have built their own distributed data access layer (DDAL). By implementation level, these roughly fall into two categories (taking Java applications as the example): encapsulation at the JDBC layer, and implementation at the ORM framework layer. For direct JDBC-layer encapsulation, a relatively well-developed domestic project is "Amoeba", developed by a research institute of the Alibaba Group; it is still in beta, and its efficiency and production readiness remain to be proven. As for the ORM framework layer, Taobao's distributed data access layer, based on iBatis and Spring, has been in use for years, and its efficiency and production effectiveness have been affirmed by developers and users.
This paper discusses a distributed data access layer built at the ORM framework layer. The difficult part is the design and choice of the routing rules after the database is partitioned, for example how to expand database capacity (add machine nodes) with minimal data migration. The core issues are the routing rules for database and table partitioning and the load balancing strategy.

Chapter 2: Basic Principles and Concepts

2.1 Basic Principles

The human process of understanding a problem always follows the sequence: what -> why -> how. Accordingly, this article discusses and studies data segmentation by answering these three questions in turn:

2.1.1 What Is Data Segmentation?

The English word "shard" means "a fragment"; as a database term it seems to have first appeared in large multiplayer online role-playing games, and "sharding" is accordingly translated as "fragmentation". Sharding is not a new technology, but a relatively simple software concept. As is well known, table partitioning only became available with MySQL 5, and before that many potential MySQL users were concerned about MySQL's scalability; whether partitioning is available is one key indicator (certainly not the only one) of a database's scalability. Database scalability is an eternal topic, and MySQL's promoters are often asked how to handle, on a single database, application data that needs to be partitioned. The answer is: sharding. Sharding is not a feature of any particular database product, but an abstraction over specific technical details; it is a solution for horizontal scaling (scale out), whose primary purpose is to break through the I/O limits of a single database server node and thereby solve the problem of database scalability.

Data is distributed horizontally across different databases or tables according to a set of segmentation rules, and the corresponding DB-routing or table-routing rules are then used to find the specific database or table that a query needs, in order to carry out the operation. "Sharding" here usually refers to "horizontal partitioning", which is also the focus of this article. What do the concrete segmentation and routing look like? The reader will inevitably have questions at this point, so let us give a simple example using the log of a blog application. Suppose the log article table has the following fields:


article_id (int), title (varchar (128)), content (varchar (1024)), user_id (int)

How do we partition such a table, and how do we distribute its data into tables in different databases? Analyzing the blog application, it is not hard to reach this conclusion: there are two kinds of users, visitors and blog owners. A visitor browses a specific user's blog, and an owner manages his own blog, which is likewise an operation within a specific user's space. The "specific user", in database terms, is identified by the field user_id, and this user_id is exactly the basis of the partitioning rule we need. We can proceed like this: put all article rows with user_id 1~10000 into the article table in DB1, all rows with user_id 10001~20000 into the article table in DB2, and so on up to DBn. In this way the data is naturally divided among the databases, achieving the goal of data segmentation. The next problem to solve is how to find the specific database again. The answer is simple and obvious: since the field we used to distinguish the data is user_id, the process of database routing naturally still depends on user_id. Consider the blog application just presented: whether visiting someone else's blog or managing one's own, we always know whose blog it is, that is, we know the user_id. Using this user_id and the partitioning rule, we can locate the specific database: for example, user_id 234 maps to DB1 by the rule, and user_id 12343 maps to DB2. This process of applying the partitioning rule in reverse to locate a specific DB is what we call "DB routing".
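The range-based routing just described can be sketched in a few lines of Java. This is a minimal illustrative sketch, not code from any real framework; the class and method names (`DbRouter`, `route`) are invented for the example.

```java
// Hypothetical sketch of the range-based DB routing described above:
// user_id 1~10000 -> DB1, 10001~20000 -> DB2, ..., in blocks of 10000.
public class DbRouter {
    private static final int USERS_PER_DB = 10000;

    /** Returns the database number (1-based) that holds this user's articles. */
    public static int route(int userId) {
        return (userId - 1) / USERS_PER_DB + 1;
    }
}
```

With this rule, route(234) yields 1 (DB1) and route(12343) yields 2 (DB2), matching the examples in the text.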

Of course, a DB design that takes data segmentation into account is bound to be unconventional, unorthodox. So what does an orthodox DB design look like?

It is essentially what we usually do: consciously design the database according to normal forms, and under high load perhaps use a replication mechanism to improve read/write throughput and performance. This may already satisfy many requirements, but the shortcomings of that mechanism are fairly obvious (discussed below). In contrast, a DB design aimed at data segmentation violates the usual rules and constraints: in order to partition, we have to keep redundant fields in the tables to serve as the distinguishing field, or tag field, of the partitioning rule, such as user_id in the example above. (Admittedly that example does not show the redundancy of user_id very well, since user_id would appear in the article table even without partitioning; there we simply got it for free.) Redundant fields do not appear only in partitioning scenarios; in many large applications redundancy is necessary, but that concerns efficient DB design in general and is not elaborated further here.

2.1.2 Why Do We Need Data Segmentation?

The above describes and explains what data segmentation is, and the reader may wonder why it is needed at all. Is a mature and stable database such as Oracle not sufficient to support the storage and querying of massive data? Why slice the data? Indeed, Oracle is mature and stable, but its high licensing cost and the high-end hardware it requires are not affordable for every company. Imagine usage fees of tens of millions per year, and minicomputers costing tens of millions as hardware support: can an ordinary company afford that? And even if it could, when a better, cheaper, horizontally scalable solution exists, why not choose it?

However, things are never entirely satisfactory. Normally we consciously design our database according to normal forms, and under high load we may use a replication mechanism to improve read/write throughput and performance. That may already satisfy many needs, but the shortcomings of this mechanism are fairly obvious. First, its effectiveness depends on the proportion of read operations: the master easily becomes the bottleneck, write operations must queue up to execute, and under overload the master suffers first; the replication delay of the slaves may also grow large; and replication consumes considerable CPU power, because every write executed on the master must be executed once more on each slave. At this point one might ask: if replication is so uncertain, why would sharding work? The reason is simple: it scales well. Every machine, no matter how well configured, has its physical limits. When our application reaches or exceeds the limit of a single machine, we must either keep upgrading the hardware or enlist more machines to help, and the common solution is to scale horizontally, sharing the pressure by adding machines. We must also consider: as our business logic grows, can our machines keep up with demand through linear growth? Sharding can easily distribute computation, storage, and I/O across many machines, fully exploiting their combined processing capacity while avoiding single points of failure, improving system availability, and providing good fault isolation.

Taking all of the above factors together, data segmentation is necessary, and the data segmentation discussed here is in the context of MySQL. For cost reasons, many companies have chosen the free and open-source MySQL. Developers familiar with MySQL know that table partitioning only became available with MySQL 5, so many potential MySQL users were concerned about its scalability, partitioning being one key indicator (though certainly not the only one) of whether a database scales. Database scalability is an eternal topic, and MySQL's promoters are often asked how to handle, on a single database, application data that needs to be partitioned. The answer, again, is sharding, which is exactly the data segmentation scheme we have been discussing.

Using free MySQL on cheap servers, or even a PC cluster, we can achieve the effect of minicomputers plus a large commercial database while reducing capital investment and operating costs. Why not? So we choose sharding, and embrace sharding.

2.1.3 How to Do Data Segmentation

Having introduced data segmentation, let us now explain its methods and forms in more detail.

Data segmentation can be physical: the data is distributed to different DB servers according to a set of segmentation rules, and routing rules direct each access to the specific database, so that every access lands not on one single server but on one of n servers, thereby reducing the load pressure on each machine.

Data segmentation can also happen within one database: a set of segmentation rules distributes the data among different tables of the same database, for example splitting article into child tables article_001, article_002, and so on, with several child tables together making up logically one complete article table. The reasoning is simple. Suppose the article table now holds 50 million rows and we need to insert a new row; after the insert, the database must rebuild the indexes for this table, and the system overhead of indexing 50 million rows is not negligible. Conversely, if we split the table into 100 tables, article_001 through article_100, the 50 million rows are spread out so that each child table holds only 500,000 rows, and inserting into a table of 500,000 rows reduces the indexing time by an order of magnitude, greatly improving the DB's runtime efficiency and concurrency. Splitting tables brings further benefits beyond these, for example in lock operations and the like.
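The child-table routing just described can be sketched as follows. This is an illustrative sketch under the assumption that rows are assigned by article_id modulo the table count; the name `TableRouter` is invented for the example.

```java
// Hypothetical sketch of in-database table partitioning: route an article_id
// to one of the 100 child tables article_001 .. article_100 described above.
public class TableRouter {
    private static final int TABLE_COUNT = 100;

    /** Returns the child table name for a given article_id. */
    public static String tableFor(long articleId) {
        long index = articleId % TABLE_COUNT + 1; // 1 .. 100
        return String.format("article_%03d", index);
    }
}
```

The data layer would substitute the returned name into the SQL statement before executing it against the database.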

In summary, partitioning into databases reduces the load on single machines, and partitioning into tables improves the efficiency of data operations, especially writes. We have still not answered the question of how to partition. Next, we elaborate and explain the segmentation rules in detail.

As mentioned above, to achieve horizontal data segmentation, each table should carry a redundant field to serve as the partitioning basis and tag field; applications usually use user_id as this distinguishing field. On this basis, there are the following three common partitioning approaches and rules (other approaches are of course possible):

(1) Partition by number segment:

user_id 1~1000 corresponds to DB1, 1001~2000 corresponds to DB2, and so on.

Advantage: partial migration is possible.

Disadvantage: data distribution is uneven.

(2) Hash and take the modulus:

Hash the user_id (or, if user_id is numeric, use its value directly), then take the modulus by a specific number. For example, if the application needs to split one database into 4, we take the user_id (or its hash) modulo 4, i.e. user_id % 4, which has four possible results: 1 corresponds to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. This distributes the data very evenly across the 4 DBs.
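Rule (2) can be sketched as follows. This is an illustrative sketch with invented names (`HashModRouter`, `route`); the remainder-to-DB mapping is the one given in the text.

```java
// Hypothetical sketch of rule (2): hash-modulo routing across 4 databases,
// using the mapping from the text (remainder 1 -> DB1, ..., remainder 0 -> DB4).
public class HashModRouter {
    private static final int DB_COUNT = 4;

    /** Returns the database number for a numeric user_id. */
    public static int route(int userId) {
        int mod = userId % DB_COUNT;       // 0 .. 3 for non-negative ids
        return mod == 0 ? DB_COUNT : mod;  // map remainder 0 to DB4
    }

    /** For non-numeric keys: hash first, then apply the same modulo rule
     *  (ignoring the Integer.MIN_VALUE edge case in this sketch). */
    public static int route(String userKey) {
        int mod = Math.abs(userKey.hashCode()) % DB_COUNT;
        return mod == 0 ? DB_COUNT : mod;
    }
}
```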

Advantage: data is distributed uniformly.

Disadvantage: data migration is troublesome, and data cannot be allocated according to machine capacity.

(3) Store the database configuration in a dedicated routing library:

That is, create a DB that stores the mapping from user_id to the target DB; before every database access, we first query this mapping DB to obtain the specific target DB, and only then perform the actual operation.

Advantage: strong flexibility, a one-to-one relationship.

Disadvantage: an extra query before every query, so performance suffers considerably.
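Rule (3) can be sketched with an in-memory map standing in for the mapping table of the dedicated routing library. The class name `MappingRouter` and its methods are invented for the example; in practice the map would be a SQL lookup against the routing DB, likely fronted by a local cache to reduce the per-query overhead mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of rule (3): look up the target DB for each user
// in a mapping store (here an in-memory stand-in for the routing library).
public class MappingRouter {
    private final Map<Integer, Integer> userToDb = new HashMap<>();

    /** Records that this user's data lives in the given database. */
    public void register(int userId, int db) {
        userToDb.put(userId, db);
    }

    /** Returns the DB number for a user, or -1 when no mapping exists. */
    public int route(int userId) {
        return userToDb.getOrDefault(userId, -1);
    }
}
```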

These are the three approaches we usually choose among in development; some more complex projects may mix all three. The above should give a basic understanding of the partitioning rules. Better and more complete partitioning schemes surely exist, awaiting our continued exploration and discovery.

Chapter 3: The Basic Outline of This Research

In the text above, following the human pattern of understanding things, what? why? how?, we have explained some concepts and the significance of database partitioning and summarized some conventional segmentation rules. The distributed data layer discussed in this paper is not just that, however; it is a complete data layer solution. What does it look like? In the following text, I elaborate the complete idea and implementation approach of this research topic.

The distributed data layer provides the following capabilities:

(1) Partitioning rules and routing rules (RouteRule, abbreviated RR) are provided; the three segmentation rules described above are embedded directly into the system, and the concrete way they are embedded is described in detail in the next section;

(2) The concept of the cluster (group) is introduced to guarantee high availability of the data;

(3) A load balancing strategy (LoadBalancePolicy, abbreviated LB) is introduced;

(4) An availability detection mechanism for cluster nodes is introduced, to detect the availability of individual machines so that the LB strategy is applied correctly and the system remains highly stable;

(5) Read/write separation is introduced to improve data query speed.

A data layer design with only database partitioning is not complete: what happens when the DB server at some node goes down? Yes, we adopted a partitioning scheme, meaning n machines together make up one complete DB; if one machine goes down, only one n-th of the data becomes inaccessible. That is acceptable, at least better than the situation before partitioning, when the whole DB would become inaccessible. For ordinary applications, data being unreachable because of such a machine failure may be tolerable, but what if our system is a high-concurrency e-commerce site? The economic loss caused by a single node going down can be very serious. In other words, our scheme still has problems; its fault tolerance cannot stand the test. Of course, problems always have solutions. We introduce the concept of the cluster, which here I call a group: for each partition node we introduce several machines, each holding the same data. In normal operation these machines share the load, and when one goes down, the load balancer distributes the load to the remaining machines. In this way the fault tolerance problem is solved. So we introduced the concept of the cluster and embedded it in our framework as part of it.

As shown in the figure above, the entire data layer consists of three clusters, Group1, Group2, and Group3. The three clusters are the result of horizontally partitioning the data; together they make up one DB containing the complete data. Each group consists of 1 master (there can of course be several masters) and n slaves, and the data on these masters and slaves is consistent. For example, if one slave in Group1 goes down, two more slaves are still usable. Such a model never leaves some part of the data inaccessible, unless every machine in a whole group goes down at once, and the probability of that is very small (barring a power outage, it would hardly happen).

Before the cluster was introduced, our query flow was roughly: the application requests the data layer and passes the necessary partitioning field (usually user_id); the data layer routes to the specific DB based on that field; the operation is performed within that DB. That is the flow without clusters; what does it look like once the cluster is introduced? As Figure 1 shows, the rules and policies of our router can in fact only route to a specific group, that is, to a virtual group rather than a particular physical server. The next job is to find a specific physical DB server on which to perform the data operation. For this step we introduce the concept of the load balancer (LB). The responsibility of the load balancer is to locate a specific DB server, with the following rule: the load balancer analyzes the read/write characteristics of the current SQL; if it is a write, or an operation requiring strong real-time consistency, the query load is assigned directly to the master; if it is a read, a slave is assigned through the load balancing policy. Our main research focus for the load balancer is the load distribution strategy, which includes random load balancing and weighted load balancing. Random load balancing is easy to understand: randomly select one slave out of the n slaves. Such random balancing ignores machine performance; it assumes every machine performs the same. If that is actually the case, fine, but what if it is not? If the physical performance and configuration of each slave differ, using random balancing regardless of performance is unscientific: it imposes unnecessarily high load, even the risk of going down, on the weaker machines, while the high-performance database servers cannot make full use of their capacity.
With this in mind, we introduce weighted load balancing: through a certain interface of our system, a weight can be assigned to each DB server, and at run time the LB allocates load to the DB servers in the cluster in proportion to their weights. Of course, introducing this concept inevitably increases the complexity and maintenance burden of the system. Something is gained and something is lost; there is no way around it.
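The weighted selection at the heart of the scheme above can be sketched as follows. This is an illustrative sketch with an invented class name (`WeightedBalancer`); the standard technique shown, sampling an index with probability proportional to its weight, is one straightforward way to realize the proportional allocation described in the text.

```java
import java.util.Random;

// Hypothetical sketch of weighted load balancing: pick a slave index with
// probability proportional to its configured weight.
public class WeightedBalancer {
    private final int[] weights;
    private final int total;
    private final Random random;

    public WeightedBalancer(int[] weights, Random random) {
        this.weights = weights.clone();
        int sum = 0;
        for (int w : weights) sum += w;
        this.total = sum;
        this.random = random;
    }

    /** Returns a slave index; index i is chosen with probability weights[i]/total. */
    public int pick() {
        int r = random.nextInt(total); // uniform in 0 .. total-1
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) return i;
        }
        return weights.length - 1; // not reached when total is correct
    }
}
```

Setting all weights equal recovers plain random load balancing, so one mechanism covers both strategies.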

With partitioning, clusters, and a load balancer, is everything fine? Things are far less simple than we might think. With all this, our data layer can indeed withstand heavy pressure, but such a design does not completely eliminate the danger of database downtime. If slave2 in Group1 goes down and the system's LB does not know it, the situation is dangerous: believing slave2 to be available, the LB will still assign load to it. Problems then surface, and the client naturally sees errors or exceptions when data operations fail. That is very unfriendly! How do we solve such a problem? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability data push mechanism. What is the difference between the two? First the detection mechanism: as the name implies, the client side of the data layer periodically probes each database in the cluster for availability. The principle is to attempt a link, for instance by trying the database port, or by attempting a JDBC connection and using Java's exception mechanism to judge availability; these are mentioned again later in the text. What about the push mechanism? This must be discussed against the real application scenario: in general, when a database goes down, the DBA certainly knows about it; at that point the DBA manually pushes the current state of the database to the client, that is, to the application side of the distributed data layer, which then updates its local list of DB states and tells the LB that this database node is unavailable, so no load should be assigned to it. One is an active listening mechanism, the other a passive notification mechanism; each has its strengths, but both achieve the same effect. Thus the hypothetical problem above will not occur, and even if it does, the probability of it causing harm is minimized.
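The active detection mechanism can be sketched as follows. This is an illustrative sketch with invented names (`AvailabilityMonitor`, `checkAll`); the probe is injected as a predicate so the logic is testable, whereas in practice it might attempt a JDBC connection and treat an exception as "unavailable", as the text suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of active availability detection: periodically probe
// each node and keep a local availability map for the load balancer to consult.
public class AvailabilityMonitor {
    private final Map<String, Boolean> alive = new LinkedHashMap<>();
    private final Predicate<String> probe; // e.g. a JDBC connect attempt

    public AvailabilityMonitor(Iterable<String> nodes, Predicate<String> probe) {
        this.probe = probe;
        for (String n : nodes) alive.put(n, true); // assume up until probed
    }

    /** One detection round: re-probe every node and update the local state. */
    public void checkAll() {
        for (String n : alive.keySet()) alive.put(n, probe.test(n));
    }

    /** Consulted by the LB before assigning load to a node. */
    public boolean isAvailable(String node) {
        return alive.getOrDefault(node, false);
    }
}
```

The push mechanism would update the same local state, only driven by a DBA notification instead of a probe.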

The master and slave mentioned above have not yet been explained in depth. As shown in Figure 1, a group consists of 1 master and n slaves. Why? The master carries the write load, meaning all write operations are performed on the master, while read operations are apportioned among the slaves. This can greatly improve read efficiency. In typical Internet applications, data surveys have concluded that the read/write ratio is about 10:1, that is, the bulk of data operations are reads, which is why we have multiple slaves. But why separate reads from writes? Developers familiar with DBs know that writes involve locking, whether row locks, table locks, or block locks, all of which hurt system execution efficiency. Our separation concentrates writes on one node while reads take place on the other n nodes, which from another angle effectively improves read efficiency and guarantees the high availability of the system. Read/write separation also brings new problems, for example: how do the master's data and the data on the slave machines in the cluster stay synchronized and consistent? This is not a problem we need to pay much attention to; MySQL's proxy mechanism can help us do this, and since that mechanism is not closely related to this topic, it is not introduced in detail here.
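The read/write routing decision described above can be sketched in a few lines. This is an illustrative sketch with invented names (`ReadWriteRouter`, `route`); detecting reads by the leading SELECT keyword is a simplification, and the `requireRealtime` flag stands for the "strong real-time" reads that the text says must go to the master.

```java
// Hypothetical sketch of read/write separation: writes and strong-real-time
// reads go to the master; ordinary reads go to a slave.
public class ReadWriteRouter {
    /** Returns "master" or "slave" for a given SQL statement. */
    public static String route(String sql, boolean requireRealtime) {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        return (isRead && !requireRealtime) ? "slave" : "master";
    }
}
```

Which slave receives the read would then be decided by the load balancing policy discussed earlier.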

To sum up, such are the general functions of the distributed data layer studied in this thesis. The above has discussed and elaborated some of its basic principles; next we analyze and study it at the level of system design.

Chapter 4: System Design

4.1 Choosing the Implementation Level of the System

As mentioned in the introduction, there are two choices for the implementation level of the system: one based on the JDBC level, and one based on an existing data persistence framework such as Hibernate or iBatis. Each has its strengths and weaknesses. An implementation at the JDBC level would greatly raise the difficulty of both developing the system and using it later, greatly increasing development and maintenance costs. This project is therefore positioned on top of the mature iBatis persistence framework, and rather than modifying the iBatis source code directly, it avoids being overly intrusive to the existing framework, thereby increasing flexibility of use. The reasons for choosing iBatis are as follows:

(1) iBatis has a very low learning cost; a skilled Java programmer can become proficient with it in a very short time;

(2) iBatis is a lightweight ORM that simply completes the RO/OR mapping. Its queries are plain SQL configured in the sql-map.xml configuration file, meaning it introduces no concept like Hibernate's HQL, which enhances the controllability of the SQL: a good DBA can optimize at the SQL level, keeping the application's data layer highly controllable. Hibernate is powerful, but because it is a heavy OR encapsulation and introduces HQL, it is not easy for a DBA team to control and tune the performance of the SQL statements.

For the above two reasons, this thesis chooses the easy-to-learn, lightweight persistence framework iBatis as its ORM product; the following discussion is likewise based on iBatis.

4.2 Choice of Other Open Source Frameworks

In large Java applications we typically adopt an open source framework such as Spring, particularly its IoC (DI) part, which effectively helps developers manage object dependencies and layering and reduces the coupling between the layers of the system. The advantages and usefulness of Spring are well known to developers and need not be repeated here. The data layer of this project also adopts Spring as its IoC (DI) framework.

4.3 Development Technologies and Tools

Development language: Java JDK1.5

Integrated development environment: Eclipse 3.3.4

Test server in Web environment: JBoss 4.2

Build tool: Taobao's in-house build tool Antx (similar to Maven); Maven can of course also be used

Dependent open source jars: Spring 2.0, iBatis, commons-configuration (for reading config files), log4j, JUnit, etc.
