As our hardware flash sales kept escalating, the payment team faced surges of a hundred or even a thousand times normal traffic. Payment is the last link in a purchase, so fast and stable payment for users is critical. On November 15 we therefore carried out a comprehensive architecture upgrade of the entire payment system, enabling it to sustain 100,000 orders per second and providing strong support for flash-sale activities in all their forms.

I. Database and Table Sharding
In the era of Redis, Memcached, and other caching systems, building a system that supports 100,000 reads per second is not complicated: simply scale the cache nodes with consistent hashing, scale the Web servers horizontally, and so on. But for the payment system to process 100,000 orders per second, it needs hundreds of thousands of database updates per second (inserts plus updates), which is impossible for any single database machine. So the first thing we did was shard the order table across databases and tables.
Almost every database operation carries a user ID (UID), so we chose the UID as the sharding key.
For the database sharding strategy we chose a "binary tree": whenever we expand capacity, we always double the number of databases. For example, we expand from 1 to 2, from 2 to 4, from 4 to 8, and so on. The advantage is that during expansion the DBA only needs to synchronize data at the table level, without writing scripts for row-level data migration.
Sharding across databases alone was not enough. Continuous stress testing showed that concurrent updates to multiple tables in the same database are more efficient than concurrent updates to a single table, so we also split the order table into 10 parts within each database: order_0, order_1, ..., order_9.
Finally, we placed the order table in 8 databases (numbered 1 to 8, corresponding to DB1 to DB8), each holding 10 sub-tables (numbered 0 to 9, corresponding to order_0 to order_9). The deployment structure is shown in the following illustration:
Calculate the database number from the UID:
Database number = (UID / 10) % 8 + 1
Calculate the table number from the UID:
Table number = UID % 10
When UID = 9527, the algorithm effectively splits the UID into two parts, 952 and 7: 952 modulo 8 plus 1 equals 1, which is the database number, and 7 is the table number. So the order information for UID = 9527 is found in table order_7 of database DB1. The algorithm flow can also be seen in the following figure:
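As a minimal sketch, the routing rules above look like this in Java (class and method names are our own for illustration; they are not from the Mango framework):

```java
public class OrderRouter {
    // Database number = (UID / 10) % 8 + 1
    static int databaseNumber(long uid) {
        return (int) ((uid / 10) % 8) + 1;
    }

    // Table number = UID % 10
    static int tableNumber(long uid) {
        return (int) (uid % 10);
    }

    public static void main(String[] args) {
        long uid = 9527;
        // 952 % 8 = 0, so the database number is 1; 9527 % 10 = 7
        System.out.println("DB" + databaseNumber(uid) + ".order_" + tableNumber(uid));
        // prints "DB1.order_7"
    }
}
```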
With the sharding structure and algorithm settled, the last step was choosing a sharding tool. There are roughly two kinds on the market: client-side sharding frameworks, where the client computes the shard and connects to the database directly, and sharding middleware, where the client talks to a proxy that performs the sharding on its behalf.
Both kinds of tools are available on the market, and we will not list them individually here; instead, here are their overall advantages and disadvantages. A client-side framework connects to the database directly, so its performance is 15% to 20% higher than going through sharding middleware. Middleware, on the other hand, provides unified management, separates sharding logic from the client, divides modules more clearly, and is easier for DBAs to manage.
We chose client-side sharding, because we developed and open-sourced a data-layer access framework of our own, code-named "Mango". The Mango framework natively supports database and table sharding, and configuration is very simple.
Mango home: mango.jfaster.org
Mango source: github.com/jfaster/mango

II. Order ID
The order system's IDs must be globally unique. The simplest approach is to use a database sequence, where each operation obtains a globally unique ID. But to support 100,000 orders per second we need to generate at least 100,000 order IDs per second, which a database obviously cannot do. So our only option is to generate globally unique order IDs through in-memory computation.
The best-known unique ID in the Java world is probably the UUID, but UUIDs are too long and contain letters, making them unsuitable as order IDs. After repeated comparison and screening, we borrowed Twitter's Snowflake algorithm to produce globally unique IDs. The following is a simplified structure diagram of the order ID:
The ID in the figure above is divided into 3 parts:
1. Timestamp: millisecond granularity; System.currentTimeMillis() is used as the timestamp when the order ID is generated.
2. Machine number: each order server is assigned a unique number, which is used directly as the machine number in the order ID.
3. Auto-increment sequence number: when multiple requests on the same server ask for an order ID within the same millisecond, this number is incremented for each of them, and restarts at 0 in the next millisecond. For example, if 3 requests on one server ask for an order ID in the same millisecond, the sequence portions of their order IDs will be 0, 1, and 2 respectively.
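A minimal Snowflake-style generator along the lines described above might look like this. The bit widths and class name are illustrative assumptions, not the production layout, and clock-rollback and sequence-overflow handling are omitted for brevity:

```java
public class OrderIdGenerator {
    private static final int MACHINE_BITS = 10;   // assumed width
    private static final int SEQUENCE_BITS = 12;  // assumed width

    private final long machineId;   // unique number assigned to this server
    private long lastMillis = -1L;
    private long sequence = 0L;

    public OrderIdGenerator(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence++;     // same millisecond: bump the sequence
        } else {
            sequence = 0;   // new millisecond: restart at 0
            lastMillis = now;
        }
        // Production code would subtract a custom epoch from the timestamp
        // and guard against sequence overflow within one millisecond.
        return (now << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```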
Combining the 3 parts above, we can quickly generate globally unique order IDs. But that alone is not enough: very often we need to query order information by order ID alone, and without the UID we would not know which shard to query, forcing a scan of every table in every database. That is obviously unacceptable. So we need to embed the sharding information in the order ID itself. Here is the simplified diagram of the order ID with sharding information added:
We add the database and table information at the head of the generated global order ID, so that we can quickly query the corresponding order using the order ID alone.
What exactly does the sharding information contain? As discussed in the first section, we split the order table by the UID dimension into 8 databases of 10 tables each. The simplest form of sharding information is a string of length 2: the 1st digit is the database number, ranging from 1 to 8, and the 2nd digit is the table number, ranging from 0 to 9.
Using the database and table numbering algorithm from the first section, when UID = 9527 the database info is 1 and the table info is 7; combined, the two-digit sharding info is "17". The algorithm flow is shown in the following figure:
Using the table number as the table info poses no problem, but using the database number as the database info carries a hidden danger. Considering future expansion, suppose we need to grow from 8 databases to 16: database info in the range 1 to 8 cannot represent databases 9 to 16, and routing would fail. We call this problem a loss of sharding-info precision.
To solve this precision problem, we must build redundant precision into the database info; that is, the sharding info we save now must already support future expansion. Here we assume we will eventually expand to 64 databases, so the new algorithm is:
Database info = (UID / 10) % 64 + 1
When uid=9527, according to the new algorithm, the sub-Library information = 57, where 57 is not a real database number, it redundant the last expansion to 64 database of the library information precision. We currently have only 8 databases, and the actual database number needs to be calculated according to the following formula:
Actual database number = (database info - 1) % 8 + 1
When UID = 9527, the database info is 57, the actual database number is 1, and the full sharding info is "577".
Since we chose modulo 64 to preserve redundant precision, the database info grows from 1 digit to 2, and the full sharding info grows to 3 digits. The algorithm flow can also be seen in the following figure:
As shown in the figure above, because the database info is calculated modulo 64 with redundant precision, our system can later expand to 16, 32, or 64 databases without any problem.
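The redundant-precision scheme above can be sketched as follows (class and method names are hypothetical, chosen for illustration):

```java
public class ShardingInfo {
    // Database info = (UID / 10) % 64 + 1; mod 64 reserves precision
    // for growth up to 64 databases.
    static int databaseInfo(long uid) {
        return (int) ((uid / 10) % 64) + 1;
    }

    // Actual database number = (database info - 1) % 8 + 1, with 8 databases today.
    static int actualDatabaseNumber(int databaseInfo) {
        return (databaseInfo - 1) % 8 + 1;
    }

    // The 3-digit sharding info: 2-digit database info plus 1-digit table number.
    static String shardingInfo(long uid) {
        return String.format("%02d", databaseInfo(uid)) + (uid % 10);
    }

    public static void main(String[] args) {
        long uid = 9527;
        System.out.println(shardingInfo(uid));                       // prints "577"
        System.out.println(actualDatabaseNumber(databaseInfo(uid))); // prints "1"
    }
}
```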
The order ID structure above already satisfies our current and future expansion needs, but given business uncertainty we added 1 more digit in front of the order ID to identify the ID's version. This version number is redundant data and is not currently used. Here is the final simplified diagram of the order ID:
Snowflake algorithm: github.com/twitter/snowflake

III. Eventual Consistency
So far, by sharding the order table on the UID dimension, we have achieved very high concurrency for order writes and updates, and we can query order information by UID and order ID. But as an open payment system, we also need to query orders by business line ID (also called merchant ID, "bid" for short). So we introduced a bid-dimension order table cluster, into which the UID-dimension cluster is redundantly replicated; to query orders by bid, we query the bid-dimension cluster.
The scheme above sounds simple, but keeping the two order table clusters consistent is troublesome. The two clusters clearly live in different database clusters, and introducing strongly consistent distributed transactions on writes and updates would greatly reduce system throughput and increase response time, which is unacceptable to us. We therefore introduced a message queue for asynchronous data synchronization, achieving eventual consistency. Of course, various message-queue failures can still leave the data inconsistent, so we also introduced a real-time monitoring service that continuously computes the difference between the two clusters and repairs it.
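A toy sketch of this asynchronous synchronization, with an in-memory queue standing in for the real message queue and maps standing in for the two table clusters (all names are our own, not the production components):

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class OrderSync {
    static final Map<Long, String> uidCluster = new ConcurrentHashMap<>();
    static final Map<Long, String> bidCluster = new ConcurrentHashMap<>();
    static final BlockingQueue<Long> queue = new LinkedBlockingQueue<>();

    // Synchronous write to the UID-dimension cluster, then enqueue a sync message.
    static void createOrder(long orderId, String order) {
        uidCluster.put(orderId, order);
        queue.offer(orderId);
    }

    // Consumer: replicate one order into the BID-dimension cluster.
    // The monitoring service would catch orders this path fails to deliver.
    static void drainOnce() {
        Long orderId = queue.poll();
        if (orderId != null) {
            bidCluster.put(orderId, uidCluster.get(orderId));
        }
    }
}
```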
The following is a simplified consistency synchronization diagram:
IV. Database High Availability
No machine or service can guarantee stable operation online forever. For example, at some moment a database master may go down; we can then no longer read from or write to it, and online services are affected.
Database high availability means that when a database runs into trouble for any reason, database service can be restored and data repaired in real time or very quickly, so that from the perspective of the whole cluster it is as if nothing happened. Note that restoring database service here does not necessarily mean fixing the original database; it may also mean switching the service over to a standby database.
The main tasks of database high availability are service recovery and data repair, and the quality of a high-availability scheme is generally measured by how long these two tasks take. There is a vicious circle here: the longer recovery takes, the more inconsistent data accumulates, so the longer the data repair takes, and the longer the total repair time becomes. Fast recovery is therefore the top priority: imagine completing database recovery within 1 second of a failure, and both the amount of inconsistent data and the cost of repairing it shrink dramatically.
The following figure shows one of the most classic master-slave structures:
The figure above has 1 Web server and 3 databases, where DB1 is the master and DB2 and DB3 are slaves. Here we assume the Web server is maintained by the project team and the database servers are maintained by DBAs.
When slave DB2 fails, the DBA notifies the project team; the team removes DB2 from the Web service's configuration list and restarts the Web server, so the failed node DB2 is no longer accessed and the database service is restored. When the DBA has repaired DB2, the team adds it back to the Web service.
When master DB1 fails, the DBA switches DB2 over to be the master and notifies the project team; the team replaces DB1 with DB2 in the configuration and restarts the Web server, so the Web service uses the new master DB2, DB1 is no longer accessed, and the database service is restored. When the DBA has repaired DB1, it becomes a slave of DB2.
The classic architecture above has a big drawback: whether the master or a slave fails, restoring service requires the DBA and the project team to cooperate, which is hard to automate and makes recovery too slow.
We believe database operations should be decoupled from the project team: when a database fails, the DBA should be able to restore it in a unified way, without any action from the project team's services. This makes automation easier and shortens service recovery time.
Let's first look at the slave high-availability diagram:
As shown above, the Web server no longer connects to slaves DB2 and DB3 directly, but instead connects to LVS load balancers, which in turn connect to the slaves. The benefit is that LVS automatically senses whether a slave is available: if slave DB2 goes down, LVS stops routing read requests to it. And when the DBA needs to add or remove slave nodes, operating on LVS alone is enough; the project team no longer needs to update configuration files or restart servers in coordination.
Now look at the master high-availability diagram:
As shown above, the Web server no longer connects to master DB1 directly, but to a keepalived virtual IP that is mapped to DB1; a slave DB_bak is added and synchronizes DB1's data in near real time. Under normal circumstances the Web still reads and writes DB1; when DB1 goes down, a script automatically promotes DB_bak to master and remaps the virtual IP to it, so the healthy DB_bak serves reads and writes. The whole master recovery takes only a few seconds.
Combining the two structures above gives the overall master-slave high-availability diagram:
Database high availability also includes data repair. Because we write a log before performing updates on core data, and combined with near-real-time fast recovery of database service, the amount of data to repair is small, and a simple repair script can finish it quickly.

V. Data Classification
Besides the core payment order table and payment flow table, the payment system also has configuration tables and user-related tables. If all reads hit the database, system performance suffers badly, so we introduced a data classification mechanism.
We simply divided the payment system's data into 3 levels:
Level 1: order data and payment flow data. These require high real-time accuracy, so no cache is added; reads and writes operate on the database directly.
Level 2: user-related data. This data is tied to individual users and is read far more than it is written, so we cache it in Redis.
Level 3: payment configuration information. This data is unrelated to users; it is small, read frequently, and almost never modified, so we cache it in local memory.
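A minimal sketch of the level-3 local-memory cache, with an update hook of the kind a push message would trigger (class and method names are our own, not the production code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LocalConfigCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Reads never touch the database; config lives in local memory.
    String get(String key) {
        return cache.get(key);
    }

    // Called when a config-update message is pushed to this server.
    void onConfigUpdated(String key, String value) {
        cache.put(key, value);
    }
}
```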
The local memory cache has a data synchronization problem: because configuration is cached in local memory, a server cannot perceive modifications to the configuration in the database, which can make the database and the local caches inconsistent.
To solve this problem, we developed a highly available message push platform. When configuration is modified, we use the platform to push a configuration-update message to every server of the payment system; on receiving the message, each server automatically refreshes its configuration and reports success.

VI. Thick and Thin Pipes
Hacker attacks, front-end retries, and other causes can make requests surge. If a surge of requests kills our services, recovering from it is a very painful and tedious process.
A simple example: our current order processing capacity averages 100,000 orders per second, with a peak of 140,000 per second. If 1 million orders entered the payment system in the same second, the whole payment system would undoubtedly collapse, and the endless stream of follow-up requests would keep the service cluster from coming back up. The only way out would be to cut off all traffic, restart the entire cluster, and then slowly let traffic back in.
Adding a layer of "thick and thin pipe" in front of the external Web servers solves this problem nicely.
Here is a simple structure diagram of the thick and thin pipe:
In the figure above, HTTP requests pass through the thick-thin pipe before entering the Web cluster. The entry is the thick end: we set it to accept at most 1 million requests per second, and excess requests are dropped directly. The exit is the thin end: we set it to release 100,000 requests per second to the Web cluster. The remaining 900,000 requests queue inside the pipe, and as the Web cluster finishes old requests, new ones flow out of the pipe to be processed. This way the Web cluster never handles more than 100,000 requests per second; under that load every service in the cluster runs normally, and the cluster as a whole will not stop serving because of a request surge.
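The thick/thin behavior can be illustrated with a toy bounded queue: the bounded entry is the "thick" end (excess requests dropped), and the consumer pulling at its own pace is the "thin" end. Capacities and names here are illustrative, not the production setup:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThickThinPipe {
    private final BlockingQueue<Runnable> pipe;

    ThickThinPipe(int capacity) {
        this.pipe = new ArrayBlockingQueue<>(capacity);
    }

    // Thick entry: accept up to the capacity, drop the rest immediately.
    boolean submit(Runnable request) {
        return pipe.offer(request);
    }

    // Thin exit: the Web cluster pulls requests at its own pace
    // (null means the pipe is currently empty).
    Runnable next() {
        return pipe.poll();
    }
}
```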
How are thick and thin pipes implemented? The commercial version of Nginx already supports this; for details, search for Nginx max_conns. Note that max_conns limits the number of active connections, so besides the maximum TPS you also need to determine the average response time: active connections ≈ TPS × average response time (e.g. 100,000 TPS at a 50 ms average response time corresponds to about 5,000 active connections).
Nginx documentation: http://nginx.org/en/docs/http/ngx_http_upstream_module.html