As LeEco's hardware flash sales kept escalating, the request pressure on LeEco Group Payment surged a hundredfold or even a thousandfold. As the final link in a purchase, ensuring that users complete payment quickly and stably is especially important. So in November 2015 we carried out a full architecture upgrade of the entire payment system, giving it the ability to stably process 100,000 orders per second and providing strong support for the various flash-sale and seckill activities across the LeEco ecosystem.
I. Database and Table Sharding
In the Internet era, with caching systems such as Redis and Memcached in widespread use, building a system that supports 100,000 reads per second is not complicated: simply scale the cache nodes with consistent hashing, scale the web servers horizontally, and so on. But for a payment system to process 100,000 orders per second, what is needed is hundreds of thousands of database update operations per second (inserts plus updates), which is a task no single database can handle. So the first thing we had to do was shard the order table across multiple databases and tables.
Database operations on the order table generally carry a user ID (UID) field, so we chose the UID as the sharding key for both the database split and the table split.
For the database sharding strategy we chose "binary tree sharding". By "binary tree sharding" we mean that whenever we expand the databases, we expand by a factor of 2: from 1 instance to 2, from 2 to 4, from 4 to 8, and so on. The advantage of this approach is that during expansion the DBA only needs to perform table-level data synchronization, and we do not have to write our own scripts for row-level data migration.
Having separate databases alone is not enough. After continuous stress testing we found that concurrent updates to multiple tables in the same database are much more efficient than concurrent updates to a single table, so we split the order table into 10 copies in each database: order_0, order_1, ..., order_9.
In the end we spread the order table across 8 databases (numbered 1 to 8, corresponding to DB1 to DB8), each holding 10 tables (numbered 0 to 9, corresponding to order_0 to order_9). The deployment structure is shown below:
Calculate database number by UID:
Database number = (UID / 10) % 8 + 1
Calculate the table number according to the UID:
Table number = UID % 10
When uid = 9527, the algorithm above in effect splits the UID into two parts, 952 and 7: 952 modulo 8 plus 1 equals 1, which is the database number, and 7 is the table number. So the order information for uid = 9527 is found in table order_7 of database DB1. The specific algorithm flow is shown below:
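To make the routing concrete, here is a minimal Java sketch of the two formulas above; the class and method names are purely illustrative and are not part of the Mango framework.

```java
/**
 * Illustrative routing helper: maps a UID to a database number (1-8)
 * and a table number (0-9), following the formulas above.
 */
public class OrderShardRouter {

    private static final int DB_COUNT = 8;     // DB1 .. DB8
    private static final int TABLE_COUNT = 10; // order_0 .. order_9

    /** Database number = (UID / 10) % 8 + 1 */
    public static int databaseNumber(long uid) {
        return (int) ((uid / 10) % DB_COUNT) + 1;
    }

    /** Table number = UID % 10 */
    public static int tableNumber(long uid) {
        return (int) (uid % TABLE_COUNT);
    }

    public static void main(String[] args) {
        long uid = 9527L;
        // Prints "uid=9527 -> DB1, order_7", matching the worked example above.
        System.out.printf("uid=%d -> DB%d, order_%d%n",
                uid, databaseNumber(uid), tableNumber(uid));
    }
}
```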
With the sharding structure and algorithm settled, the last step is to find a tool that implements the sharding. There are roughly two kinds of sharding tools on the market:
Client-side sharding: the client performs the sharding operations itself and connects directly to the databases.
Sharding middleware: the client connects to the sharding middleware, and the middleware performs the sharding operations.
Both kinds of tools are available on the market and will not be listed here; each has its pros and cons. Client-side sharding is 15% to 20% better in performance than sharding middleware because it connects directly to the databases. Sharding middleware, because the sharding operations are managed centrally in the middleware and isolated from the client, gives a clearer division of modules and makes unified management easier for the DBAs.
We chose to shard on the client side, because we had already developed and open-sourced a data-access-layer framework code-named "Mango". The Mango framework natively supports database and table sharding, and its configuration is very simple.
Mango homepage: mango.jfaster.org
Mango source: github.com/jfaster/mango
II. Order ID
The order ID of the order system must be globally unique. The simplest approach is a database sequence, where every operation obtains a globally unique auto-increment ID. But to support processing 100,000 orders per second, at least 100,000 order IDs must be generated per second, and generating auto-increment IDs from the database clearly cannot meet this requirement. So we can only compute globally unique order IDs in memory.
The best-known unique ID in the Java world is probably the UUID, but a UUID is too long and contains letters, so it is not suitable as an order ID. After repeated comparison and screening, we drew on Twitter's Snowflake algorithm to implement globally unique order IDs. Below is a simplified structure diagram of the order ID:
It is divided into 3 parts:
1. Timestamp
The timestamp here has millisecond granularity; when an order ID is generated, System.currentTimeMillis() is used as the timestamp.
2. Machine number
Each order server is assigned a unique number, and when the order ID is generated, the unique number is used directly as the machine number.
3. Auto-increment sequence number
When multiple requests to generate an order ID arrive within the same millisecond on the same server, the sequence number is incremented within that millisecond and restarts from 0 in the next millisecond. For example, if 3 order ID requests arrive in the same millisecond on the same server, the sequence numbers of the 3 order IDs will be 0, 1, and 2 respectively. A minimal generator along these lines is sketched below.
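The sketch below shows a single-server Java generator built from these 3 parts. The bit widths (10-bit machine number, 12-bit sequence) follow the common Snowflake layout and are only assumptions; clock-rollback handling, the custom epoch offset usually subtracted from the timestamp, and the sharding prefix described later are omitted.

```java
/**
 * Simplified Snowflake-style generator: timestamp + machine number + sequence.
 * Bit widths are illustrative, not the exact layout used by the payment system.
 */
public class OrderIdGenerator {

    private static final long MACHINE_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long machineId;   // unique number assigned to each order server
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public OrderIdGenerator(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            // Same millisecond: increment the sequence (0, 1, 2, ...).
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted in this millisecond: spin until the next one.
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            // New millisecond: the sequence restarts from 0.
            sequence = 0L;
        }
        lastTimestamp = now;
        return (now << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```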
Combining the 3 parts above lets us generate globally unique order IDs quickly. But uniqueness alone is not enough: very often we query order information directly by order ID, and at that point, because there is no UID, we do not know which database and which table to query. Traversing every table in every database is obviously out of the question, so we need to add the sharding information to the order ID. Below is a simplified structure diagram of the order ID with sharding information:
We added the database and table sharding information to the head of the generated global order ID, so that the corresponding order information can be queried quickly from the order ID alone.
What exactly does the sharding information contain? As discussed in the first part, we split the order table by UID into 8 databases with 10 tables each. The simplest sharding information needs only a string of length 2: the 1st character stores the database number, with values 1 to 8, and the 2nd character stores the table number, with values 0 to 9.
Using the database-number and table-number algorithm from the first part, when uid = 9527 the database info is 1 and the table info is 7; combined, the two-character sharding information is "17". The specific algorithm flow is shown below:
Using the table number as the table info poses no problem, but using the database number as the database info hides a danger. Considering future expansion, if we need to grow from 8 databases to 16, database info with a value range of 1 to 8 cannot represent databases 1 to 16, and database routing would no longer work. We call this problem the loss of precision of the database sharding info.
To solve this loss of precision, we need to store the database sharding info with redundant precision, that is, the database info we save now must already support the future expansion. Here we assume that we will eventually expand to at most 64 databases, so the new database info algorithm is:
Database sharding info = (UID / 10) % 64 + 1
When uid = 9527, under the new algorithm the database sharding info is 57. Note that 57 is not the real database number; it redundantly preserves the precision needed for an eventual expansion to 64 databases. At present we have only 8 databases, and the actual database number is computed with the following formula:
Actual database number = (database sharding info - 1) % 8 + 1
When uid = 9527, the database sharding info is 57, the actual database number is 1, and the full sharding information is "577".
Since we chose modulo 64 to preserve the precision, the length of the database sharding info grows from 1 character to 2, so the total length of the sharding information becomes 3. The specific algorithm flow is shown below:
As shown, taking the modulus against 64 when computing the database sharding info keeps redundant precision, so that when our system later needs to expand to 16, 32, or 64 databases there will be no further problems.
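Under the assumptions above (at most 64 databases in the future, 8 deployed today), the prefix computation might look like the following sketch; the helper names are illustrative.

```java
/**
 * Illustrative computation of the 3-character sharding prefix placed in front
 * of the order ID, using mod-64 redundant precision for the database info.
 */
public class ShardingPrefix {

    private static final int MAX_DB_COUNT = 64;     // precision reserved for future expansion
    private static final int CURRENT_DB_COUNT = 8;  // databases deployed today

    /** Database sharding info = (UID / 10) % 64 + 1, range 1..64, 2 characters. */
    public static int dbShardingInfo(long uid) {
        return (int) ((uid / 10) % MAX_DB_COUNT) + 1;
    }

    /** Actual database number = (database sharding info - 1) % 8 + 1, range 1..8. */
    public static int actualDbNumber(int dbShardingInfo) {
        return (dbShardingInfo - 1) % CURRENT_DB_COUNT + 1;
    }

    /** Full 3-character prefix: 2-character database info + 1-character table number. */
    public static String prefix(long uid) {
        return String.format("%02d%d", dbShardingInfo(uid), uid % 10);
    }

    public static void main(String[] args) {
        long uid = 9527L;
        int info = dbShardingInfo(uid);  // 57
        // Prints "uid=9527 -> prefix=577, actual DB=1", matching the worked example.
        System.out.printf("uid=%d -> prefix=%s, actual DB=%d%n",
                uid, prefix(uid), actualDbNumber(info));
    }
}
```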
The order ID structure above already satisfies our current and future expansion requirements, but given business uncertainty we added 1 more character at the front of the order ID to identify the order ID version. This version number is redundant data and is not used for now. Below is the final simplified structure diagram of the order ID:
Snowflake algorithm: github.com/twitter/snowflake
III. Eventual Consistency
So far, by sharding the order table on the UID dimension, we have achieved extremely high write and update concurrency on the order table, and we can query order information by UID and order ID. However, as the group's open payment system, we also need to query order information by line-of-business ID (also called merchant ID, or bid), so we introduced an order table cluster in the bid dimension and redundantly copy data from the UID-dimension order table cluster into it. To query order information by bid, one simply queries the bid-dimension order table cluster.
The scheme above is simple, but keeping the data of the two order table clusters consistent is a tedious job. The two table clusters obviously live in different database clusters, and introducing strongly consistent distributed transactions on every write and update would undoubtedly reduce system throughput greatly and increase service response time, which is unacceptable to us. So we introduced a message queue to synchronize the data asynchronously and achieve eventual consistency. Of course, various message queue failures can also leave the data inconsistent, so we additionally introduced a real-time monitoring service that continuously computes the differences between the two clusters and synchronizes them back into consistency.
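Conceptually the write path looks like the sketch below. A local BlockingQueue stands in for the real message queue, and the DAO interfaces are hypothetical; the point is only the shape of the flow: write the UID-dimension cluster synchronously, publish an event, and let a consumer replay it into the bid-dimension cluster, with a separate monitor reconciling any differences.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical order record carrying both sharding keys (uid and bid). */
record OrderEvent(long uid, long bid, String orderId, long amountInCents) {}

/** Hypothetical DAOs for the two order table clusters. */
interface UidOrderDao { void insert(OrderEvent order); }
interface BidOrderDao { void upsert(OrderEvent order); }

/** Conceptual sketch of eventual consistency between the UID and bid order table clusters. */
public class OrderReplicationSketch {

    // Stand-in for the real message queue used for asynchronous synchronization.
    private final BlockingQueue<OrderEvent> queue = new LinkedBlockingQueue<>();
    private final UidOrderDao uidOrderDao;
    private final BidOrderDao bidOrderDao;

    public OrderReplicationSketch(UidOrderDao uidOrderDao, BidOrderDao bidOrderDao) {
        this.uidOrderDao = uidOrderDao;
        this.bidOrderDao = bidOrderDao;
    }

    /** Write path: persist synchronously in the UID-dimension cluster, then publish an event. */
    public void createOrder(OrderEvent order) throws InterruptedException {
        uidOrderDao.insert(order);   // strong write on the UID dimension
        queue.put(order);            // asynchronous replication message
    }

    /** Consumer loop: replay events into the bid-dimension cluster (writes must be idempotent). */
    public void runConsumer() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            OrderEvent event = queue.take();
            bidOrderDao.upsert(event);
        }
    }
    // A separate monitoring job periodically diffs the two clusters and repairs any gaps.
}
```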
The following is a simplified consistency synchronization diagram:
IV. High Availability of Databases
No machine or service can be guaranteed to run stably online. For example, at some moment a database master goes down; we then can no longer read from or write to that database, and online services are affected.
So-called database high availability means: when the database runs into problems for whatever reason, the database service can be restored and the data repaired in real time or very quickly, so that, viewed from the whole cluster, it is as if nothing had gone wrong. Note that restoring the database service here does not necessarily mean repairing the original database; it may also mean switching the service over to a standby database.
The main work of database high availability is database recovery and data repair, and we usually take the time needed to complete these two tasks as the yardstick for how good the high availability is. There is a vicious circle here: the longer the database takes to recover, the more inconsistent data accumulates, the longer data repair takes, and therefore the longer the overall recovery time. So fast database recovery is the top priority. Imagine that if we could restore the database service within 1 second of a failure, the amount of inconsistent data to repair, and the cost of repairing it, would both be greatly reduced.
Below is the most classic master-slave structure:
There is 1 web server and 3 databases, where DB1 is the master and DB2 and DB3 are slaves. Here we assume that the web server is maintained by the project team, while the database servers are maintained by the DBAs.
When a problem occurs on slave DB2, the DBA notifies the project team; the project team removes DB2 from the web service's configuration list and restarts the web server, so that the failed node DB2 is no longer accessed and the whole database service is restored. When the DBA has repaired DB2, the project team adds DB2 back to the web service.
When a problem occurs on master DB1, the DBA switches DB2 to be the master and notifies the project team; the project team replaces the original master DB1 with DB2 and restarts the web server, so that the web service uses the new master DB2, DB1 is no longer accessed, and the whole database service is restored. When the DBA has repaired DB1, DB1 is then attached as a slave of DB2.
The classic structure above has a big drawback: whether the master or a slave fails, both the DBA and the project team have to cooperate to restore the database service, which is hard to automate and makes recovery too slow.
We believe database operations should be decoupled from the project team: when a database problem occurs, the DBAs should restore it in a unified way without requiring the project team to operate on the service, which makes automation easier and shortens the service recovery time.
Let us first look at the high-availability structure diagram for the slaves:
As shown, the web server no longer connects directly to the slaves DB2 and DB3, but instead connects to an LVS load balancer, and LVS connects to the slaves. The advantage is that LVS automatically senses whether a slave is available: after slave DB2 goes down, LVS stops sending read requests to it. And when the DBA needs to add or remove slave nodes, only LVS has to be operated; the project team no longer needs to update configuration files and restart servers to cooperate.
Now look at the high-availability structure diagram for the master:
As shown, the web server no longer connects directly to master DB1, but instead connects to a virtual IP managed by keepalived, which is mapped to master DB1, and a standby DB_bak is added that replicates the data in DB1 in real time. Normally the web server reads and writes through DB1; when DB1 goes down, a script automatically promotes DB_bak to master and remaps the virtual IP to DB_bak, and the web service then reads and writes through the healthy DB_bak as the master. In this way it takes only a few seconds to restore the master database service.
Combining the structures above gives the overall master-slave high-availability structure diagram:
Database high availability also includes data repair. Because we always write a log before performing any update on core data, and because we have achieved near real-time fast recovery of the database service, the amount of data that needs repairing is small, and a simple recovery script can complete the data repair quickly.
V. Data Grading
Besides the most core payment order table and payment flow table, the payment system also has some configuration tables and some user-related tables. If every read operation went to the database, system performance would suffer, so we introduced a data grading mechanism.
We simply divide the data of the payment system into 3 levels:
Level 1: order data and payment flow data. These two kinds of data demand very high real-time accuracy, so no cache is added and reads and writes go directly to the database.
Level 2: user-related data. This data is read often and written rarely, so we cache it with Redis.
Level 3: payment configuration information. This data is unrelated to any user, is small in volume, is read frequently and almost never modified, so we cache it in local memory.
The local memory cache brings a data synchronization problem: because the configuration information is cached in memory, local memory cannot sense that the configuration has been modified in the database, which can make the data in the database and the data in local memory inconsistent.
To solve this problem, we developed a highly available message push platform. When the configuration information is modified, we use the push platform to push a configuration-update message to every server of the payment system; a server that receives the message automatically refreshes its configuration information and reports success.
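A minimal sketch of such a level-3 local cache is shown below; the ConfigDao interface and the refresh hook are hypothetical stand-ins for the real configuration table access and the push-platform listener.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of a level-3 local-memory configuration cache that is refreshed
 * whenever a configuration-update message is pushed to the server.
 */
public class LocalConfigCache {

    /** Hypothetical DAO that loads the full payment configuration from the database. */
    public interface ConfigDao {
        Map<String, String> loadAll();
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ConfigDao configDao;

    public LocalConfigCache(ConfigDao configDao) {
        this.configDao = configDao;
        refreshFromDatabase();  // initial load on startup
    }

    /** Reads hit local memory only; the database is touched only on refresh. */
    public String get(String key) {
        return cache.get(key);
    }

    /** Called when the push platform delivers a configuration-update message. */
    public void refreshFromDatabase() {
        Map<String, String> latest = configDao.loadAll();
        cache.clear();
        cache.putAll(latest);
        // After a successful refresh, acknowledge the update message to the push platform (omitted).
    }
}
```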
VI. The Thick-Thin Pipe
Hacker attacks, front-end retries, and other causes can make requests surge. If a wave of surging requests kills our service, getting it back up is a very painful and tedious process.
For example, our current order processing capacity averages 100,000 orders per second, with a peak of 140,000 orders per second. If 1 million orders arrived at the payment system within the same second, our whole payment system would undoubtedly crash, and the continuing flood of requests would keep our service cluster from starting back up. The only way out would be to cut off all traffic, restart the whole cluster, and then slowly let traffic back in.
We add a layer called the "thick-thin pipe" in front of the external web servers, and it solves the problem above nicely.
Below is a simple structure diagram of the thick-thin pipe:
Looking at the diagram above, HTTP requests pass through the thick-thin pipe before entering the web cluster. The entry end is the thick opening: we set it to accept at most 1 million requests per second, and requests beyond that are dropped directly. The exit end is the thin opening, which we set to release 100,000 requests per second to the web cluster. The remaining 900,000 requests queue up inside the pipe, waiting for the web cluster to finish processing old requests before new requests leave the pipe for the web cluster to handle. In this way the number of requests the web cluster handles per second never exceeds 100,000; under this load every service in the cluster runs healthily, and the whole cluster will not stop serving because of a sudden surge of requests.
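The idea can be sketched in a few lines of Java with a bounded queue: a wide entry that drops overflow and a narrow exit drained at a fixed rate. This is only a conceptual illustration of the pipe, not how our system actually implements it (see the Nginx notes below).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Conceptual thick-thin pipe: wide entry with overflow drop, narrow fixed-rate exit. */
public class ThickThinPipe<T> {

    private final BlockingQueue<T> buffer;
    private final int exitRatePerSecond;

    public ThickThinPipe(int capacity, int exitRatePerSecond) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.exitRatePerSecond = exitRatePerSecond;
    }

    /** Thick opening: accept the request if there is room in the pipe, otherwise drop it. */
    public boolean offer(T request) {
        return buffer.offer(request);
    }

    /** Thin opening: release at most exitRatePerSecond queued requests for this second. */
    public int drainOneSecond(Consumer<T> webCluster) {
        int released = 0;
        while (released < exitRatePerSecond) {
            T request = buffer.poll();
            if (request == null) {
                break;  // nothing left queued in the pipe
            }
            webCluster.accept(request);
            released++;
        }
        return released;
    }
}
```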
How do we implement the thick-thin pipe? The commercial version of Nginx supports it; for details, search for Nginx max_conns. Note that max_conns limits the number of active connections, so besides the maximum TPS you also need to know the average response time to set it (roughly, active connections ≈ TPS × average response time).
Nginx documentation: http://nginx.org/en/docs/http/ngx_http_upstream_module.html