How OceanBase supported Alipay's Double Eleven peak of 42 million database requests per second


Hello to everyone who follows the OceanBase database! I am Jiang Zhiyong of the OceanBase team. I am using the DBAplus community live broadcast platform to talk with you about the development of OceanBase over the past eight years and its key features.

First, the development process

The OceanBase database is a financial-grade distributed relational database system developed independently by Alibaba and Ant Financial. It differs from solutions built by modifying an open source database product: the OceanBase kernel consists of more than 1 million lines of code, all written by our own team members, so we have complete control over it. This is very important for the continued development of OceanBase and its wider adoption.

In the eight years since the project started in 2010, OceanBase has been upgraded from version 0.1 to the upcoming version 2.0, and has grown from the original Key-Value storage system into today's fully functional relational database system. Throughout this process, our original focus has not changed: serve the business, solve practical business problems, continuously enhance the product's capabilities, and then serve the business better.

We have followed the cycle of "solving problems → developing products → solving bigger problems → building better products":

  • OceanBase started by solving the massive data storage problem of Taobao favorites, which let a small team gain a foothold and survive;

  • To solve the high availability problem, a three-cluster deployment mode was adopted;

  • SQL functionality was added to reduce business migration costs and support analytic applications;

  • Today it has a fully peer-to-peer architecture and city-level automatic disaster recovery with five data centers across three cities (see: "new normal" scheme for urban-level fault automatic lossless disaster recovery), and it supports mainstream relational database functionality so that businesses can migrate with zero modification. As a result, Alipay's core business can now run on OceanBase.

From the start of the project, one of OceanBase's goals has been to provide reliable relational database services on unreliable hardware. We were born in the fast-growing Internet industry, where the ordering cycle for high-end storage and dedicated servers is too long and supply is limited. The only hardware that can be obtained easily is the ordinary PC server, so OceanBase has to rely on a distributed architecture and use software to achieve financial-grade reliability and data consistency in an unreliable hardware environment.

After more than eight years of practice, OceanBase has gone from Taobao's favorites business to supporting all of Alipay's core business today, and it continues to set world records for peak transaction database processing capacity in the annual "Double Eleven Singles Day". In last year's "Double Eleven Singles Day" promotion, it supported all of Alipay's core businesses (including transactions, payments, membership, and accounts), and peak processing reached 42 million requests per second.

Second, the gradual realization of key features

In terms of features, OceanBase is characterized by linear scalability, high availability, high performance, low cost, and high compatibility with mainstream relational database products.

Cluster architecture features

Since version 1.0, the OceanBase architecture has consisted of fully peer nodes with no shared storage. This architecture eliminates single points of failure: each node has full processing power and manages the data stored on it. In terms of node roles, a few nodes (the root service) are responsible for managing global information such as the cluster topology. These nodes are somewhat special, but every node is capable of taking on this role. If a node currently holding this role fails, the cluster automatically elects a new node to take it over.

In addition, for high availability, the cluster nodes are distributed across several availability zones, which may be different data centers in the same city or data centers in different cities. Each piece of data has replicas in multiple availability zones, and the number of replicas is usually odd; in Ant Financial's practice it is usually three or five. The failure of a minority of replicas does not affect system availability.
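The arithmetic behind the odd replica count can be sketched in a few lines of Python. This is only an illustration of majority counting; the function names are made up for this example and are not part of OceanBase.

```python
def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Replicas that can fail while a majority still survives."""
    return replicas - majority(replicas)


for n in (3, 4, 5):
    print(f"replicas={n} majority={majority(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

With four replicas the system still tolerates only one failure, the same as with three, which is why an even replica count adds cost without adding fault tolerance.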

Million-QPS front-end proxy

On top of the OceanBase cluster, we provide a reverse proxy, OBProxy. Seeing this, you may think of a MySQL cluster built on middleware, but there is a fundamental difference between the two. Simply put, an OceanBase cluster works just as well without OBProxy, with its full processing power.

Then why do we have OBProxy? Mainly for two reasons:

  • One is performance. With OBProxy's routing function, statements can be routed more accurately to the appropriate node, reducing forwarding inside the cluster;

  • The other is fault tolerance. When the network glitches briefly, OBProxy can re-establish the connection so that the business does not notice.

Distributed architecture is transparent to the business

What is the best thing the OceanBase distributed architecture can do for the business? I think it is to be transparent to the business. Through the distributed architecture we achieve high availability and scalability, but to the business system the database should look like a single-node database. This is reflected in the following points:

  • The business does not need to care about the physical location of database objects. After connecting to OBProxy or any OB node, it sees a complete view and can access all the data it is authorized to access.

  • The cluster's SQL feature set is equivalent to a single node's SQL feature set. Standard SQL syntax is used, and SQL functionality is not affected by data distribution. Most features in the current version are independent of data location, but a few are affected by it. For example, we do not support modifying data on multiple nodes in a single DML statement. These issues will be resolved in the upcoming 2.0 release.

  • Transaction ACID properties are fully supported, with a unified transaction interface. The business does not need to distinguish between distributed transactions and stand-alone transactions. The database distinguishes the different scenarios internally and optimizes them accordingly.

  • Failures in the distributed environment are handled automatically and are imperceptible to the business. Through the retry mechanisms of OBServer and OBProxy, most environmental faults can be made transparent to the business, although there are still some differences compared with the stand-alone systems previously built on high-reliability hardware.

Linear expansion

In the process of meeting the rapid development of the business, the primary problem to be solved by the OceanBase database is the scalability problem.

With the fully peer node architecture, we eliminated the single-point write bottleneck of previous versions. The business requires the database to provide non-stop service and always be online, so all operations must be performed online. The traditional approach of vertical scaling does not work, and only horizontal scaling can be adopted. The method is intuitive. How is it done? In three steps:

Add nodes to the cluster → Let the new nodes have service capabilities → Distribute a portion of the load to the new node

The steps look as clear as putting an elephant in a refrigerator, but none of them is easy.

Because OceanBase uses a shared-nothing storage architecture, a newly added node must first hold data before it can share the load. The hardest part is guaranteeing data consistency and avoiding any impact on the business during this process. Whether we are expanding within a data center (adding machines to the data center) or expanding to a new data center (most likely a remote site or a public cloud), everything must be done online. In OceanBase's implementation, this relies mainly on the following mechanisms:

  • Multi-replica mechanism. Multiple replicas are the basis not only for high availability but also for online expansion. In essence, there are two kinds of expansion: in one, new data is written to the new node, and subsequent reads and writes of that data also happen on the new node; in the other, part of the data on an existing node is moved to the new node, which then serves it.
    The first case is easy to handle; the second requires the multi-replica mechanism. Put simply, one of the replicas is migrated from the original node to the new node. This is easy to say, but there are many details to consider, such as how to add a replica at a remote site and how to migrate data efficiently without affecting existing services.
    Reference: Multi-type copy mashup: a tool to reduce the cost of clusters

  • Fine-grained primary/standby relationship. In most traditional systems, the primary/standby relationship is at the node level. Because a node stores a large amount of data, a primary/standby switchover has a very large impact. In OceanBase, the granularity of the primary/standby relationship is the partition, which is very fine. A switchover has relatively little impact on the business and completes within seconds.

  • Location information is refreshed automatically. After an expansion changes a partition's location, the system detects the change the first time the old location is accessed, refreshes the location information, and retries. After successful execution, the correct result is returned to the client. Besides the OBServer side, OBProxy also updates its cached location information based on feedback from the server. Subsequent accesses are routed directly to the correct node without forwarding inside the cluster.
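The refresh-and-retry behavior described in the last point can be illustrated with a toy model. The sketch below is a simplified assumption, not OBProxy or OBServer code: `Cluster`, `Router`, and `StaleLocation` are hypothetical names, and a real system would get the new location from server feedback rather than by reading the cluster's placement table directly.

```python
class StaleLocation(Exception):
    """Raised by a node that no longer hosts the requested partition."""


class Cluster:
    def __init__(self, placement):
        # partition name -> node that currently serves it
        self.placement = dict(placement)

    def execute(self, node, partition, sql):
        if self.placement[partition] != node:
            raise StaleLocation(partition)
        return f"{node} ran {sql!r} on {partition}"


class Router:
    """Caches partition locations and refreshes them when they go stale."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = dict(cluster.placement)
        self.refreshes = 0

    def run(self, partition, sql):
        try:
            return self.cluster.execute(self.cache[partition], partition, sql)
        except StaleLocation:
            # Refresh the cached location, then retry once on the new node.
            self.cache[partition] = self.cluster.placement[partition]
            self.refreshes += 1
            return self.cluster.execute(self.cache[partition], partition, sql)


cluster = Cluster({"p1": "node_a"})
router = Router(cluster)
cluster.placement["p1"] = "node_b"    # the partition moves during expansion
first = router.run("p1", "SELECT 1")  # stale cache: refresh, then retry
second = router.run("p1", "SELECT 1") # routed directly to node_b
print(first, second, router.refreshes)
```

Only the first access after the move pays for the refresh; the client sees a correct result both times.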

Expansion is online, as is shrinking.

High availability

High availability is the foundation of the OceanBase database. The following three articles describe this in detail:

  • The high-availability shortcomings of traditional relational databases

  • How to achieve financial-grade high availability on unreliable hardware?

  • The limitations of the Raft-based distributed consistency protocol and its risks to the database

They cover the differences between OceanBase and traditional databases, and why we chose the Paxos protocol rather than the easier-to-understand Raft protocol when choosing an election protocol. A brief summary follows:

  • High availability in a traditional database relies on dedicated hardware, while OceanBase achieves high availability on ordinary commodity hardware.

  • A traditional database lacks an arbitration mechanism when a failure occurs and requires people to choose between not losing data and not stopping service. When a minority of replicas fail, OceanBase recovers automatically based on the Paxos protocol, with no data loss (RPO=0) and non-stop service (the RTO of affected partitions is less than 30 seconds).

The reason OceanBase uses the Paxos protocol instead of the Raft protocol is that Raft requires logs to be confirmed in order. If an earlier log entry is not confirmed for any reason, later log entries cannot be confirmed either. This has a serious consequence: operations that are independent of each other at the business logic level become interdependent, which has a very large impact on system throughput, especially under high load. Business developers and DBAs can neither predict nor avoid this kind of dependency, and it cannot be resolved effectively when it occurs.
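The effect of ordered confirmation can be shown with a deliberately simplified model. The sketch below is not an implementation of Raft or Paxos; it only compares which log entries can be confirmed when confirmation must follow log order and when each entry is confirmed independently. The function names and the sample data are assumptions made for this illustration.

```python
def confirmed_in_order(acked, last_index):
    """Entries confirmable when every entry must wait for all earlier ones."""
    confirmed = []
    for index in range(1, last_index + 1):
        if index not in acked:
            break  # one unconfirmed entry blocks everything after it
        confirmed.append(index)
    return confirmed


def confirmed_independently(acked, last_index):
    """Entries confirmable when each entry only needs its own majority."""
    return [i for i in range(1, last_index + 1) if i in acked]


# Entries 1 to 5 were sent; entry 2 has not yet been acknowledged by a majority.
acked = {1, 3, 4, 5}
print(confirmed_in_order(acked, 5))       # [1]
print(confirmed_independently(acked, 5))  # [1, 3, 4, 5]
```

In the ordered model, entries 3 to 5 wait for entry 2 even if they belong to unrelated transactions, which is the coupling described above.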

In addition to availability under abnormal conditions, planned operations such as system upgrades and DDL changes must not affect system availability. By upgrading zone by zone, we make the upgrade process gradual and allow it to be rolled back. At the same time, data consistency checks between zones are used to verify the correctness of the newly upgraded system.

By implementing multi-version database object definitions, DDL operations and query and update operations do not have to wait for each other. One more note on multi-version database object definitions: the effect of a DDL operation on a database object is visible to subsequent operations in the same user session, even if that user session corresponds to multiple sessions from OBProxy to different servers in the cluster.
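A minimal sketch of the multi-version idea, under the assumption that each DDL publishes a new immutable version of a table definition while running statements keep the version they started with. `SchemaStore` and its methods are hypothetical names for this illustration, not OceanBase internals.

```python
class SchemaStore:
    """Keeps every published version of one table's column list."""

    def __init__(self, columns):
        self.versions = [tuple(columns)]  # list index = schema version

    def latest(self):
        return len(self.versions) - 1

    def add_column(self, name):
        # DDL appends a new version instead of changing the old one,
        # so it never has to wait for statements using earlier versions.
        self.versions.append(self.versions[-1] + (name,))
        return self.latest()

    def columns(self, version):
        return self.versions[version]


store = SchemaStore(["id", "amount"])
pinned = store.latest()                     # a running query pins version 0
new_version = store.add_column("currency")  # DDL proceeds without blocking

print(store.columns(pinned))       # the running query's view is unchanged
print(store.columns(new_version))  # later statements see the new column
```

A session that has just run a DDL would then use at least `new_version` for its following statements, whichever server they reach.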

High performance

High performance means not only serving capacity but also lower cost, since a larger business can be supported on the same hardware. OceanBase uses an LSM tree to hold updates temporarily in memory and maintains two types of in-memory indexes, a hash index and a B+ tree index, which speed up primary key queries and index range queries respectively, achieving performance close to that of an in-memory database.

Unlike a traditional database, updates are not performed in place, so the write amplification problem of traditional databases does not arise. For a system of typical size, memory can hold one day's worth of incremental data. When the system is running under high load, there are almost no data file writes, which greatly reduces IO; when the system load is light, the in-memory increments are merged into persistent storage in batches.
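The write path described here can be sketched as a toy LSM-style store. This is a minimal illustration under stated assumptions: `TinyLSM` is a hypothetical name, plain dictionaries stand in for both the in-memory hash index and the persistent storage, and deletes, range scans, and the B+ tree index are left out.

```python
class TinyLSM:
    """Toy store: updates go to memory and are merged to baseline in batches."""

    def __init__(self):
        self.memtable = {}        # recent updates, held in memory
        self.baseline = {}        # stands in for persistent storage
        self.baseline_writes = 0  # number of batch writes to baseline

    def put(self, key, value):
        # The baseline is never updated in place; only memory changes.
        self.memtable[key] = value

    def get(self, key):
        # Newest data first, then fall back to the baseline.
        if key in self.memtable:
            return self.memtable[key]
        return self.baseline.get(key)

    def merge(self):
        # Run when load is light: fold the increment into the baseline.
        self.baseline.update(self.memtable)
        self.baseline_writes += 1
        self.memtable.clear()


db = TinyLSM()
db.put("acct:1", 100)
db.put("acct:1", 80)  # the second update also touches only memory
writes_under_load = db.baseline_writes
db.merge()
print(writes_under_load, db.baseline_writes, db.get("acct:1"))
```

However many updates arrive before the merge, the baseline is written once per merge rather than once per update.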

In addition to the performance improvement brought by the read-write separation architecture, we have made a lot of optimizations on the entire execution link, including the following categories:

  • Cache mechanisms. At the data level there are row caches and block caches, which greatly reduce IO reads for frequently accessed data; at the SQL engine level there is an execution plan cache.

  • JIT compilation. Compiled execution is supported in expression evaluation and in stored procedure execution, which greatly speeds up computation over large numbers of rows.

  • Adaptivity. The SQL engine chooses local, remote, or distributed execution according to the distribution of the data a statement operates on, and selects the appropriate plan based on cost. For distributed execution, computation is pushed down to the nodes as much as possible according to the cost calculation. The transaction layer uses local transactions wherever possible and reduces distributed transactions to improve performance.

  • Shared worker threads and asynchronous execution. Unlike traditional databases, OceanBase does not dedicate a worker thread to each session; worker threads are shared among multiple sessions. In addition, asynchronous operations are used at several stages of statement execution and transaction commit, minimizing unnecessary waiting and making full use of the CPU.

In addition to the above, we have done a great deal of detailed work. Overall, the effect is obvious: in 2017, "Double Eleven Singles Day" peaked at 42 million operations per second, the user experience was smooth, and system performance was very stable.
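The shared worker thread idea in the last point can be illustrated with a standard thread pool. This sketch uses Python's `concurrent.futures` only to show the general pattern of many sessions sharing a few workers; the session count, the pool size, and the `handle` function are assumptions for this example and say nothing about how OceanBase schedules its threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(session_id):
    """Process one request and report which worker thread served it."""
    return session_id, threading.current_thread().name


# 100 sessions share 4 worker threads instead of owning one thread each.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(100)))

threads_used = {name for _, name in results}
print(len(results), "requests served by", len(threads_used), "threads")
```

The number of threads stays bounded by the pool size no matter how many sessions are connected.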

Low cost

On the one hand, OceanBase must meet the business's requirements for a database; on the other hand, it must also save costs, including not only operating costs but also business migration costs and staff learning costs.

The cost advantage of OceanBase mainly comes from the following three points:

  • Lower computing cost, through performance improvements along the execution path;

  • Lower storage cost, through the advantages of the read-write separation architecture. In one real case, an internal business migrated from Oracle to OceanBase and its data shrank from 100 TB to 30 TB. OceanBase can use compression algorithms with higher compression ratios, whereas Oracle updates in place and must balance performance, so it cannot use high-ratio compression algorithms that consume too much CPU;

  • Higher machine utilization, through the multi-tenant architecture. Businesses with different load types can be deployed together in multi-tenant mode to make full use of machine resources (see: Multi-tenancy mechanism overview). In practice, the financial cost of the OceanBase database is only 1/5 to 1/10 of the original per-account cost.


High compatibility

The above features make OceanBase competitive, but migrating a business from its original system to OceanBase requires one more feature: compatibility. High compatibility allows system migration to take place at a controlled cost.

For the company's internal MySQL businesses, OceanBase can achieve zero-change migration. It can be said that OceanBase is compatible with the common features of mainstream relational databases, not only at the syntax level but also in terms of user experience and business experience.

Since 2017, OceanBase has been serving external customers, and we have found that this level of compatibility is not enough. We need to go one step further: not just compatible in common functions but fully compatible; not just compatible with MySQL but also compatible with commercial databases. Only then can external businesses migrate to OceanBase with zero modification. This is exactly the work we are doing now.

Third, the future outlook

Next, OceanBase 2.0 will be released soon, and technical talks on its new features will be held in the DBAplus community. The new version is a qualitative improvement over the 1.X versions in distributed capabilities and SQL functionality. We sincerely hope that OceanBase 2.0 will make the distributed architecture transparent to the business and make it easier for businesses to get better database services.


Q1: Does OceanBase support complex queries such as multi-table joins, groupings, and window functions?

A: Yes, they are supported.

Q2: Does Paxos guarantee multi-copy consistency?

A: Yes.

Q3: How does the OceanBase query optimizer compare to Oracle? Is there a problem with SQL performance degradation caused by Oracle execution plan changes?

A: There is still a gap between our optimizer and Oracle's. We also need to solve the plan stability problem.

Q4: Is the data in each zone different, and are there three copies in each zone?

A: Usually every zone holds the same data, with one replica of each piece of data in each zone.

Q5: How many replicas does OceanBase use, how are replicas synchronized efficiently, and how is read/write efficiency maintained across multiple replicas?

A: Strongly consistent reads can only read from the primary replica.

Q6: How is online DDL achieved?

A: Through multi-version schemas.

Q7: If a query involves multiple tables, and the sub-table information cached by the assigned MergeServer is incomplete, does the MergeServer have to issue multiple requests to complete it?

A: Since 1.0 there is no MergeServer; all nodes are full peers. Queries involving multiple nodes generate a distributed plan.

Q8: Is data synchronized across regions in real time?

A: It depends on the distribution of the replicas and the network latency.

Q9: Does Alibaba have any special optimizations for extremely hot data in the Double Eleven Singles Day scenario? How does the performance of other mainstream databases compare?

A: For example, releasing locks early. The performance is enough to meet the demands of the big promotion.

Q10: What happens if a MergeServer fails while it is fetching data from a ChunkServer?

A: These roles are no longer distinguished. If a node fails, the operation is retried.

Q11: What does Zone mean in OceanBase? Is it a shard?

A: Zone means availability zone; you can think of it as a sub-cluster.

Q12: Does OceanBase support different storage engines for OLAP and OLTP systems, or is it supported by the same engine?

A: A single storage engine.

Q13: Does OceanBase have no sharding function? It provides high availability, city-level disaster recovery, and online expansion, and the data may not be on the same node, so is the data sharded?

A: The data is partitioned, and the partitioning rules are defined by the user, similar to Oracle partitions.

Q14: How do you ensure that every read returns the latest or consistent data? How is the Paxos instance read?

A: Strongly consistent reads read only the primary replica.

Q15: How is data copied between replicas? By physical log, logical log, or some other method?

A: Physical log.
