High-performance service-side optimization

Source: Internet
Author: User
Tags app service

Dada's CTO on the path to high-performance server-side optimization at a start-up

Contents

    • Business Scenarios
    • The initial technology selection
    • Read/write separation
    • Vertical database splitting
    • Horizontal database splitting (sharding)
    • Summary
Business Scenarios

Dada is a leading last-three-kilometre logistics delivery platform in China. Its business model resembles Didi's and Uber's: it crowdsources idle labour to solve real-time delivery over the last three kilometres. The business has two main parts, merchants placing orders and couriers delivering them.

Figure: Dada's business flow (https://static.oschina.net/uploads/img/201602/26150108_Bjdj.jpg)

Dada's business grew very quickly, from zero to nearly a million orders a day in about one year, which put enormous pressure on the backend. The pressure is of two kinds: read and write. Read pressure comes from couriers grabbing orders in the app and refreshing the list of nearby orders at high frequency: hundreds of millions of requests a day, with peak QPS in the thousands. Write pressure comes from merchants placing orders and from couriers accepting, picking up and completing them. Read pressure is far greater than write pressure: read requests outnumber write requests by about 30 to 1.

Figure: Daily traffic over the past six months (https://static.oschina.net/uploads/img/201602/26150119_r6ej.jpg)

Figure: Peak QPS over the past six months (https://static.oschina.net/uploads/img/201602/26150128_WclE.jpg)

As the business grows, its demands on technology rise, and the architecture has to be ready for them. Next, let's look at how Dada's backend architecture evolved.

The initial technology selection

For a start-up, the most important thing is agility: shipping the product and serving users quickly. We therefore chose public cloud services, which gave us fast delivery and scalability and saved the time it takes to build our own data centre. To respond quickly to business needs, we chose Python as the development language for the business systems and MySQL as the database. At this stage, the several large systems in the application tier all accessed a single database.

Figure: Initial architecture (https://static.oschina.net/uploads/img/201602/26150146_NMSw.jpg)

Read/write separation

As the business grew and traffic climbed, this setup soon could not meet performance requirements. Response times kept getting longer; for example, when a courier refreshed nearby orders in the app, the response time rose from about 500 milliseconds at first to more than 2 seconds. During peak hours the system even went down, and some merchants and couriers began to doubt the quality of our service. At this life-or-death moment, monitoring showed that during peak hours MySQL CPU usage was close to 80%, disk IO usage was close to 90%, and slow queries had grown from 100 to 10,000 a day, getting worse every day. The database had become the bottleneck, and we had to upgrade the architecture quickly.

Figure: Database QPS over one week (https://static.oschina.net/uploads/img/201602/26150238_yg4r.jpg)

When a web application service hits a performance bottleneck, we can scale it horizontally by adding machines, because the service itself is stateless. A database obviously cannot be scaled just by adding machines, so we adopted MySQL master-slave replication together with read/write separation in the application servers.

MySQL supports master-slave replication, which copies changes on the master to the slaves incrementally in near real time, and one master can feed multiple slaves. Using this feature, the application server classifies each request as a read or a write. If the request is a write, all DB operations within it go to the master; if it is a read, all DB operations within it go to a slave.

Figure: Architecture with read/write separation (https://static.oschina.net/uploads/img/201602/26150254_O0nr.jpg)
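The per-request routing described above can be sketched in a few lines of Python. This is a minimal illustration, not Dada's actual code: the class name `ReadWriteRouter`, the string stand-ins for database connections, and the rule "GET/HEAD means read" are all assumptions, since the article does not say how a request is classified.

```python
import itertools


class ReadWriteRouter:
    """Pick one database for a whole request, based on whether it reads or writes."""

    def __init__(self, master, slaves):
        self.master = master
        # Rotate through the slaves; with no slaves, everything goes to the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def choose(self, http_method):
        # Assumption: GET and HEAD requests only read; every other method writes.
        if http_method.upper() in ("GET", "HEAD") and self._slaves is not None:
            return next(self._slaves)
        return self.master


router = ReadWriteRouter("master-db", ["slave-1", "slave-2"])
print(router.choose("POST"))  # master-db
print(router.choose("GET"))   # slave-1
print(router.choose("GET"))   # slave-2
```

Deciding once per request, rather than per statement, keeps all queries of a write request on the master, so that request always sees its own writes.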

After read/write separation, database pressure dropped sharply: CPU usage and IO usage fell to 5%, and slow queries dropped to nearly zero. Master-slave replication with read/write separation brought us two main benefits:

    • Less pressure on the master (writes): Dada's load is dominated by reads, so once reads moved to the slaves, pressure on the master fell by a factor of several dozen.
    • Slaves (reads) scale horizontally: because the load is mainly read requests, when the slaves come under too much pressure we can simply add more slave machines to relieve it.

The following figures show database QPS after the optimization:

Figure: SELECT QPS on the master before and after read/write separation (https://static.oschina.net/uploads/img/201602/26150437_HBa3.jpg)

Figure: SELECT QPS on the slave after read/write separation (https://static.oschina.net/uploads/img/201602/26150453_sI9P.jpg)

Of course, no solution is a cure-all. Read/write separation relieved the pressure on MySQL for the time being, but it also brought new challenges. During peak hours, a merchant who had just placed an order could not see it in the "My Orders" list (a typical read-after-write problem), and occasionally the system raised exceptions because data could not be found. Monitoring showed that MySQL replication could lag during peak hours, in extreme cases by as much as 10 seconds.

How do we monitor replication status? On the slave, run SHOW SLAVE STATUS and look at the Seconds_Behind_Master value, which is how far the slave is behind the master, in seconds. If replication is keeping up, this value is 0. One important cause of MySQL replication delay is that the slave applies replicated changes serially in a single thread.
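As a rough sketch of how a monitoring script might interpret that value, the helpers below take one row of SHOW SLAVE STATUS as a Python dict (for example, from a dict-style cursor). The function names and the one-second threshold are illustrative assumptions, not taken from the article; fetching the row from a live server is left out so the example stays self-contained.

```python
def replication_lag_seconds(status_row):
    """Return slave lag in seconds, or None when the lag is unknown.

    MySQL reports Seconds_Behind_Master as NULL (None here) when the
    replication threads are not running, so None must not be read as 0.
    """
    value = status_row.get("Seconds_Behind_Master")
    return None if value is None else int(value)


def is_slave_fresh(status_row, max_lag_seconds=1):
    # max_lag_seconds is an illustrative threshold, not a value from the article.
    lag = replication_lag_seconds(status_row)
    return lag is not None and lag <= max_lag_seconds


print(is_slave_fresh({"Seconds_Behind_Master": 0}))   # True
print(is_slave_fresh({"Seconds_Behind_Master": 10}))  # False
```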

How can replication delay be avoided or reduced? We made the following optimizations:

    • Tune MySQL parameters, for example increase innodb_buffer_pool_size so that more operations complete in MySQL's memory and fewer hit the disk.
    • Use hosts with high-performance CPUs.
    • Run the database on physical hosts rather than virtual cloud hosts, to improve IO performance.
    • Use SSDs to improve IO performance. An SSD's random IO performance is about 10 times that of a SATA drive.
    • Optimize business code: operations that need up-to-date data read from the master.
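One way to express "this operation must read from the master" in application code is a context manager that pins reads to the master for its duration. The sketch below is hypothetical: `use_master` and `pick_database` are illustrative names, and the article does not describe how Dada implemented this.

```python
import contextvars
from contextlib import contextmanager

# Per-context flag: True while the current code path must read from the master.
_force_master = contextvars.ContextVar("force_master", default=False)


@contextmanager
def use_master():
    """Route reads to the master inside the with-block, then restore the old setting."""
    token = _force_master.set(True)
    try:
        yield
    finally:
        _force_master.reset(token)


def pick_database(is_write):
    # Writes always go to the master; reads go there only when pinned.
    if is_write or _force_master.get():
        return "master"
    return "slave"


print(pick_database(is_write=False))      # slave
with use_master():
    print(pick_database(is_write=False))  # master
```

A merchant's "My Orders" query right after placing an order is the kind of read that would be wrapped this way.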

Vertical database splitting

Read/write separation solves the read-pressure problem well: whenever read pressure grows, we can scale out by adding slaves. Write pressure, however, kept growing with the business and had no effective remedy. Merchants' order placement, for example, became slower and slower, which seriously hurt their experience. Monitoring showed that database writes were getting slower; an ordinary INSERT could take more than 1 second.

The figure below shows the load on the master. Disk IO utilization is very high: peak IO response time reaches 636 milliseconds and IO utilization reaches 100%.

Figure: Load on the master before the vertical split (https://static.oschina.net/uploads/img/201602/26150506_y7fj.jpg)

At the same time, the business was becoming more complex. Multiple application systems shared the same database, and a slow query in some small, non-core feature often affected core business functions on the master. We had one application that wrote its logs to MySQL. The log volume was very large, nearly 100 million rows, and the table's ID was a UUID. During one day's peak, the whole system suddenly slowed down, which led to an outage.

Monitoring showed that inserts into this table were very slow; they slowed down the whole MySQL master and then dragged down the entire system. (Logging to MySQL is not a good design, of course, so we later built a big-data log system. The UUID primary key was also a bad choice; ID generation is discussed in more depth in the section on horizontal splitting below.)

At this point the master had become the performance bottleneck, and we realized that we had to upgrade the architecture again and split the master, both to improve performance and to reduce interference between systems and so improve stability. This time we split the system vertically by business: the original large database was split into separate business databases, and each system accesses only its own business database, avoiding or reducing cross-database access.

Figure: Architecture after the vertical split (https://static.oschina.net/uploads/img/201602/26150516_R0Jw.jpg)

The figure below shows the load on the master after the vertical split. Disk IO utilization has dropped considerably: peak IO response time is within 2.33 milliseconds and IO utilization peaks at 22.8%.

Figure: Load on the master after the vertical split (https://static.oschina.net/uploads/img/201602/26150526_R6SF.jpg)

The future is bright, but the road is winding. The vertical split posed many challenges. The biggest was that cross-database JOINs are no longer possible, so existing code has to be refactored. Before the split, a query could simply join tables; after the split, the databases sit on different instances and cannot be joined. For example, the CRM system needs to find all orders for a merchant by the merchant's name. Before the vertical split, the merchant and order tables could be combined in one query, as follows:

SELECT * FROM tb_order WHERE supplier_id IN (SELECT id FROM supplier WHERE name = 'Shanghai Haidilao');

After the split, the code has to be refactored to first look up the merchant ID by merchant name and then query the order table by merchant ID, as follows:

supplier_ids = SELECT id FROM supplier WHERE name = 'Shanghai Haidilao'
SELECT * FROM tb_order WHERE supplier_id IN (supplier_ids)

The lessons of the vertical split led us to draw up SQL best practices. One of them is to avoid or minimize JOINs in application code and to assemble the data in the program instead, which keeps the SQL simple. This prepares the ground for further vertical splits, and it also avoids the poor performance of JOINs in MySQL.
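To make the two-step lookup concrete, here is a small runnable sketch in which two separate in-memory SQLite databases stand in for the two MySQL instances. The table names follow the article; the sample rows, the function name `orders_for_supplier_name` and the use of SQLite are assumptions made purely for illustration.

```python
import sqlite3

# Two independent databases: a JOIN across them is impossible, as after the split.
supplier_db = sqlite3.connect(":memory:")
order_db = sqlite3.connect(":memory:")

supplier_db.execute("CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT)")
supplier_db.executemany("INSERT INTO supplier VALUES (?, ?)",
                        [(1, "Shop A"), (2, "Shop B")])
order_db.execute("CREATE TABLE tb_order (id INTEGER PRIMARY KEY, supplier_id INTEGER)")
order_db.executemany("INSERT INTO tb_order VALUES (?, ?)",
                     [(10, 1), (11, 2), (12, 1)])


def orders_for_supplier_name(name):
    # Step 1: look up the merchant IDs in the supplier database.
    rows = supplier_db.execute("SELECT id FROM supplier WHERE name = ?", (name,))
    supplier_ids = [row[0] for row in rows]
    if not supplier_ids:
        return []  # avoid building an invalid "IN ()" clause
    # Step 2: query the order database with those IDs, one placeholder per ID.
    placeholders = ", ".join("?" for _ in supplier_ids)
    sql = ("SELECT id, supplier_id FROM tb_order "
           f"WHERE supplier_id IN ({placeholders}) ORDER BY id")
    return order_db.execute(sql, supplier_ids).fetchall()


print(orders_for_supplier_name("Shop A"))  # [(10, 1), (12, 1)]
```

Only the placeholders are interpolated into the SQL text; the ID values themselves are passed as bound parameters.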

After a week of intensive infrastructure tuning and business code refactoring, the vertical split of the database was finally complete. After the split, each application accesses only its own database. The single database became several, which spreads the write pressure on the master, and the resulting databases are independent of one another, so the businesses are isolated and no longer affect each other.

Horizontal database splitting (sharding)

Read/write separation relieves read pressure by scaling out slaves, and the vertical split relieves write pressure by splitting the master by business. Even so, the system still had the following risks:

    • Single tables keep growing. The order table, for example, will soon exceed 100 million rows, beyond what a single MySQL table handles well, which hurts read and write performance.
    • Write pressure on the core business database keeps growing, and that database cannot be split vertically any further. A MySQL master has no ability to scale horizontally.

Previously, system pressure forced us to upgrade the architecture. This time we needed to upgrade ahead of time and make the database horizontally scalable (sharding). Our business is similar to Uber's. Uber implemented horizontal sharding only in 2014, five years after the company was founded, but our growth required us to start sharding just 18 months after ours was founded. The logical architecture is as follows:

Figure: Logical architecture with horizontal sharding (https://static.oschina.net/uploads/img/201602/26150542_li6f.jpg)

The first question in horizontal sharding is what to split by. One option is to split by city, so that all data for one city lives in one database. The other is to spread the data evenly by order ID. The advantage of splitting by city is that the data is highly clustered, so aggregate queries are relatively simple. The disadvantage is that the data is unevenly distributed: some cities have far more data than others and become hot spots, and these hot spots may have to be split again later.

Splitting by order ID is the opposite. The advantage is that the data is evenly distributed, so no database ends up much larger or smaller than the others. The disadvantage is that the data is too scattered, which makes aggregate queries harder. For example, after splitting by order ID, one merchant's orders may be spread across different databases, so querying all orders for a merchant may mean querying several databases. One solution is to keep the data that needs aggregate queries in redundant tables that are not sharded, and to keep aggregate queries to a minimum during development.
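The cost of scattered data can be seen in a short sketch of a scatter-gather query. Each shard is modelled here as a plain list of dicts rather than a real database, and the names `orders_for_merchant`, `order_id` and `merchant_id` are illustrative assumptions.

```python
def orders_for_merchant(shards, merchant_id):
    """Collect one merchant's orders from every shard and merge them by order ID."""
    matches = []
    for shard in shards:
        # In a real system this loop body would be one query per shard database.
        matches.extend(o for o in shard if o["merchant_id"] == merchant_id)
    return sorted(matches, key=lambda o: o["order_id"])


shards = [
    [{"order_id": 1, "merchant_id": 7}, {"order_id": 4, "merchant_id": 9}],
    [{"order_id": 2, "merchant_id": 7}],
    [{"order_id": 3, "merchant_id": 9}],
]
print([o["order_id"] for o in orders_for_merchant(shards, 7)])  # [1, 2]
```

Every shard has to be visited even when it holds none of the merchant's orders, which is why the article suggests unsharded redundant tables for data that needs aggregate queries.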

After weighing the pros and cons, and with reference to the sharding schemes of Uber and other companies, we finally decided to shard by order ID. Architecturally, we divide the system into three layers:

    • Application layer: the various business application systems.
    • Data access layer: a unified data access interface that hides read/write separation, sharding, caching and other technical details from the application layer.
    • Data layer: the DB data is sharded, and shards can be added dynamically.

The key to horizontal sharding is the design of the data access layer, which consists of three main parts:

    • ID generator: generates the primary key for each table.
    • Data source routing: routes each DB operation to the right shard data source.
    • Caching: caches data in Redis to improve performance.

The ID generator is the core of the whole sharding scheme: it determines how data is split and how it is stored and retrieved. An ID must be globally unique across databases, otherwise it causes conflicts in the business layer. An ID must also be numeric and ascending, mainly because ascending IDs keep MySQL performing well. Finally, the ID generator must be very stable, because any failure affects all database operations.

Our ID generation strategy borrows from Instagram's ID generation algorithm. The scheme is as follows:

Figure: Layout of the 64-bit ID (https://static.oschina.net/uploads/img/201602/26150557_ekZW.jpg)

The ID is 64 bits long. The first 36 bits are a timestamp, which keeps IDs ascending. The middle 13 bits are the shard identifier, which identifies the database that holds the record for this ID. The last 15 bits are an auto-increment sequence, which keeps IDs from colliding within the same second. Each shard has an auto-increment sequence table; to generate an ID, we read the current value from that table, add 1, and use the result as the last 15 bits of the ID.
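The bit layout can be written down directly with shifts and masks. The sketch below follows the 36/13/15 split given above; the function names, the use of Unix-epoch seconds for the timestamp, and the wrap-around of the sequence modulo 2^15 are assumptions, since the article does not state the epoch or what happens when the sequence overflows.

```python
import time

TIMESTAMP_BITS, SHARD_BITS, SEQUENCE_BITS = 36, 13, 15  # 36 + 13 + 15 = 64


def make_id(shard_id, sequence, timestamp=None):
    """Pack timestamp (seconds), shard ID and sequence into one 64-bit integer."""
    ts = int(time.time()) if timestamp is None else timestamp  # assumed: Unix seconds
    if not 0 <= ts < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 36 bits")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id does not fit in 13 bits")
    # Assumed: the per-shard auto-increment value wraps around into 15 bits.
    seq = sequence % (1 << SEQUENCE_BITS)
    return (ts << (SHARD_BITS + SEQUENCE_BITS)) | (shard_id << SEQUENCE_BITS) | seq


def shard_of(id_value):
    """Recover the shard ID, which is what data source routing needs."""
    return (id_value >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)


order_id = make_id(shard_id=5, sequence=1, timestamp=1000)
print(shard_of(order_id))  # 5
```

Because the timestamp occupies the highest bits, an ID generated in a later second always compares greater than one from an earlier second, whatever the shard or sequence value, and `shard_of` lets the data access layer route a query from the ID alone.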

Summary

Running a start-up is a race against time. To meet business needs quickly, we began with simple and efficient choices, such as using cloud services and letting application services access a single database directly. Later, as system pressure grew, performance and stability gradually became concerns, and because the database is the component most prone to performance bottlenecks, we adopted read/write separation, vertical splitting and horizontal sharding. To achieve high performance and high stability, architecture upgrades should be completed ahead of need wherever possible; otherwise the system may slow down or even go down.
