Original address: http://www.infoq.com/cn/articles/vip-11-11-peak-system-practice
Unlike other online shopping sites, Vipshop positions itself as "a site dedicated to special sales", with a business model of "brand discounts + limited-time snap-ups + authenticity guarantee", that is, the "flash sale" model. It gives consumers a one-stop shopping experience, with discounts as low as 10% of the original price and new limited-time sales launched every day.
This limited-time flash-sale business model means the site must handle high-concurrency, high-volume user requests at all times. After each new brand launches, buyers flood in to snap up its goods, generating enormous traffic. When a hot item goes on sale, concurrent access spikes and the whole site comes under load: responses are delayed and, in severe cases, services go down.
In addition, Vipshop operates many sales models, such as self-operated sales, JIT, direct shipping, and overseas purchasing, and these models make the system very complex. Externally, the system connects to supplier systems and third-party logistics systems through an open platform. Internally, it comprises many systems: supplier management, merchandise selection, trading, payment, logistics and warehousing, customer service, goods distribution, and so on. These functional modules are strongly interrelated and their logic is complex, so quickly meeting the needs of fast business growth is a pressing problem.
To keep the system working under high concurrency and heavy traffic, and to give it strong scalability, our design focuses on the following aspects:
- Effective segmentation of system modules
- Service decoupling and centralized service governance
- Increased asynchronous processing
- Multi-level caching to reduce back-end pressure
- Optimized database access
- Enhanced system monitoring
Effective segmentation of system modules
Although the whole business system had already been split into several relatively independent subsystems, such as the trading platform (B2C), VIS, WMS, TMS, ODS, CS, and EBS, these systems remained seriously coupled in practice. Whenever new business logic arrived, every module needed extensive changes, and the development teams argued over where business logic belonged, wasting a great deal of time and lowering development efficiency. The root cause was unreasonable module partitioning, which left the boundaries between modules unclear. Our architecture team therefore looked at the system as a whole, combed through the entire process, and repositioned each subsystem, physically separating the business subsystems to reduce their mutual dependencies. Each subsystem can now be deployed independently; when a problem occurs, the faulty module can be quickly isolated, minimizing the impact of the failure.
Service decoupling and centralized service governance
Service-oriented design has been proven by mainstream e-commerce systems to be a practical direction. Through SOA transformation we separated service consumers from service providers, decoupling the systems from one another while keeping each system highly cohesive. This greatly reduces system complexity, improves scalability and extensibility, and meets the needs of rapid business development.
How do we manage these services effectively?
Venus is a self-developed, Spring-based Java development framework built to reduce development complexity, improve developer productivity, raise code quality, and standardize the development process.
The Venus framework covers the following elements:
- Database access layer encapsulation, with support for sharding by database and by table, and connection pool optimization
- Cache (Redis/Memcached) interface encapsulation and connection pool optimization
- Automatic generation of CRUD service code (including database operations)
- OSP/REST service invocation interface encapsulation, with asynchronous support
- Validation
- Automatic generation of unit/integration test template code
- Configuration center
- Documentation center
The Venus ecosystem
The main objective of the Open Service Platform (OSP) is to provide a service-centric remote invocation mechanism. Contract-based service interfaces keep the decoupling between systems clear and clean; a communication and protocol layer based on Thrift ensures high performance; services are automatically registered and discovered, making deployment easy; with the configuration center, service configuration can be updated dynamically; and the separation of the client from governance logic greatly simplifies service access. OSP also provides rich service governance capabilities, such as routing, load balancing, service protection, and graceful degradation, through which flow control is effectively implemented.
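OSP's own API is not shown in the article, so here is a minimal, self-contained sketch of the contract-first idea it describes: consumer and provider share only an interface, never each other's implementation classes. All names (`OrderService`, `ServiceRegistry`) are hypothetical, and the toy in-process registry stands in for OSP's network-level registration, discovery, and Thrift transport.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Contract-first service interface: consumer and provider share only this.
interface OrderService {
    String getOrderStatus(long orderId);
}

// Toy in-process "registry" standing in for automatic service registration
// and discovery (the real OSP speaks Thrift over the network).
class ServiceRegistry {
    private static final Map<Class<?>, Object> services = new ConcurrentHashMap<>();
    static <T> void register(Class<T> contract, T impl) { services.put(contract, impl); }
    @SuppressWarnings("unchecked")
    static <T> T lookup(Class<T> contract) { return (T) services.get(contract); }
}

public class ContractDemo {
    public static void main(String[] args) {
        // Provider side: register an implementation of the contract.
        ServiceRegistry.register(OrderService.class, id -> id % 2 == 0 ? "PAID" : "CREATED");
        // Consumer side: look up by contract only; there is no compile-time
        // dependency on the provider's implementation class.
        OrderService orders = ServiceRegistry.lookup(OrderService.class);
        System.out.println(orders.getOrderStatus(10001L));
    }
}
```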
Service traffic distribution
First, the OSP proxy provides soft load balancing, so the system needs no hardware load balancer and can distribute service requests evenly across the service hosts. In addition, OSP supports configurable service routing: requests can be dispatched to different versions of a service, which makes gray releases easy to implement.
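As an illustration of how routing by version enables a gray release, here is a toy weighted router (not OSP's actual implementation; hosts, versions, and the 95/5 split are made-up values):

```java
import java.util.List;
import java.util.Random;

// Toy weighted router illustrating configurable service routing:
// e.g. send 5% of traffic to v2 during a gray release.
public class GrayRouter {
    record Upstream(String host, String version, int weight) {}

    private final List<Upstream> upstreams;
    private final int totalWeight;
    private final Random random = new Random();

    public GrayRouter(List<Upstream> upstreams) {
        this.upstreams = upstreams;
        this.totalWeight = upstreams.stream().mapToInt(Upstream::weight).sum();
    }

    // Weighted random selection: each host is chosen in proportion
    // to its configured weight.
    public Upstream pick() {
        int r = random.nextInt(totalWeight);
        for (Upstream u : upstreams) {
            r -= u.weight();
            if (r < 0) return u;
        }
        return upstreams.get(upstreams.size() - 1);
    }

    public static void main(String[] args) {
        GrayRouter router = new GrayRouter(List.of(
                new Upstream("10.0.0.1", "v1", 95),
                new Upstream("10.0.0.2", "v2", 5)));  // 5% gray traffic
        System.out.println(router.pick());
    }
}
```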
Service rate limiting
When system traffic reaches its limit, an automatic circuit-breaker mechanism kicks in. When the service or its surrounding environment (such as the network) fails, the breaker proactively disconnects clients from the service, preventing the failure from spreading while the service recovers. At intervals, the breaker lets a small number of requests through to probe whether the service has recovered; if it has, the service is again made available to client calls. The circuit breaker protects services and reduces the need for human intervention. The relevant thresholds are configured in the configuration center and take effect dynamically when modified. Rate limiting must be used with caution, with a policy that distinguishes normal access from malicious requests; it must never discard legitimate user requests. If malicious requests cannot be distinguished, applications should be ranked so that the highest-priority ones remain accessible, for example access to all online product information, while access to off-line (delisted) product information can be rate-limited according to the requested capacity.
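A minimal sketch of the breaker behaviour described above (closed, open, then half-open probing); in the real system the threshold and cool-down values would come from the configuration center rather than constructor arguments:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal circuit-breaker sketch (CLOSED -> OPEN -> HALF_OPEN).
// Thresholds are illustrative; the real ones live in the config center.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;   // how long to stay open before probing
    private final AtomicInteger failures = new AtomicInteger();
    private volatile State state = State.CLOSED;
    private volatile long openedAt;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public boolean allowRequest() {
        if (state == State.OPEN) {
            // After the cool-down, let a single probe request through.
            if (System.currentTimeMillis() - openedAt >= openMillis) {
                state = State.HALF_OPEN;
                return true;
            }
            return false;  // fail fast, protecting the troubled service
        }
        return true;
    }

    public void onSuccess() {        // probe succeeded: close the breaker
        failures.set(0);
        state = State.CLOSED;
    }

    public void onFailure() {
        if (state == State.HALF_OPEN || failures.incrementAndGet() >= failureThreshold) {
            state = State.OPEN;      // trip: disconnect clients from the service
            openedAt = System.currentTimeMillis();
            failures.set(0);
        }
    }
}
```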
Nginx Rate Limiter is a self-developed anti-abuse tool: a Lua script plugin that implements HTTP access control inside Nginx, built to keep the system running stably under the high traffic of promotions and similar events.
Nginx Rate Limiter is configured and inspected through a RESTful API. A global switch can be set, and multiple rate-limit rules can be added for specified URLs, including a global limit. Each rule uses one of the following methods (a minimal illustrative sketch follows the list):
- A maximum request rate; requests beyond it are dropped
- Requests are dropped proportionally
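The real tool applies these policies inside Nginx via Lua; purely to illustrate the first policy (admit at most a given average rate, drop the excess), here is a token-bucket sketch in Java:

```java
// Token-bucket sketch of the "maximum rate, drop the excess" policy.
// Purely illustrative; the production tool is a Lua plugin inside Nginx.
public class TokenBucket {
    private final double ratePerSecond;  // refill rate
    private final double capacity;       // burst allowance
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    public TokenBucket(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill tokens in proportion to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;   // admit the request
        }
        return false;      // over the limit: drop the request
    }
}
```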
Service degradation
For an e-commerce system, to guarantee the user experience with limited resources, we must keep the critical systems stable. By defining different degradation strategies for different business levels, functions outside the core flow can be shut off by policy according to system pressure, achieving service degradation. For example, online product information must be served with priority, while for off-line (delisted) products we can limit access capacity but still allow the off-line product detail page to be viewed.
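A sketch of policy-driven degradation switches, where non-core features are turned off by name as pressure rises; in the real system these flags would live in the configuration center and update dynamically, and the feature names here are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of degradation switches: non-core features are shut off by
// policy while the core flow keeps running. Flag storage and names
// are illustrative; real flags would come from the config center.
public class DegradationSwitch {
    private static final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    public static void set(String feature, boolean enabled) { flags.put(feature, enabled); }
    public static boolean isEnabled(String feature) { return flags.getOrDefault(feature, true); }

    public static void main(String[] args) {
        set("recommendations", false);  // shed a non-core feature under pressure
        if (isEnabled("recommendations")) {
            // render the recommendation module as usual
        } else {
            // core flow continues; the page simply omits recommendations
            System.out.println("recommendations degraded");
        }
    }
}
```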
Increased asynchronous processing
For operations with low real-time requirements that take a long time to execute, asynchronous processing improves the caller's performance and responsiveness. In particular, notifying non-critical flows asynchronously speeds up the response of the main business flow. The asynchronous mechanism also acts as a buffer: each notified downstream system controls its intake according to its own capacity, avoiding overload, keeping the system running stably, and increasing overall availability.
Distributed asynchronous message queue servers ensure that messages are not lost after a crash, and the asynchronous system has a retry mechanism, which improves availability, robustness, and scalability.
After a user places an order, systems such as logistics, supplier, distribution, and finance need the order details and order status, so the order system notifies them via asynchronous messages. Asynchronous calls isolate and decouple the systems and separate upstream and downstream business logic: downstream systems only need to parse the asynchronous messages, without depending on the upstream system's business logic, which reduces inter-system dependencies. Even if a downstream system fails, the order system can still process data normally.
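The article does not name the message middleware the order system uses; purely to illustrate the pattern, here is a sketch using Kafka's standard producer API (the broker address, topic name, and JSON payload shape are all assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: the order system publishes an "order created" event so logistics,
// supplier, distribution and finance systems can each consume it at their
// own pace. Kafka, the topic, and the payload are illustrative assumptions.
public class OrderEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"orderId\":10001,\"status\":\"CREATED\",\"userId\":42}";
            // Key by order id so all events for one order stay in order.
            producer.send(new ProducerRecord<>("order-events", "10001", event));
        }
        // Downstream systems subscribe to "order-events" independently;
        // a failure there never blocks the order system itself.
    }
}
```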
Multi-level caching to reduce back-end pressure
1. Static/dynamic separation
Static content reduces back-end pressure. On one hand, static resources are cached in the user's browser, with expiration controlled via Cache-Control. On the other hand, the CDN caches content: for the product detail page, to improve cache efficiency, the page can be made pseudo-static, with the URL suffix shown as .html, and static information such as the product description is cached at the CDN node nearest the user on first access. To further improve CDN efficiency, product images are pushed to the CDN in advance. Dynamic data on the page, such as product promotion information, inventory, and size charts, is loaded dynamically, avoiding unnecessary back-end access.
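As a small illustration of the browser-cache half of this, a servlet filter might stamp a long-lived Cache-Control header on static resources; the filter itself and the one-week max-age are assumptions for illustration, not the article's implementation:

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

// Sketch: stamp a long-lived Cache-Control header on static resources so
// browsers (and the CDN) can serve them without hitting the back end.
// The 7-day max-age is an arbitrary illustrative value.
public class StaticCacheFilter implements Filter {
    @Override public void init(FilterConfig filterConfig) {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setHeader("Cache-Control", "public, max-age=604800"); // 7 days
        chain.doFilter(req, res);
    }

    @Override public void destroy() {}
}
```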
2. Distributed cache
We introduced a distributed cache with centralized management of the cache service nodes. The cache cluster scales elastically, responding to changing access load by dynamically adding or removing nodes, and a redundancy mechanism provides high availability with no single point of failure, so a server failure does not cause a cache outage or data loss. Applications access the cache servers through a unified API.
Through the distributed cache we can cache application objects, database query results, and session state as the application scales out.
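A minimal sketch of such a unified cache API with a cache-aside read path (look in the cache first, fall back to the database on a miss, then populate the cache); the in-memory map stands in for the Redis/Memcached cluster, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a unified cache API with a cache-aside read path.
// The ConcurrentHashMap stands in for a Redis/Memcached cluster.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database;   // loader invoked on a cache miss

    public CacheAside(Function<K, V> database) {
        this.database = database;
    }

    public V get(K key) {
        // computeIfAbsent also collapses concurrent misses for the same key,
        // limiting the cache-penetration pressure discussed later.
        return cache.computeIfAbsent(key, database);
    }
}
```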
3. Judicious use of the application server's local cache
Distributed caching effectively relieves access pressure, but because the cache servers sit at different network locations and in different data centers, rising traffic can cause I/O and bandwidth bottlenecks. Therefore, configuration data and global data that essentially never change can be cached locally on the application server, reducing the peak impact on the back-end cache servers. Local caches must be used sparingly: heavy use means the same data is stored on many different nodes, wasting considerable memory.
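A sketch of such a small local cache; a size-bounded LRU map built on the standard LinkedHashMap is one simple way to hold rarely-changing configuration data in-process (the eviction size would be tuned per node):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a small in-process LRU cache for rarely-changing configuration
// or global data, taking peak load off the distributed cache. Keep it
// small: every application node holds its own copy of this data.
public class LocalLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LocalLruCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict the least-recently-used entry
    }
}
```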
Caching brings many performance benefits, but used unreasonably it not only fails to help but becomes a burden on the system and even a source of risk. Caching is not recommended for data that is modified frequently, data with no access hotspots, or data with strict consistency requirements.
Optimized database access
Under high concurrency and large data volumes, database access has always been a painful bottleneck. If database access has performance problems, the whole system suffers.
To address this, we optimize database access for high concurrency in the following ways:
- Optimize complex queries and improve query efficiency; find the key modules with slow queries and optimize them, for example by reducing joins between tables, rebuilding the relevant indexes, and tuning WHERE clauses.
- On the basis of implementing the required functionality, minimize the number of database accesses: use query parameters to minimize accesses to tables and keep result sets small, reducing the network burden. Where operations can be separated, handle them separately to improve each response. Select only the columns actually needed, reducing database I/O pressure.
- Given that e-commerce systems read far more than they write, use read/write splitting with one master and multiple slaves: writes go only to the master database and reads go to the slaves, greatly relieving pressure on the master.
- Split database tables vertically along business and system lines. Treat the database as a set of "data blocks" (tables), cut these blocks apart vertically, and spread them across multiple database hosts, dispersing the access pressure per unit time. Vertical partitioning follows the principle of keeping closely related business and tightly coupled tables together.
- After vertical partitioning, analyze each table's data volume and growth rate to decide whether horizontal partitioning is needed. Core data such as orders is partitioned horizontally: a consistent hashing algorithm over the user ID spreads order data evenly across the database partitions, and when the application queries, it hashes the user ID to find the corresponding partition (see the sketch after this list). Peak database access pressure is thereby spread across partitions, the database scales linearly, and its carrying capacity improves greatly.
- Use the distributed cache; cache reads are far faster than database access. When the application reads data, it first looks in the cache; if found, the data is used directly, otherwise it is fetched from the database and placed in the cache. At design time, guard against the penetration pressure that hits the database when cache entries expire.
- Allow a degree of data redundancy. For key modules, to prevent dependence on other modules from hurting the current module's performance and reliability, moderately duplicate those modules' key data, reducing the performance loss and reliability pressure caused by calling their services.
- Use NoSQL databases to store and process large amounts of data.
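The consistent-hashing sketch referenced above: the user ID is hashed onto a ring of virtual nodes, and an order lives in whichever database partition owns that point, so adding a partition only remaps a fraction of users. The shard names, the MD5-based hash, and the 100 virtual nodes are illustrative choices, not the article's exact algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of consistent-hash shard selection over the user id.
public class OrderShardRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public OrderShardRouter(String[] shards, int virtualNodes) {
        // Virtual nodes smooth the distribution of keys across shards.
        for (String shard : shards)
            for (int i = 0; i < virtualNodes; i++)
                ring.put(hash(shard + "#" + i), shard);
    }

    public String shardFor(long userId) {
        // Walk clockwise on the ring to the first virtual node at or
        // after the key's hash; wrap around if none is found.
        SortedMap<Long, String> tail = ring.tailMap(hash(Long.toString(userId)));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Use 4 bytes of the digest as a non-negative ring position.
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                 | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        OrderShardRouter router = new OrderShardRouter(
                new String[]{"order_db_0", "order_db_1", "order_db_2"}, 100);
        System.out.println(router.shardFor(42L));   // e.g. order_db_1
    }
}
```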
Enhanced system monitoring
Business systems typically consist of many distributed components: Web components, RPC service components, cache components, message components, and database components. When a front-end request arrives from a browser or mobile client, it passes through many business and system components, each leaving footprints and log information. But logs scattered across components and hosts are not conducive to troubleshooting and locating root causes. This is the scenario for an application performance monitoring system, which collects, aggregates, and analyzes the log information to effectively monitor the system's health and problems. Through the monitoring information, the cause of a problem can be clearly located so developers can fix it promptly.
Vipshop has three levels of monitoring: system/network-level monitoring, application-level monitoring, and business-level monitoring.
System/network-level monitoring mainly tracks the following indicators: server metrics such as CPU, memory, disk, traffic, and TCP connections; and database metrics such as QPS, master-slave replication delay, processes, and slow queries.
Business-level monitoring uses two methods. The first is embedding tracking points in specified pages; the second is extracting the data to be monitored from the business systems' databases, performing the necessary analysis and processing, and storing it in a dedicated monitoring database. The monitoring data is then displayed through browser pages, which also provide filtering and aggregation over various time dimensions. Key business metrics include PV, UV, product impressions, logins/registrations, conversion rate, shopping carts, order volume, payment volume, shipment volume, and per-warehouse order data. Alarm ranges can be customized to notify the relevant people so they can respond.
Mercury, the application-level monitoring system, is Vipshop's self-developed application performance monitoring platform. By embedding probe logic in applications (a sketch of the probe idea follows the function list below), it monitors application code, relational databases, and cache systems in real time. By collecting and reporting logs, Mercury obtains performance indicators instantly and performs intelligent analysis, promptly discovering the performance problems, anomalies, and errors of the distributed application system and providing a convenient, reliable basis for resolving performance and program issues. Through Mercury's data display platform, users can easily view 360-degree monitoring information for their applications.
The main functions Mercury provides are:
- Locating slow calls: slow Web services (including RESTful services), slow OSP services, and slow SQL
- Locating errors: 4xx and 5xx HTTP errors, OSP errors
- Locating exceptions: Error exceptions and Fatal exceptions
- Showing dependencies and topology: domain topology, service topology, trace topology
- Tracing the call chain: the end-to-end invocation path, together with the context information attached to the call, exception log information, and the elapsed time at each call point, presented to the user
- Application alarms: indicator data is scanned against the alarm rules set by operations; when a rule is violated, the alarm is reported to Vipshop's central alarm platform
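The sketch of the probe idea mentioned above: wrap a call, measure its elapsed time and outcome, and report both for aggregation. Mercury's real probes are injected into the application rather than hand-written like this, and they report through the log pipeline described below; the names here are illustrative.

```java
import java.util.concurrent.Callable;

// Sketch of the basic idea behind an APM probe: time a call, record its
// outcome, and ship both to a collector for aggregation and alerting.
public class TimingProbe {
    public static <T> T trace(String callName, Callable<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            T result = call.call();
            report(callName, (System.nanoTime() - start) / 1_000_000, "OK");
            return result;
        } catch (Exception e) {
            report(callName, (System.nanoTime() - start) / 1_000_000, "ERROR");
            throw e;   // the probe observes but never swallows failures
        }
    }

    private static void report(String call, long elapsedMillis, String status) {
        // Stand-in for shipping a metric/log entry to the collector.
        System.out.printf("%s %dms %s%n", call, elapsedMillis, status);
    }
}
```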
The Mercury architecture consists of the following major modules:
After logs are transferred from the client to the server, they take two paths. On the first path, the raw logs (trace logs/exception logs) go through Flume into Kafka and then land directly in HBase; these raw logs are used to query trace call-chain information and exception logs. On the second path, the log information is sent via Kafka directly into Spark Streaming; after Spark's analysis, performance-metric data points are generated and written to OpenTSDB through Flume. The most important thing in the whole pipeline is to ensure that the consumed data is neither lost nor backlogged.
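As a toy stand-in for the metrics path, the following sketch consumes log events from Kafka and aggregates a per-key count, as the real pipeline does at much larger scale in Spark Streaming before writing data points to OpenTSDB; the broker address, topic, and group names are assumptions:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Toy metrics aggregator: consume log events from Kafka and count them
// per call name. Illustrative only; the real path runs in Spark Streaming.
public class MetricsAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "metrics-aggregator");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> counts = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("trace-logs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    String key = r.key() == null ? "unknown" : r.key();
                    counts.merge(key, 1L, Long::sum);   // aggregate per call name
                }
                // In the real pipeline these aggregates become OpenTSDB data points.
            }
        }
    }
}
```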
Once the operations staff have configured alarm rules, the alarm module can trigger alarm actions, and the alarm information escalates the fault to the central alarm platform immediately.
Conclusion
The points above summarize some of our experience with high-concurrency systems. We keep improving the system in a sustained effort to make Vipshop bigger and stronger, and we hope this share brings you something of value.
"Turn" only 11.11: Peak system Coping practices