Large-Scale Leaderboard System: Practice and Challenges


Copyright notice: This is an original article by Tangcon; please credit the source when reprinting.
Original article link: https://www.qcloud.com/community/article/154

Source: Tengyun https://www.qcloud.com/community

Leaderboards appeal to people's urge to compare and show off, and almost every product includes one. Many large businesses among SNG's value-added products, such as QQ Membership, QQ Animation, Penguin Esports, and game tournaments, have a strong demand for leaderboards; the growth and expansion of Penguin Esports and other businesses in particular have placed new requirements and challenges on our leaderboard system. Over the past year the leaderboard system has grown from nothing to serving more than 20 business lines, from QQ Membership alone to Penguin Esports, QQ Animation, and others. The number of leaderboards has grown from a handful to tens of thousands, a single leaderboard holds up to 90 million users, and the storage cluster serves hundreds of millions of active users. Along the way, the leaderboard system has faced the following challenges:

    • How do we support region-local ("nearest") access for businesses, with low latency?

    • How do we support automated application and scheduling for tens or even hundreds of thousands of leaderboards?

    • How do we reduce machine cost and choose the right storage engine?

    • How do we keep businesses from preempting each other's resources and affecting one another?

The following sections discuss in detail how we currently address these challenges in practice.

I. Basic Architecture of the Leaderboard System

Before discussing how we solve these challenges, let's take a brief look at the basic architecture of the leaderboard system and how businesses connect to and use it. The basic architecture is shown in the figure. The current architecture was not designed up front; it evolved and was continuously optimized around our business scenarios, requirements, and growth. It consists mainly of the following services:

    • Access service (stateless; provides all the RPC interfaces businesses use to query and modify leaderboard data, such as querying a rank or fetching the top N)

    • Storage service (stateful, master/standby; stores the leaderboard data)

    • API server (provides interfaces for applying for leaderboards, business onboarding, leaderboard configuration, storage capacity, etc.)

    • Scheduling service (selects, from the storage machines, those that satisfy the conditions of a business's request, and allocates the best one by score)

    • Agent (reports storage capacity information, storage service monitoring data, etc.)

    • ZooKeeper (stores leaderboard routing data, storage node capacity data, etc.; why we chose ZooKeeper is explained in Section II)

    • MySQL (stores business onboarding information, per-leaderboard configuration, user volume, storage service core parameter monitoring, etc.)

To use the leaderboard system, a business is first assigned a business ID and then applies for leaderboard IDs. The business server calls the leaderboard access service through L5; every access service interface carries the business ID and leaderboard ID. The access service looks up ZooKeeper with the business ID and leaderboard ID to find the storage service instance for that leaderboard, operates on the leaderboard data through the storage service API, and returns the result to the business server.
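As a rough illustration of this routing flow, the sketch below shows how an access service might resolve a (business ID, leaderboard ID) pair to a storage instance through ZooKeeper, with a local cache in front. The znode path layout, the cache TTL, and the client library (kazoo) are assumptions for illustration, not the system's actual implementation.

```python
import json
import time
from kazoo.client import KazooClient

# Hypothetical znode layout: /rank/route/<business_id>/<leaderboard_id>,
# holding JSON such as {"host": "10.0.0.12", "port": 6379}.
ROUTE_PATH = "/rank/route/{biz}/{rank}"

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

_cache = {}  # (biz, rank) -> (expire_ts, route)

def resolve_storage(biz_id, rank_id, ttl=30):
    """Resolve the storage instance for a leaderboard, with a local cache
    so that ZooKeeper is only queried on a cache miss."""
    key = (biz_id, rank_id)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]
    data, _stat = zk.get(ROUTE_PATH.format(biz=biz_id, rank=rank_id))
    route = json.loads(data)
    _cache[key] = (time.time() + ttl, route)
    return route
```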

What is the L5 mentioned above? L5 (Load Balancer; the 5 refers to Level 5, i.e. 99.999% availability) is a fault-tolerant system with load balancing and overload protection. When a business service accesses another service through L5, it is assigned an identity (modid, cmdid) that maps to several backend server ip:port pairs, and the business machines must deploy the L5 agent. L5's biggest drawback is that routing calls through it requires code changes: before each network call you fetch an ip:port through the L5 agent API, and after the call you report the latency and return code.
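The calling convention looks roughly like the sketch below. Since L5 is an internal system, the function names and signatures here (get_route, update_route) are hypothetical stand-ins; only the get-address, call, report-result loop is taken from the description above.

```python
import time

def call_via_l5(l5_agent, modid, cmdid, do_rpc):
    """Hypothetical sketch of the L5 calling convention: fetch a backend,
    issue the call, then report latency and return code so L5 can perform
    fault tolerance and load balancing."""
    ip, port = l5_agent.get_route(modid, cmdid)   # hypothetical agent API
    start = time.time()
    ret = -1
    try:
        ret = do_rpc(ip, port)
        return ret
    finally:
        latency_ms = (time.time() - start) * 1000
        # Hypothetical reporting call; the real agent API differs.
        l5_agent.update_route(modid, cmdid, ip, port, ret, latency_ms)
```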

II. How do we support region-local access for businesses, with low latency?

Our department has many products, deployed in different regions: some businesses run in Shenzhen, some in Shanghai. The intranet ping latency between the Shenzhen and Shanghai data centers is about 30 ms, which some businesses cannot tolerate. As a supporting platform, we aim to keep the average latency of every interface within 5 ms, which means avoiding cross-city access as much as possible. How? Regional autonomy, of course. Early on, with few businesses and few leaderboards, only the access service and storage service machines were deployed per region; the leaderboard routing data was stored only in Shenzhen, and a routing lookup crossed cities to the Shenzhen routing storage cluster only on a LocalCache miss. Latency still met business needs, so we had not fully implemented region-local access. However, as businesses onboarded rapidly and the number of leaderboards grew quickly, service quality in the Shanghai region degraded, latency fluctuated, and alarms fired frequently, especially on LocalCache misses, so we had to bring the full regional autonomy plan forward.

Our business scenario determines which storage solution we choose to address the high latency caused by cross-city access.

A brief analysis of our business scenario (requests on the business's core path):

    • What data is stored, and how much? Routing data and storage node capacity data, estimated in the hundreds of thousands of entries; keys and values are small (a table-structured store would also work).

    • Read/write ratio? Almost read-only, with very few writes (a few per minute; routing configuration is written when a leaderboard is applied for).

    • What is our CAP choice? Keep the data in each region as consistent as possible, lose no data, and stay highly available (for example, one node going down must not affect reads). When a network partition occurs (say, the Shenzhen-Shanghai link is cut), the minority side of the cluster (the Shanghai region) may degrade to read-only mode.

The common open-source options, with the advantages and disadvantages of each, are as follows:

    • MySQL (binlog replication): stable, friendly SQL interface, replication mode configurable per business scenario; but system availability is not as good as etcd or ZooKeeper.

    • etcd (Raft): strong consistency, high availability, fairly mature, widely used by large projects such as Kubernetes; but still under rapid development, and our team had not used it in production.

    • ZooKeeper (Zab): high availability, high read performance, used by our team in production for years, with a companion web configuration system; but deployment, maintenance, and the C API are less friendly than etcd's.

Weighing the pros and cons above against our business needs, we chose ZooKeeper as the core store for routing and storage node capacity data. We then faced a choice: deploy separate clusters in Shenzhen and Shanghai, or a single cluster. With one cluster per city, our plan was to make the Shenzhen cluster the primary for writes: after a write succeeds on the primary, the operation log (create/set) is written to an MQ that supports at-least-once semantics, and a consumer asynchronously applies it to the Shanghai cluster. Because create/set/del are idempotent interfaces, a consumer whose write to Shanghai fails due to network jitter or interruption can retry indefinitely, guaranteeing eventual consistency between the two clusters. However, given the very low write rate of our scenario, the fact that the Shanghai ZooKeeper members can switch to read-only mode during a network partition, and our ZK proxy cache and local cache, we ultimately chose a single cluster. In production we deployed 7 ZooKeeper nodes across 4 IDCs in Shenzhen and Shanghai; ZooKeeper's own Zab protocol synchronizes data between the regions, and each region is fully autonomous. The overall deployment is shown in Figure 4, and business call latency after deployment in Figure 5 (under 2 ms).
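For illustration, a ZooKeeper client in the Shanghai region might be configured roughly as follows, pointing at region-local members and allowing read-only connections so lookups keep working during a partition. The host list and znode path are placeholders; kazoo's read_only flag corresponds to the ZooKeeper read-only mode mentioned above.

```python
from kazoo.client import KazooClient

# Placeholder host list: the Shanghai-local members of the single
# 7-node Shenzhen/Shanghai cluster described above.
SHANGHAI_HOSTS = "zk-sh-1:2181,zk-sh-2:2181,zk-sh-3:2181"

# read_only=True lets the client keep reading from a minority-side member
# during a Shenzhen/Shanghai partition (the servers must also enable
# ZooKeeper's readonlymode for this to take effect).
zk = KazooClient(hosts=SHANGHAI_HOSTS, read_only=True)
zk.start()

data, stat = zk.get("/rank/route/1001/20001")  # region-local read
```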


III. How do we support applications for tens or even hundreds of thousands of leaderboards?

Much like the regional-autonomy, low-latency deployment scheme, the current scheduling system is the result of continuously optimizing the leaderboard application process, which has gone through three eras:

    • Stone Age: when the system first went online, a handful of leaderboards, configured manually.

    • Bronze Age: dozens of leaderboards, approved manually through a web interface.

    • Iron Age: tens of thousands of leaderboards; the scheduling service does automated capacity planning, filtering and scoring to pick the optimal storage node, with no manual intervention.

As mentioned above, the leaderboard system offers two ways to apply for a leaderboard: submit an application form on the web management platform, or apply in real time through the API server's API. The former suits businesses that apply for leaderboards infrequently; the latter suits businesses that need large numbers of leaderboards.

So how did we design and implement the scheduling service? Analyzing our business scenario: on the application form a business generally fills in the deployment region (Shenzhen or Shanghai), estimated user count, request volume, leaderboard type, whether to deploy in containers, storage engine type, and other condition parameters. The core job of the scheduling service is therefore to select, from the available storage nodes, the candidates that satisfy the conditions of the application, score the candidates according to some policy, and allocate the node with the best score.

Based on this scenario we designed and implemented the scheduling service; its flow is shown in Figure 6 and consists of two stages, filtering and scoring. The filter stage runs a series of modules according to configuration, such as a health check module, a tag matching module, and a capacity matching module. The health check module verifies the liveness of every candidate node; liveness is tracked through ZooKeeper ephemeral nodes, which are deleted automatically when a machine goes down. The tag matching module is flexible and provides powerful filtering, for example by deployment region, physical machine versus container deployment, or storage engine type. The capacity matching module checks whether a candidate node is already full and whether its remaining capacity can host the current leaderboard.

How many leaderboards can a container with a few GB of memory support? First, for the total number of users a container can hold, we derive an empirical upper bound from production data. Second, for capacity planning we use two strategies. One is a hard limit, suited to businesses that can accurately predict a leaderboard's user count: if the business requests this kind of allocation, the specified number of users is actually pre-allocated. The other is a soft limit, for businesses that cannot predict their user count: the scheduling service computes the average user count over the business's historical leaderboards, and if that average is below the business's configured minimum user threshold, the threshold is pre-allocated instead. This is the default strategy.

Candidate nodes that pass the filter stage move on to the scoring stage, which supports multiple scheduling algorithms, such as least-resource scheduling (pick the node with the most remaining memory) and multi-factor weighted scheduling (assign different weights to the requested user volume, CPU, and memory). The leaderboard is ultimately placed on the highest-scoring node; see the sketch after this paragraph. The scheduling data is stored in ZooKeeper, and every storage machine runs an agent that periodically reports its capacity information to the ZooKeeper cluster.
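A minimal sketch of this filter-then-score pipeline might look like the following; the node fields, the filter predicates, and the weights are illustrative assumptions, not the production modules.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Illustrative fields an agent might report to ZooKeeper.
    alive: bool          # ephemeral znode still present
    tags: set            # e.g. {"region:shanghai", "deploy:container", "engine:redis"}
    mem_free_gb: float
    cpu_free: float
    users_free: int      # remaining pre-allocatable user capacity

def filter_nodes(nodes, want_tags, want_users):
    """Filter stage: health check, tag match, capacity match."""
    return [n for n in nodes
            if n.alive
            and want_tags <= n.tags          # all requested tags present
            and n.users_free >= want_users]  # enough remaining capacity

def score(n, w_mem=0.6, w_cpu=0.4):
    """Scoring stage: a simple weighted mix of free memory and free CPU."""
    return w_mem * n.mem_free_gb + w_cpu * n.cpu_free

def schedule(nodes, want_tags, want_users):
    """Return the highest-scoring node that passes all filters, or None."""
    candidates = filter_nodes(nodes, want_tags, want_users)
    return max(candidates, key=score) if candidates else None
```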

IV. How do we reduce machine cost and choose the right storage engine?

Which storage engine should hold the leaderboard data? Analyzing the basic leaderboard operations, we have: query a user's rank/score, update a user's score, query the top N, delete a user, and so on. Some businesses need all of these interfaces; others only need a subset (for example, update a user's score and fetch the top N). Our candidate engines are the memory-based store Redis and disk-based stores such as LevelDB and RocksDB. Redis offers a variety of rich data structures, among which the sorted set (zset) fully covers our interface requirements. The zset's core data structure is a hash table plus a skip list: the hash table holds each user's score, and the skip list maintains the rank ordering of the scores, so updating a user's score and querying a user's rank are both O(log N), making it suitable for businesses with high performance requirements. LevelDB and RocksDB, on the other hand, only provide a key/value interface; why can they still serve some business scenarios (those that never query a user's rank)? Figure 8 shows the LevelDB architecture: LevelDB stores keys in sorted order, and SSDB supports the Redis zset structure by encoding the score into the key according to certain rules. However, SSDB's rank query is O(N), so in production it is only suitable for businesses that never query a user's rank but do need the leaderboard's top N (N is typically at most 200).
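For reference, the basic leaderboard operations map directly onto Redis sorted-set commands, as in this minimal sketch (the key name and scores are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
key = "rank:demo"  # made-up leaderboard key

r.zadd(key, {"alice": 3200, "bob": 2750})        # set/update user scores
r.zincrby(key, 150, "bob")                       # add 150 to bob's score

rank = r.zrevrank(key, "alice")                  # 0-based rank, highest first; O(log N)
score = r.zscore(key, "alice")                   # alice's current score
top10 = r.zrevrange(key, 0, 9, withscores=True)  # top 10 with scores
r.zrem(key, "bob")                               # delete a user
```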

Therefore, the first way to reduce cost is to choose the right storage engine per business, on demand, based on its features and interface needs. Businesses that never query a user's rank can use the SSDB (LevelDB) disk storage engine.
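The score-in-key encoding that makes top-N queries work on an ordered key/value store can be sketched as follows; this illustrates the general technique, not SSDB's actual key format.

```python
def rank_key(rank_name, score, user, width=12):
    """Encode the score into the key so that a lexicographic scan of the
    ordered KV store returns entries in score order. Inverting the score
    (MAX - score) makes an ascending key scan yield descending scores,
    so 'top N' is just the first N keys under the prefix."""
    MAX = 10 ** width - 1
    return f"r:{rank_name}:{MAX - score:0{width}d}:{user}"

# A forward scan over the prefix "r:demo:" yields the leaderboard from
# highest to lowest score; querying one user's *rank*, however, still
# requires counting the keys ahead of it, which is O(N).
print(rank_key("demo", 3200, "alice"))  # r:demo:999999996799:alice
```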

Now for the second way to reduce cost. As the number of leaderboards grew, the Redis storage machines grew with them; storage machines are scarce, take a long time to requisition, and are costly. Analyzing the leaderboards in production, we found that some businesses are clearly periodic: activity and tournament leaderboards see heavy traffic during an online promotion and almost none after the event ends, yet the leaderboard cannot simply be emptied. For these businesses the leaderboard system provides a hot/cold separation mechanism that migrates cold data from Redis memory to SSDB (LevelDB) on disk, freeing precious memory, raising machine utilization, and saving cost.

The hot/cold separation scheme is shown in Figure 9. An agent collects traffic statistics for every leaderboard, and a daily job checks whether any leaderboard matches the migration policy (for example, more than 10,000 users and traffic below a threshold for the last two weeks). If a leaderboard qualifies, a migration task recording the leaderboard's metadata is generated and written to the MQ. The migration service polls the MQ; when it finds a task, it backs up the full leaderboard data to an SSDB (LevelDB) storage node, then shrinks or empties the Redis copy of the leaderboard to release memory. Figure 10 shows the cold/hot data ratio analysis; cold data accounts for up to 17%.
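The daily eligibility check might look roughly like this sketch; the user threshold comes from the example policy above, while the traffic threshold, the stats shape, and the queue interface are assumed stand-ins.

```python
import json
import time

USER_MIN = 10_000      # example threshold from the policy above
TRAFFIC_MAX = 100      # assumed: max requests over the window
WINDOW_DAYS = 14       # "nearly two weeks" from the policy above

def find_cold_leaderboards(stats):
    """stats: iterable of dicts shaped like
    {"rank_id": ..., "users": ..., "requests_14d": ...} (assumed)."""
    for s in stats:
        if s["users"] > USER_MIN and s["requests_14d"] < TRAFFIC_MAX:
            yield s["rank_id"]

def enqueue_migrations(stats, mq_produce):
    """mq_produce is a stand-in for the MQ producer; the real system
    records richer metadata with each migration task."""
    for rank_id in find_cold_leaderboards(stats):
        task = {"rank_id": rank_id, "action": "redis_to_ssdb",
                "created_at": int(time.time())}
        mq_produce(json.dumps(task))
```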

V. How do we keep businesses from preempting each other's resources?

With more and more businesses onboard, resource contention between their leaderboards, and the resulting mutual interference, became an issue we could no longer ignore. For example, one business applying for a large number of leaderboards in Shanghai could hit the region's storage capacity limit, causing every other business's Shanghai leaderboard applications to fail. To eliminate this interference, the leaderboard system implements business resource quotas and resource isolation.

The container-based resource isolation scheme is shown in Figure 11. The container engine is Docker. The network mode is host mode: no performance loss, simple, and controllable. Data volumes are host-mapped, so data survives container restarts and crashes; we are also evaluating the distributed file system Ceph as a follow-up. For the image storage driver we chose AUFS, which ships with the company's TLINUX2 operating system. AUFS is a poor fit for containers that write files heavily, since the first write incurs a copy-up and multi-layer branch lookup, but our workload reads and writes container image files very little, and sticking with the stock driver also minimizes exposure to potential kernel bugs in the image storage layer. Figure 12 compares the pros and cons of the various image storage drivers.

The registry is the company's shared internal repository, and the Docker daemon is the Docker 1.9.1 build maintained by the internal Gaia team. The benefit of an internally maintained version is that it is already used by many other businesses in the company and is more stable than the latest upstream Docker. We also feed bugs and feature requests back to the Gaia team, who fix them quickly and merge the upstream features we need. For example, the two Docker daemon bugs we hit earlier have been fixed or worked around in their build:
https://github.com/docker/docker/pull/22932
https://github.com/docker/containerd/pull/265


Figure 12: advantages and disadvantages of image storage drivers (from the official Docker documentation)

Compared with the earlier mixed deployment, after the container scheme went live each onboarding business only needs a rough estimate of its future total leaderboard user volume and expected resource usage. The scheduling service filters an optimal node from the container storage machines according to its policy and dynamically creates a container of some number of GB for that business's exclusive use; all of the business's leaderboards are then scheduled into the container. If the container runs out of resources, we can either grow its resource limits online or add containers, with one business mapping to multiple containers.
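Creating such a per-business container might look like the following docker-py sketch; the image name, resource sizes, and volume paths are placeholders, and the production system ran an internal Docker 1.9.1 build, so its tooling differed.

```python
import docker

client = docker.from_env()

# Placeholder values: one dedicated Redis container per business, with
# host networking, a host-mapped data volume, and cgroup resource limits.
container = client.containers.run(
    image="redis:3.2",                  # placeholder image
    name="rank-biz-1001",               # hypothetical per-business name
    network_mode="host",                # host mode: no NAT overhead, as above
    mem_limit="8g",                     # cgroup memory limit for the business
    cpuset_cpus="0,1",                  # pin to specific cores
    volumes={"/data/rank-biz-1001":     # host path survives container restarts
             {"bind": "/data", "mode": "rw"}},
    detach=True,
)
```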

Finally, a summary of the advantages and disadvantages of the Docker container resource isolation scheme.

Advantages:

    • Business resources are isolated: cgroups limit memory, CPU, and other resources, and the limits can be adjusted online based on the business's actual data volume and request volume.

    • Deployment is simplified: from a single image, any storage machine can quickly spin up multiple Redis containers, exploiting multi-core machines while keeping each business isolated.

    • Machine resource utilization improves: under the old one-master-one-standby physical machine model, the standby's load was very low; with containerized deployment, master and standby containers can be mixed on the same machine, and the scheduler only has to ensure that a master and its own standby never land on the same machine.

Disadvantages:

    • All container processes share the same kernel; if one container triggers a kernel bug, the resulting kernel crash takes down every container process on the machine. Production practice should therefore absorb the industry's best practices: hot-synchronize data between master and standby containers, back up data regularly, and strengthen monitoring and disaster recovery.

    • With older Docker versions, when the Docker daemon exits, all container processes die with it. Docker 1.12 added the --live-restore flag; when it is specified, containers keep running after the daemon shuts down: https://github.com/docker/docker/pull/23213

VI. Summary

In solving the problems above, the leaderboard system has gradually become a highly available, low-latency, low-cost leaderboard solution (covering automated onboarding, scheduling, disaster recovery, resource isolation, monitoring, capacity expansion and shrinking, hot/cold data separation, and more). Next we will strengthen the system's self-healing capabilities: fully automatic master/standby failover for instances with little write traffic (for write-heavy instances, because Redis replication is asynchronous, master_repl_offset and the slave offset can differ by hundreds of thousands); Ceph RBD data volumes, so that even if both master and standby containers die, the monitoring system can dynamically create new containers, mount the Ceph RBD volume, and rebuild the Redis instance; and, more importantly, online migration at the granularity of a single leaderboard.
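A safety check before automatic failover could be as simple as comparing these offsets from Redis's replication info, as in this sketch (the lag threshold is an arbitrary example value):

```python
import redis

def replication_lag(master_host, master_port=6379):
    """Return the byte lag between the master's replication offset and
    its most-lagged slave, read from INFO replication."""
    info = redis.Redis(host=master_host, port=master_port).info("replication")
    master_off = info["master_repl_offset"]
    slave_offs = [v["offset"] for k, v in info.items()
                  if k.startswith("slave") and isinstance(v, dict)]
    return master_off - min(slave_offs) if slave_offs else None

# Example policy: only fail over automatically when the standby is
# nearly caught up (the threshold here is an arbitrary example).
lag = replication_lag("10.0.0.12")
safe_to_failover = lag is not None and lag < 1000
```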

