An Algorithm for ranking massive user points

Source: Internet
Author: User

Problem
A massive user website has points, which may be updated at any time during use. Now we need to design an algorithm for this website to display its current point ranking every time a user logs on. The maximum number of users is 0.2 billion. The credits are non-negative integers and smaller than 1 million.

PS: It is said that this is a face-to-face question of thunder. However, the question itself is very authentic. Therefore, this article intends to consider it according to the actual scenario, not limited to the ideal environment of the face-to-face question.

Storage Structure
First, we use a user_score table to save the user's point information.

Table Structure:

 

Sample Data:

 

The following algorithm is based on this basic table structure.

Algorithm 1: simple SQL query
First, it is easy to use a simple SQL statement to query the number of users whose points are greater than the user points:

Select 1 + count (t2.uid) as rank
From user_score t1, user_score t2
Where t1.uid = @ uid and t2.score> t1.score
For user 4, we can get the following results:

 

Algorithm features

Advantage: It is simple. With the SQL function, no complicated query logic is required, and no additional storage structure is introduced, it is a good solution for small-scale or low-performance applications.

Disadvantage: The user_score table needs to be scanned for the whole table. You also need to take into account the query. If there is a point update, the table will be locked. In applications with massive data volumes and high concurrency, performance is unacceptable.

Algorithm 2: Uniform partitioning design
In many applications, caching is an important way to solve performance problems. We will naturally wonder if we can use Memcached to cache user rankings? However, it seems that the cache is not very helpful because the User ranking is a global statistical indicator, not the user's private attribute, other users' point changes may immediately affect their rankings. However, the point changes in real applications are also subject to certain rules. Generally, the points of a user do not suddenly increase or decrease, generally, a user will slowly upgrade to a high-score zone after a long time of Low-partition mixing. That is to say, the distribution of user points is generally segmented, we further noted that the minor changes in user points in High-score areas have little impact on the ranking of Low-score users. As a result, we can think of the statistical method by the integral segment to introduce a partition integral table score_range:

Table Structure:

 

Data example:

 

Indicates that there are count users in the [from_score, to_score) range. If we divide an interval by 1000 points, there will be [0, 1000), [1000,200 0 ),..., [999000,100 0000). In the future, you must update the range value of the score_range table for user credits. To query the ranking of users whose points are s with the help of the partition integral table, you can first determine the corresponding range and accumulate the count value of the point range higher than s, then, the ranking of the user in the current range is queried. the ranking of the user is obtained by adding the two.

At first glance, this method seems to reduce the query calculation amount through Range Aggregation, but it is not. The biggest question is how to query users' rankings in this range? Add the integral condition to the SQL statement in algorithm 1:

Select 1 + count (t2.uid) as rank
From user_score t1, user_score t2
Where t1.uid = @ uid and t2.score> t1.score and t2.score <@ to_score
Ideally, because the range of t2.score is limited to less than 1000, if you index the score field, we expect this SQL statement to greatly reduce the number of rows in the scanned user_score table through the index. However, this is not the case. The t2.score range is less than 1000, which does not mean that the number of users in this range is also 1000, because the same points exist here! The 20% law tells us that the first 80% of Low-partition users usually concentrate on of users, which means that the performance of intra-range ranking queries for a large number of low-partition users is far inferior to that of a few high-partition users, in general, this partitioning method will not bring about substantial performance improvement.

Algorithm features

Advantage: Pay attention to the existence of the credit interval, and eliminate full table scanning by pre-aggregation.

Disadvantage: the uneven distribution of credits makes Performance Improvement unsatisfactory.

Algorithm 3: Tree partition design
The failure of the even partition query algorithm is due to the non-uniformity of the integral distribution. Then we naturally wonder if we can design the score_range table as a non-even interval according to the law? For example, the low-score zoning is intensive, and the interval is 10 minutes and then gradually becomes 100 minutes, 1000 minutes, 10000 minutes... Of course, this is a method, but this method is random and difficult to grasp, and the integral distribution of the entire system will gradually change as it is used, the original better partitioning method may become unsuitable for the future. We hope to find a partition method that can adapt both to the non-uniformity of points and to the changes in the integral distribution of the system. This is a tree partition.

We can take [0, 1,000,000) as the first-level interval, and then divide the first-level interval into two-level intervals [0,500,000), [500,000, 1,000,000 ), divide the second-level interval into four three-level intervals [0,250,000), [250,000,500,000), [500,000,750,000), [750,000, 1,000,000), and so on, in the end, we will get 1,000,000 21-level intervals )... [999,999, 1,000,000 ). This actually organizes the interval into a balanced binary tree structure. The root node represents the first-level interval. Each non-leaf node has two subnodes, and the left subnode represents the low-score interval, the right subnode indicates the high-score interval. The tree-like partition structure requires that an Invariant be maintained when updating: The count value of a non-leaf node is always equal to the sum of the count values of its left and right subnodes.

The performance improvement from the database to the memory algorithm can often reach 10 ^ 5 or above. Therefore, this algorithm can achieve very high performance.

Algorithm features

Advantage: The structure is stable and is not affected by the integral distribution. The complexity of each query or update is at the O (log (n) level of the maximum integral value, and is irrelevant to the user scale; independent from SQL, it is easy to transform into a NoSQL or memory data structure.

Disadvantage: the algorithm is more complex.

Algorithm 4: integral ranking Array
Although algorithm 3 has high performance, it achieves the complexity of O (log (n) with integral variation, but its implementation is more complicated. In addition, the complexity of O (log (n) only shows its advantages when n is particularly large, and the point changes in practical applications are often not too great, at this time, it is often inferior to the O (n) algorithm, and may even be slower.

Taking this situation into consideration, we can carefully observe the specific impact of point changes on rankings and find that a user's points have changed from s to s + n, the ranking of other users whose points are less than s or greater than or equal to s + n is not affected, but the ranking of users whose points are within the range of [s, s + n) is reduced by one. We can use an array of 1,000,000 to indicate the correspondence between points and rankings. rank [s] indicates the ranking corresponding to points s. During initialization, the rank array can be calculated from the complexity of the user_score table in O (n. The query and update of user rankings are based on this array. Query the rank corresponding to the point s and return rank [s] directly. The complexity is O (1). When the user points change from s to s + n, you only need to increase the value of n elements from rank [s] to rank [s + n-1] by 1, and the complexity is O (n ).

Algorithm features

Advantages: the integral ranking array is simpler and easier to implement than the Interval Tree; the ranking query complexity is O (1); the ranking update complexity is O (n ), it is very efficient when points do not change much.

Disadvantage: WHEN n is large, a large number of elements need to be updated, which is less efficient than algorithm 3.

Summary
This section describes several algorithms for user point ranking. Algorithm 1 is simple and easy to understand and implement, and is suitable for small-scale and low-concurrency applications. algorithm 3 introduces a more complex tree partition structure, however, O (log (n) features superior complexity and can be applied to massive volumes and high concurrency. algorithm 4 uses a simple ranking array, which is easy to implement, the performance is no less than algorithm 3 in the case of few points. This is an open question. I believe there are other excellent algorithms and solutions!

 

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.