Discussion on the algorithm of mass user integration ranking

Source: Internet
Author: User

The content of this article is excerpted from the first phase of the Code agricultural periodical: discussion on the algorithm of mass user integral ranking. Mainly describes several processing user points ranking algorithm, I hope we have some gains, thank you!

Problem
A large amount of user sites, users have points, points may be used in the process of updating at any time. Now you want to design an algorithm for the site that displays its current point rank each time the user logs in. The maximum user size is 200 million, the integral is a nonnegative integer and less than 1 million.
PS: It is said that this is the thunder of an interview problem, but the problem itself has a strong authenticity, so this article intends to follow the real scene to consider, and not limited to the ideal environment for interview questions.


Storage structure
First of all, we use a User integration table User_score to save the user's points information.

Table structure:

Sample data:

The following algorithm is based on this basic table structure.


Algorithm 1: Simple SQL query

First of all, it's easy to think of a simple SQL statement to query the number of users with points greater than that user's score:

Select 1 + count (t2.uid) as Rankfrom user_score T1, user_score t2where t1.uid = @uid and T2.score > T1.score

For number 4th users we can get the following results:

Algorithm features
Advantages: Simple, the use of SQL functions, do not require complex query logic, nor the introduction of additional storage structure, small-scale or performance requirements of the application is not high? A good solution.
Disadvantage: A full table scan of the User_score table is required, and it is also necessary to take into account the fact that the query has a lock on the table at the same time as the integration update, which is unacceptable in the case of massive data size and high concurrency applications.


Algorithm 2: Uniform partitioning design

Caching is an important way to solve performance problems in many applications, and we naturally wonder: can we cache the user rankings with memcached? But again? To find that the cache does not seem to help, because the user ranking is a global statistical indicator, rather than the user's private property, other users of the change of points may immediately affect the user's ranking. However, the real application of the integration of the change in fact there is a certain regularity, usually? A user's points will not suddenly burst, the average user will always be in the low partition for a long time to slowly rise into the high partition, that is, the overall distribution of user integration is a section, We further note that the subtle changes in high-partition user points actually have little impact on the ranking of low-segmented users. So, we can think of the method of statistics by the integral section, introduce a partition integral table Score_range:

Table structure:


Data examples:

Indicates that the [From_score, To_score] interval has count users. What if we divide by 1000? A range has [0, 1000), [1000, 2000], ..., [999 000,1 000 000) These 1000 intervals, and later updates to the user's integral update the interval values of the Score_range table accordingly. In the auxiliary of the Partition integration table to query the ranking of the users of the S, you can first determine its own interval, the value of the integral interval above s is accumulated, and then query the user's ranking in this interval, the two add to get the user's ranking. At first glance, this method seems to reduce the amount of query computation through interval aggregation, but it is not. The biggest problem is: How to query the user's ranking within this interval? If the integration condition is added to the SQL in algorithm 1:

Select 1 + count (t2.uid) as Rankfrom user_score T1, user_score t2where t1.uid = @uid and T2.score > T1.score and T2.sco Re

Ideally, due to limiting the range of T2.score to less than 1000, if the score field is indexed, we expect this SQL statement to significantly reduce the number of rows scanned by the User_score table by index. However, the real situation is not so, the range of T2.score within 1000 does not mean that the number of users in the interval is also 1000, because there are points in the same situation exists! The 28 law tells us that the first 20% of the low partition tends to focus on 80% of the users, which means that for a large number of low-partition users in the interval ranking query performance is far less than a few high-partition users ranked query, so in the general case this partitioning method does not bring substantial performance improvement.

Algorithm features
Pros: Note the existence of the integration interval and the full table scan by pre-aggregation to eliminate the query.
Disadvantage: The non-uniform distribution of integral features makes the performance improvement is not ideal.


Algorithm 3: Tree-shaped partition design

The failure of the Uniform partition query algorithm is due to the non-uniformity of the integral distribution, then we will naturally think, can we press the 28 law to design the Score_range table as a non-uniform interval? For example, the Low division zoning is dense, 10 points a range, and then gradually become 100 points, 1000 points, 10 000 points ... Of course, this is a method, but this kind of division has certain randomness, not easy to grasp, and the whole system of integral distribution will gradually change with the use, the first good partitioning method may become not adapt to the future situation. We hope to find a partitioning method which can adapt to the integral inhomogeneity and the change of the system integral distribution, which is the tree partition. We can use [0, 1 000 000] as a class level, then divide the first level interval into two 2 levels [0, 500 000), [500 000, 1 000 000), and then divide the two-class interval into 4 3-level intervals [0, 250 000), [250 000, 500 000 ), [500 000,750 000], [750 000, 1 000 000), and so on, we will eventually get 1000 000 levels of 21 [0,1], [...] [999 999, 1 000 000). This is actually to organize the interval into a kind of balanced binary tree structure, the root node represents the first level interval, each non-leaf node has two sub-nodes, the left Dial hand node represents a low partition, and the right sub-node represents a high-score interval. The tree partition structure needs to maintain an invariant when updating (invariant): The count value of a non-leaf node is always equal to the sum of the count values of its left and right child nodes.

Later, each time the user integral has the change need to update the interval number and the integral change amount has the relation, the smaller the integral change the lower the interval level of the update. Overall, the number of intervals that need to be updated is the log n level of the user's integration variable, that is, if user integration changes at millions, the number of update intervals is at level 20. In this tree-shaped partition integration table with the help of the query points for the user ranking of S, in fact, in the interval tree from top to bottom, from the coarse to the fine step to clear the location of the process of S. For example, for points 499 000, we add a rank variable with an initial value of 0, first, it belongs to the left subtree of the 1-level interval [0, 500 000), then the user rank should be in the right subtree [500 000, 1 000 000) the number of users count, We add the count value to the user rank variable, enter the next level interval, secondly, it belongs to the level 3 interval of [250 000, 500 000), which is the right subtree of the 2-level interval, so do not add count to the rank variable, directly into the next level of the interval; again, it belongs to the 4-level interval ... So reciprocating, until the end we put the user's integral exactly in the 21-level interval [499 000, 499 001), the whole accumulation process is completed, the ranking is drawn!

Although the algorithm's updates and queries involve a number of operations, if we index the From_score and To_score of the interval, these operations are based on key queries and updates and do not produce a table scan and are therefore more efficient. In addition, the algorithm does not depend on relational data model and SQL operation, can easily be transformed into other storage methods such as NoSQL, while key-based operations are also easy to introduce caching mechanism to further optimize performance. Further, we can estimate that the number of tree intervals is approximately 200000 000, considering the size of each node, the entire structure occupies only dozens of M space. Therefore, we can create an interval tree structure in memory, and initialize the interval tree in O (n) time through the User_score table, and then the query and update operations of the rankings can be done in memory. Generally speaking, the same algorithm, from database to memory algorithm performance can often reach 10 5 or more, so the algorithm can have very high performance.

Algorithm features
Advantages: The structure is stable, not affected by the integral distribution; the complexity of each query or update is the O (log n) level of the integral maximum, and is independent of the size of the user and can handle a large amount of scale; not dependent on SQL, easy to transform into nosql or memory data structures.
Cons: The algorithm is relatively more complex.


Algorithm 4: Integral rank array

Although the algorithm 3 has high performance, it achieves the complexity of O (log n) of integral change, but the implementation is more complicated. In addition, the complexity of O (log n) only shows its advantage when n is very large, but the actual application of the integration is often not too large, when compared with O (n) algorithm often has no obvious advantage, or even slower.

With this in mind, a closer look at the specific impact of the integration changes on the rankings can be found that a user's points from S to s+n, points less than s or greater than or equal to S+n of other user rankings are not actually affected, only the number of points in the [s, S+n] range of users will be reduced by 1 bits. We can use an array of size 100 000 000 to represent the correspondence between integrals and rankings, where Rank[s] represents the rank of the integral S. Initially, the rank array can be computed by the User_score table within the complexity of O (n). Queries and updates for user rankings are based on this array. Query points for the corresponding ranking directly return Rank[s], the complexity of O (1), when the user's integration from S to s+n, only need to rank[s] to rank[s+n-1] The value of n elements increased by 1, the complexity of O (n).

Algorithm features
Advantages: The integral rank array is simpler and easier to implement than the interval tree, and the rank query complexity is O (1), and the rank update complexity O (n), which is very efficient in the case of small integral changes.
Cons: When n is larger, you need to update a large number of elements, less efficient than algorithm 3.


Summarize

The above describes the user integration ranking of several algorithms, the algorithm 1 simple, easy to understand and implement, for small-scale and low concurrent applications; the algorithm 3 introduces a more complex tree partition structure, but O (log n) is superior in complexity, can be applied to massive scale and high concurrency, and the algorithm 4 is easy to implement by using simple rank array. , the performance is as great as algorithm 3 in the case of small integral change. This problem is an open question, I believe there must be other excellent algorithms and solutions, welcome to explore!

Discussion on the algorithm of mass user integration ranking

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.