Discussion on the algorithm of mass user integration ranking

Source: Internet
Author: User

This article is for reference

Http://www.cnblogs.com/weidagang2046/archive/2012/03/01/massive-user-ranking.html#!comments

Problem

A large amount of user sites, users have points, points may be used in the process of updating at any time. Now you want to design an algorithm for the site that displays its current point rank each time the user logs in. The maximum user size is 200 million, the integral is a nonnegative integer and less than 1 million.

PS: It is said that this is the thunder of an interview problem, but the problem itself has a strong authenticity, so this article intends to follow the real scene to consider, and not limited to the ideal environment for interview questions.

Storage structure

First, we use a User integration table User_score to save the user's points information.

Table structure:

Sample data:

The following algorithm is based on this basic table structure.

Algorithm 1: Simple SQL query

First, it's easy to think of a simple SQL statement to query the number of users who have more points than that user's points:

select 1 + count(t2.uid) as rankfrom user_score t1, user_score t2where t1.uid = @uid and t2.score > t1.score

For number 4th users we can get the following results:

Algorithm features

Advantages: Simple, the use of SQL functions, do not require complex query logic, nor introduce additional storage structure, small-scale or performance requirements of the application is a good solution.

Disadvantage: A full table scan of the User_score table is required, and it is also necessary to take into account the fact that if an integration update is to lock the table, performance is unacceptable in massive data size and high concurrency applications.

Algorithm 2: Uniform partitioning design

Caching is an important way to solve performance problems in many applications, and we will naturally want to be able to cache user rankings with memcached. But again, it seems that the cache does not seem to help, because the user ranking is a global statistical indicator, rather than the user's private property, other users of the change of points may immediately affect the user's ranking. However, the real application of the integral changes in fact there is a certain regularity, usually a user's points will not suddenly burst, the average user will always be in the low partition for a long time to slowly rise into the high partition, that is, the overall distribution of user integration is a section, We further note that the subtle changes in high-partition user points actually have little impact on the ranking of low-segmented users. So, we can think of the method of statistics by the integral section, introduce a partition integral table Score_range:

Table structure:

Data examples:

Indicates that the [From_score, To_score] interval has count users. If we divide a range by every 1000 points there are [0, 1000), [1000, 2000], ..., [999000, 1000000) These 1000 intervals, and later update the user's integral to update the interval value of the Score_range table accordingly. In the auxiliary of the Partition integration table to query the ranking of the users of the S, you can first determine its own interval, the value of the integral interval above s is accumulated, and then query the user's ranking in this interval, the two add to get the user's ranking.

At first glance, this method seems to reduce the amount of query computation through interval aggregation, but it is not. The biggest problem is how to query the user's ranking within this interval? If the integration condition is added to the SQL in algorithm 1:

select 1 + count(t2.uid) as rankfrom user_score t1, user_score t2where t1.uid = @uid and t2.score > t1.score and t2.score < @to_score

Ideally, due to limiting the range of T2.score to less than 1000, if the score field is indexed, we expect this SQL statement to significantly reduce the number of rows scanned by the User_score table by index. However, the real situation is not so, the range of T2.score within 1000 does not mean that the number of users in the interval is also 1000, because there are points in the same situation exists! The 28 law tells us that the first 20% of low partitions tend to focus on 80% of users, which means that for a large number of low-partition users in the interval ranking query performance is far less than for a few high-partition users, so in general, this partitioning method will not bring substantial performance improvement.

Algorithm features

Pros: Note the existence of the integration interval and the full table scan by pre-aggregation to eliminate the query.

Disadvantage: The non-uniform distribution of integral features makes the performance improvement is not ideal.

Algorithm 3: Tree-shaped partition design

The failure of the Uniform partition query algorithm is due to the non-uniformity of the integral distribution, then we will naturally think, can we press the 28 law to design the Score_range table as a non-uniform interval? For example, the Low division zoning dense point, 10 points a range, and then gradually become 100 points, 1000 points, 10000 points ... Of course, this is a method, but this kind of division has certain randomness, not easy to grasp, and the whole system of integral distribution will gradually change with the use, the first good partitioning method may become not adapt to the future situation. We hope to find a partitioning method which can adapt to the integral inhomogeneity and the change of the system integral distribution, which is the tree partition.

We can use [0, 1,000,000) as a first-class interval, then divide the first-level interval into two 2-level intervals [0, 500,000], [500,000, 1,000,000], and then divide the two-level interval into 4 3-level intervals [0, 250,000], [250,000, 500,000), [500,000, 750,000), [750,000, 1,000,000], and so on, we will eventually get 1,000,000 21 levels [0,1], [...] [999,999, 1,000,000). This is actually to organize the interval into a kind of balanced binary tree structure, the root node represents the first level interval, each non-leaf node has two sub-nodes, the left Dial hand node represents a low partition, and the right sub-node represents a high-score interval. The tree partition structure needs to maintain an invariant when updating (invariant): The count value of a non-leaf node is always equal to the sum of the count values of its left and right child nodes.

Later, each time the user integral has the change need to update the interval number and the integral change amount has the relation, the smaller the integral change the lower the interval level of the update. Overall, the number of intervals that need to be updated is the log (n) level of the user's integration variable, which means that if the user's integration changes at millions at one time, the number of update intervals is at level 20. In this tree-shaped partition integration table with the help of the query points for the user ranking of S, in fact, in the interval tree from top to bottom, from the coarse to the fine step to clear the location of the process of S. For example, for integral 499,000, we use a rank variable with an initial value of 0 to do the summation, first of all, it belongs to the 1-level interval of the left subtree [0, 500,000), then the user ranking should be in the right subtree [500,000, 1,000,000) the number of users count, We add the count value to the user rank variable, enter the next level interval, secondly, it belongs to the level 3 interval of [250,000, 500,000), this is the 2-level interval of the right subtree, so do not accumulate count to the rank variable, directly into the next level of the interval; again, it belongs to the 4-level interval ... Until the end we put the user's integral exactly in the 21-level interval [499,000, 499,001], the entire accumulation process is completed, the ranking!

Although the algorithm's updates and queries involve a number of operations, if we index the From_score and To_score of the interval, these operations are based on key queries and updates and do not produce a table scan and are therefore more efficient. In addition, the algorithm does not depend on relational data model and SQL operation, can easily be transformed into other storage methods such as NoSQL, while key-based operations are also easy to introduce caching mechanism to further optimize performance. Further, we can estimate that the number of tree intervals is approximately 2,000,000, considering the size of each node, the entire structure occupies only dozens of M space. Therefore, we can create an interval tree structure in memory, and initialize the interval tree in O (n) time through the User_score table, and then the query and update operations of the rankings can be done in memory. Generally speaking, the same algorithm, from the database to the memory algorithm performance improvement can often reach more than 10^5, therefore, the algorithm can achieve very high performance.

Algorithm features

Advantages: The structure is stable, not affected by the integral distribution; the complexity of each query or update is the O (log (n)) level of the integral maximum, and is independent of the size of the user and can handle a large amount of scale; not dependent on SQL, it is easy to transform into a nosql or memory data structure.

Cons: The algorithm is relatively more complex.

Algorithm 4: Integral rank array

Although the algorithm 3 has high performance, it achieves the complexity of O (log (n)) of the integral change, but the implementation is more complex. In addition, the complexity of O (log (n)) only shows its advantage when the n is very large, but the actual application of the integration is often not too large, when compared with O (n) algorithm often has no obvious advantage, or even slower.

With this in mind, a closer look at the specific impact of the integration changes on the rankings can be found that a user's points from S to s+n, points less than s or other users greater than or equal to s+n will not actually be affected, only the number of points in the [S,s+n] range of users will be reduced by 1 bits. We can use an array of size 1,000,000 to represent the corresponding relationship of integrals and rankings, where Rank[s] represents the rank of the integral S. When initialized, the rank array can be computed by the User_score table in the complexity of O (n). Queries and updates for user rankings are based on this array. Query points for the corresponding ranking directly return Rank[s], the complexity of O (1), when the user's integration from S to s+n, only need to rank[s] to rank[s+n-1] The value of n elements increased by 1, the complexity of O (n).

Algorithm features

Advantages: The integral rank array is simpler and easier to implement than the interval tree, and the rank query complexity is O (1), and the rank update complexity O (n), which is very efficient in the case of small integral changes.

Cons: When n is larger, you need to update a large number of elements, less efficient than algorithm 3.

Summarize

The above describes the user integration ranking of several algorithms, the algorithm 1 is simple and easy to understand and implement, suitable for small-scale and low concurrent applications; Algorithm 3 introduces more complex tree partition structure, but O (log (n)) complexity performance is superior, can be applied to massive scale and high concurrency, algorithm 4 uses simple rank array, Easy to implement, in the case of small changes in the performance of the algorithm 3. This problem is an open question, I believe there must be other excellent algorithms and solutions, welcome to explore!

Discussion on the algorithm of mass user integration ranking

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.