Discussion on "reprint" for the algorithm of mass user integration ranking

Source: Internet
Author: User

Content of this article
    • Problem
    • Storage structure
    • Algorithm 1: Simple SQL query
    • Algorithm 2: Uniform partitioning design
    • Algorithm 3: Tree-shaped partition design
    • Algorithm 4: Integral rank array

The question is exactly where it came from, not quite sure, and I saw it in a public number ~ The content of the article is more enlightening ~

Problem

A large amount of user sites, users have points, points may be used in the process of updating at any time. Now you want to design an algorithm for the site that displays its current point rank each time the user logs in. The maximum user size is 200 million, the integral is a nonnegative integer and less than 1 million.

PS: It is said that this is the thunder of an interview problem, but the problem itself has a strong authenticity, so this article intends to follow the real scene to consider, and not limited to the ideal environment for interview questions.

Storage structure

First, we use a User integration table User_score to save the user's points information.

Table structure:

Sample data:

The following algorithm is based on this basic table structure.

Algorithm 1: Simple SQL query

First, it is easy to think of a solution that uses a simple SQL statement to query the number of users with points greater than that user's points:

Select Count  as Rankfrom user_score T1, User_score T2
where  and T2.score > T1.score

For number 4th users we can get the following results:

Algorithm features

  • Advantages: Simple, the use of SQL functions, do not require complex query logic, nor introduce additional storage structure, small-scale or performance requirements of the application is a good solution.
  • Disadvantage: A full table scan of the User_score table is required, and it is also necessary to take into account the query and lock the table if an integration update occurs. performance is unacceptable in massive data size and high concurrency applications .
Algorithm 2: Uniform partitioning design

Caching is an important way to solve performance problems in many applications, and we naturally wonder if we can cache the user rankings with memcached. But again, the cache does not seem to help, because the user ranking is a global statistical indicator, rather than the user's private property, other users of the change of points may immediately affect the user's ranking. However, the real application of the integral changes in fact there is a certain regularity, usually a user's points will not suddenly burst, the average user will always be in the low partition for a long time to slowly rise into the high partition, that is, the distribution of user integration is a segment, We further note that the subtle changes in high-partition user points actually have little impact on the ranking of low-segmented users. So, we can think of the method of statistics by the integral section, introduce a partition integral table Score_range:

Table structure:

Data examples:

Indicates that the [From_score, To_score] interval has count users. If the interval is divided by 1000 points, then there are [0, 1000), [1000, 2000], ..., [999000, 1000000), a total of 1000 intervals, and later update the user integral to update the interval value of the Score_range table accordingly. In the auxiliary of the Partition integration table to query the number of points for the S user ranking, first determine its own interval, the value of the integral interval above s accumulated, and then query the user's ranking in this interval, the two add to get the user's ranking.

At first glance, this method seems to reduce the amount of query computation through interval aggregation, but it is not. The biggest problem is how to query the user's ranking within this interval? If the integration condition is added to the SQL in algorithm 1:

Select Count  as Rankfrom user_score T1, User_score T2
where  and  T2.score < @to_score

Ideally, due to limiting the range of T2.score to less than 1000, if the score field is indexed, we expect this SQL statement to significantly reduce the number of rows scanned by the User_score table by index. However, the real situation is not so, the range of T2.score within 1000 does not mean that the number of users in the interval is also 1000, because there are points in the same situation exists! The 28 law tells us that the first 20% of low partitions tend to focus on 80% of users, which means that for a large number of low-partition users in the interval ranking query performance is far less than for a few high-partition users, so in general, this partitioning method will not bring substantial performance improvement .

Algorithm features

  • Pros: Note the existence of the integration interval and the full table scan by pre-aggregation to eliminate the query.
  • Disadvantage: The non-uniform distribution of integral features makes the performance improvement is not ideal.
Algorithm 3: Tree-shaped partition design

The failure of the Uniform partition query algorithm is due to the non-uniformity of the integral distribution, then we will naturally think, can we press the 28 law to design the Score_range table as a non-uniform interval? For example, the Low division zoning dense point, 10 points a range, and then gradually become 100 points, 1000 points, 10000 points ... Of course, this is a method, but this kind of division has certain randomness, not easy to grasp, and the whole system of integral distribution will gradually change with the use, the first good partitioning method may become not adapt to the future situation. We hope to find a partitioning method which can adapt to the integral inhomogeneity and the change of the system integral distribution, which is the tree partition .

We can use [0, 1,000,000) as a first-class interval, then divide the first-level interval into two 2-level intervals [0, 500,000], [500,000, 1,000,000], and then divide the two-level interval into 4 3-level intervals [0, 250,000], [250,000, 500,000), [500,000, 750,000), [750,000, 1,000,000], and so on, we will eventually get 1,000,000 21 levels [0,1], [...] [999,999, 1,000,000). This is actually to organize the interval into a kind of balanced binary tree structure, the root node represents the first level interval, each non-leaf node has two sub-nodes, the left Dial hand node represents a low partition, and the right sub-node represents a high-score interval. The tree partition structure needs to maintain an invariant when updating (invariant): The count value of a non-leaf node is always equal to the sum of the count values of its left and right child nodes.

Later, each time the user integral has the change need to update the interval number and the integral change amount has the relation, the smaller the integral change the lower the interval level of the update. Overall, the number of intervals required to update each time is the log (n) of the user's integration variable, which means that if the user's integration changes at millions, the number of update intervals is at level 20. In this tree-shaped partition integration table with the help of the query points for the user ranking of S, in fact, in the interval tree from top to bottom, from the coarse to the fine step by stage to determine the location of S.

For example, if the integral is 499,000, the rank initial value is 0. First, it is located in the left sub-tree [0, 500,000) interval, at this point the user ranking is its right subtree [500,000, 1,000,000) the number of users count value accumulated to the user rank, and then, it is located in [250,000, 500,000), So do not add count to the rank variable, directly into the next level of the interval; again, it belongs to Level 4 ... Until the end we put the user's integral exactly in the 21-level interval [499,000, 499,001], the entire accumulation process is completed, the ranking!

Although the algorithm updates and queries involve several operations, indexing the from_score and to_score of the interval is based on key queries and updates that do not result in a table scan and are therefore more efficient. In addition, the algorithm does not depend on relational data model and SQL operation, can easily be transformed into other storage methods such as NoSQL, while key-based operations are also easy to introduce caching mechanism to further optimize performance. Further, we can estimate that the number of tree intervals is approximately 2,000,000, considering the size of each node, the entire structure occupies only dozens of M space. Therefore, we can create an interval tree structure in memory, and initialize the interval tree in O (n) time through the User_score table, and then the query and update operations of the rankings can be done in memory. Generally speaking, the same algorithm, from the database to the memory algorithm performance improvement can often reach more than 10^5, therefore, the algorithm can achieve very high performance.

Algorithm features

  • Advantages: The structure is stable, not affected by the integral distribution; the complexity of each query or update is the O (log (n)) level of the integral maximum, and is independent of the size of the user and can handle a large amount of scale; not dependent on SQL, it is easy to transform into a nosql or memory data structure.
  • Cons: The algorithm is relatively more complex.
Algorithm 4: Integral rank array

Although the algorithm 3 has high performance, it achieves the complexity of O (log (n)) of the integral change, but the implementation is more complex. In addition, the complexity of O (log (n)) only shows its advantage when the n is very large, but the actual application of the integration is often not too large, when compared with O (n) algorithm often has no obvious advantage, or even slower.

With this in mind, a closer look at the specific impact of the integration changes on the rankings can be found that a user's points from S to s+n, points less than s or other users greater than or equal to s+n will not actually be affected, only the number of points in the [S,s+n] range of users will be reduced by 1 bits. We can use an array of size 1,000,000 to represent the corresponding relationship of integrals and rankings, where Rank[s] represents the rank of the integral S. When initialized, the rank array can be computed by the User_score table in the complexity of O (n). Queries and updates for user rankings are based on this array. Query points for the corresponding ranking directly return Rank[s], the complexity of O (1), when the user's integration from S to s+n, only need to rank[s] to rank[s+n-1] The value of n elements increased by 1, the complexity of O (n).

Algorithm features

  • Advantages: The integral rank array is simpler and easier to implement than the interval tree, and the rank query complexity is O (1), and the rank update complexity O (n), which is very efficient in the case of small integral changes.
  • Cons: When n is larger, you need to update a large number of elements, less efficient than algorithm 3.
Summarize

The above describes the user integration ranking of several algorithms, the algorithm 1 is simple and easy to understand and implement, suitable for small-scale and low concurrent applications; Algorithm 3 introduces more complex tree partition structure, but O (log (n)) complexity performance is superior, can be applied to massive scale and high concurrency, algorithm 4 uses simple rank array, Easy to implement, in the case of small changes in the performance of the algorithm 3. This problem is an open question, I believe there must be other excellent algorithms and solutions, welcome to explore!

Discussion on "reprint" for the algorithm of mass user integration ranking

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.