Mr. Ruan on ranking entries


Ranking algorithm based on user voting (i): Delicious and Hacker News

Ruan Yifeng

The advent of the Internet brought an "information explosion".

Users no longer worry that there is too little information, but that there is too much. How to find the most important content quickly and effectively among a mass of information has become a core problem of the Internet.

Ranking algorithms of various kinds are currently one of the main means of filtering information. Ranking means ordering information by importance and updating that order promptly. The ordering can be based on characteristics of the information itself, or on user votes, that is, letting users decide which information ranks first.

Below, I will collate and analyze some ranking algorithms based on user voting. I plan to split the discussion into six parts; this is the first.

First, Delicious

The most intuitive and simplest algorithm is to rank items by the number of user votes they receive per unit of time. The items that get the most votes naturally rank first.

The old version of Delicious had a "top bookmarks leaderboard" that was compiled in exactly this way.

It ranked items by the "number of bookmarks in the last 60 minutes", and was recomputed every 60 minutes.

The advantages of this algorithm are that it is simple, easy to deploy, and refreshes content quickly. The disadvantages are, first, that ranking changes are not smooth: content that tops the list in one hour often plummets in the next. Second, it lacks a mechanism for automatically eliminating old items, so some popular content may occupy the top of the leaderboard for a long time.
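A minimal Python sketch of such a leaderboard, assuming each bookmark arrives as an (item, timestamp) pair. The function name `top_bookmarks`, the data model and the sample data are hypothetical, not Delicious's actual code.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_bookmarks(events, now, window=timedelta(minutes=60), limit=10):
    """Rank items by how often they were bookmarked in the `window` before `now`.

    `events` is an iterable of (item, saved_at) pairs (hypothetical data model).
    """
    cutoff = now - window
    counts = Counter(item for item, saved_at in events if cutoff < saved_at <= now)
    return counts.most_common(limit)


now = datetime(2012, 1, 1, 12, 0)  # arbitrary sample time
events = [
    ("a.com", now - timedelta(minutes=5)),
    ("a.com", now - timedelta(minutes=30)),
    ("b.com", now - timedelta(minutes=10)),
    ("c.com", now - timedelta(minutes=90)),  # outside the window, ignored
]
print(top_bookmarks(events, now))  # [('a.com', 2), ('b.com', 1)]
```

Because only the last window is counted, an item's count can drop to zero from one run to the next, which is the lack of smoothness described above.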

Second, Hacker News

Hacker News is an online community where users can post links or discuss a topic.

In front of each post there is an upward-pointing triangle; if you think the content is good, you click it to cast a vote. Based on the vote counts, the system automatically computes the ranking of top articles. However, the article with the most votes does not necessarily rank first: time is also taken into account, so that a new article can reach a good ranking more easily than an old one.

Hacker News is written in Arc, a language developed by Paul Graham, and its source code can be downloaded from arclanguage.org. Expressed as a mathematical formula, the ranking algorithm in that source code is:

$$\text{Score} = \frac{P - 1}{(T + 2)^{G}}$$

where

P is the number of votes the post has received; 1 is subtracted to ignore the poster's own vote.

T is the time since posting (in hours); 2 is added to keep the denominator from being too small for the newest posts (2 was possibly chosen because it takes about two hours on average for an article to travel from its original site, via other sites, to Hacker News).

G is the "gravity factor" (the "gravityth power" in the source code), the force pulling a post's ranking downward. Its default value is 1.8; this value is discussed in detail later in this article.
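The score built from P, T and G as defined above can be sketched in Python. This is an illustration of the formula only, not the Arc source; the function name `hn_score` is hypothetical.

```python
def hn_score(votes, age_hours, gravity=1.8):
    """Score = (P - 1) / (T + 2) ** G, using the definitions of P, T and G above."""
    return (votes - 1) / (age_hours + 2) ** gravity


# A post with only the poster's own vote scores 0, whatever its age.
print(hn_score(votes=1, age_hours=3))

# The same vote count scores lower as the post gets older.
print(hn_score(votes=200, age_hours=0), hn_score(votes=200, age_hours=24))
```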

Judging from this formula, three factors determine the ranking of a post:

The first factor is the number of votes P.

Other things being equal, the more votes, the higher the ranking.

Consider three posts published at the same time, with 200, 60 and 30 votes (199, 59 and 29 after subtracting 1), plotted as yellow, purple and blue curves respectively. At any point in time, the yellow curve is on top and the blue curve is at the bottom.

If you don't want the gap between "high-vote" and "low-vote" posts to be too large, you can raise the vote count to an exponent smaller than 1, for example (P-1)^0.8.

The second factor is the time T since posting.

Other things being equal, the more recently a post was published, the higher it ranks. In other words, a post's ranking keeps falling as time passes.

In the three-post example above, after 24 hours all posts score basically less than 1, which means they all drop to the end of the leaderboard, guaranteeing that the top ranks go to newer content.
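The decay of the three example posts (200, 60 and 30 votes) can be checked numerically. A small sketch, assuming the default gravity of 1.8; the helper name `hn_score` and the sampled hours are illustrative choices.

```python
def hn_score(votes, age_hours, gravity=1.8):
    """Score = (P - 1) / (T + 2) ** G."""
    return (votes - 1) / (age_hours + 2) ** gravity


posts = {"yellow": 200, "purple": 60, "blue": 30}

# Print each post's score at a few ages to see the curves fall together.
for hours in (0, 6, 12, 24):
    row = {name: round(hn_score(votes, hours), 2) for name, votes in posts.items()}
    print(hours, row)

# After 24 hours every score is below 1, and the ordering is unchanged.
scores_at_24h = [hn_score(votes, 24) for votes in posts.values()]
print(all(score < 1 for score in scores_at_24h))
```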

The third factor is the gravity factor G.

Its value determines how fast the ranking falls over time.

Take three curves whose other parameters are identical and whose G values are 1.5, 1.8 and 2.0 respectively. The larger G is, the steeper the curve and the faster the ranking falls, which means the leaderboard turns over faster.

Once you understand how the algorithm is composed, you can adjust the parameter values to suit your own application.
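One way to see the effect of G is to count how many hours a post needs to fall below a given score. The sketch below assumes a 200-vote post and a threshold of 1; both values, and the helper names, are arbitrary illustrative choices.

```python
def hn_score(votes, age_hours, gravity):
    """Score = (P - 1) / (T + 2) ** G."""
    return (votes - 1) / (age_hours + 2) ** gravity


def hours_until_below(votes, gravity, threshold=1.0):
    """Smallest whole number of hours after which the score is below `threshold`."""
    hours = 0
    while hn_score(votes, hours, gravity) >= threshold:
        hours += 1
    return hours


# A larger gravity value pushes the same post below the threshold sooner.
for g in (1.5, 1.8, 2.0):
    print(g, hours_until_below(votes=200, gravity=g))
```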

------------------------------------------------------------------------------------------------------------

Ranking algorithm based on user voting (ii): Reddit

Ruan Yifeng

Date: March 7, 2012

(Sorry, this series has been interrupted for nearly two weeks. I will try to finish the remaining parts as soon as possible over the next few days.)

Last time, I introduced the ranking algorithm of Hacker News. Its distinguishing feature is that users can only cast votes in favor, but many sites also let users vote against. In other words, besides praising an article, you can also give it a bad rating.

Reddit is the largest online community in the United States. In front of each post there is an upward arrow and a downward arrow, meaning "for" and "against" respectively. Users click to vote, and Reddit computes the latest "hot articles leaderboard" from the results.

How can votes for and votes against be combined to work out the most popular articles over a period of time? If article A has 100 votes for and 5 against, and article B has 1000 votes for and 950 against, which should rank higher?

The Reddit program is open source and written in Python. Its ranking code takes the following factors into account:

(1) How new the post is: t

t = posting time - December 8, 2005, 7:46:43

t is measured in seconds and computed from Unix timestamps. It is easy to see that once a post is published, t is a fixed value that does not change over time, and that the newer the post, the larger t is. As for December 8, 2005, it is presumably the time Reddit was established.

(2) The difference between votes for and votes against: x

x = votes for - votes against

(3) Voting direction: y

$$y = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

y is a sign variable representing the overall opinion of the article. If votes for are in the majority, y is +1; if votes against are in the majority, y is -1; if the two are equal, y is 0.

(4) How strongly positive (or negative) the post is: z

$$z = \begin{cases} |x| & \text{if } |x| \ge 1 \\ 1 & \text{if } |x| < 1 \end{cases}$$

z is the absolute value of the difference between votes for and votes against. The more one-sided the verdict on a post, the larger z is. If votes for equal votes against, z equals 1.
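The four variables can be computed as in the sketch below. It is reconstructed from the descriptions above rather than copied from Reddit's source. The function name `vote_variables` is hypothetical, and reading the reference time of December 8, 2005, 7:46:43 as UTC (Unix timestamp 1134028003) is an assumption.

```python
from datetime import datetime, timezone

# Assumption: the reference time 2005-12-08 07:46:43 is taken as UTC.
REFERENCE_TIME = 1134028003


def vote_variables(ups, downs, posted_at):
    """Return (t, x, y, z) for a post, following the definitions in the text."""
    t = posted_at.timestamp() - REFERENCE_TIME  # seconds, fixed once posted
    x = ups - downs                             # net votes
    y = 1 if x > 0 else -1 if x < 0 else 0      # direction of the vote
    z = abs(x) if x != 0 else 1                 # size of the gap, never below 1
    return t, x, y, z


posted = datetime(2012, 3, 7, tzinfo=timezone.utc)  # arbitrary sample posting time
print(vote_variables(100, 5, posted))
print(vote_variables(1000, 1000, posted))
```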

Combining the variables above, Reddit's final scoring formula is:

$$\text{Score} = \log_{10} z + \frac{y\,t}{45000}$$

This formula can be discussed in two parts:

One

$$\log_{10} z$$

This part says that the larger z, the gap between votes for and votes against, the higher the score.

Note that the logarithm is base 10, so z=10 earns 1 point and z=100 earns 2 points. That is, the first 10 net votes carry the same weight as the next 90 (and as the 900 after that): if a post is especially popular, later votes have less and less effect on its score.

When votes for equal votes against, z=1, so this part equals 0 and contributes nothing to the score.
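The diminishing weight of later votes can be seen by evaluating the logarithmic part on its own. A small sketch; the function name `vote_term` is hypothetical.

```python
from math import log10


def vote_term(ups, downs):
    """Base-10 logarithm of z, where z is the vote gap, floored at 1."""
    z = max(abs(ups - downs), 1)
    return log10(z)


# Each extra point needs ten times as many net votes as the previous one.
for net_votes in (1, 10, 100, 1000):
    print(net_votes, vote_term(net_votes, 0))

# Equal votes for and against contribute nothing.
print(vote_term(500, 500))
```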

Two

$$\frac{y\,t}{45000}$$

This part says that the larger t is, the higher the score, i.e. a new post scores higher than an old one. Its effect is to automatically pull down the ranking of old posts.

The 45,000 seconds in the denominator equal 12.5 hours, which means a post published a day later scores about 2 points more than one from the day before. Combined with the first part, it follows that for a post from the previous day to hold its position on the second day, its z value must grow 100-fold (100 times the net votes).

The role of y is to decide whether this part adds or subtracts points. When votes for exceed votes against, this part is positive and adds to the score; when votes for are fewer than votes against, it is negative and subtracts from the score; when the two are equal, it is 0. This guarantees that articles with many net votes for rank near the front; articles with few net votes for, or with votes for and against close to or equal to each other, rank further back; and articles with net votes against rank last (because their score is negative).
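Putting the two parts together gives the sketch below. It follows the variables t, x, y and z as described in this article and is not a copy of Reddit's source; the function name `hot` and the sample vote counts are illustrative.

```python
from math import log10


def hot(ups, downs, t):
    """Vote part plus time part, with t in seconds since the reference time."""
    x = ups - downs
    y = (x > 0) - (x < 0)   # +1, 0 or -1
    z = max(abs(x), 1)
    return log10(z) + y * t / 45000


one_day = 86400  # seconds

# Posting one day later is worth 86400 / 45000 = 1.92 points.
print(hot(10, 0, one_day) - hot(10, 0, 0))

# An older post needs roughly 100 times the net votes to stay ahead.
print(hot(1000, 0, 0) > hot(10, 0, one_day))
```

The one-day bonus of 1.92 points is what the text rounds to 2, and 2 points in the logarithmic part correspond to a 100-fold difference in net votes.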

Three

One problem with this algorithm is that controversial articles (those whose votes for and against are very close) are unlikely to rank near the front. Suppose two posts are published at the same time: article A has 1 vote for (cast by the poster) and 0 against, while article B has 1000 votes for and 1000 against. A will then rank higher than B, which is obviously unreasonable.
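The A versus B example can be checked directly. A sketch under the same reading of the formula as in the rest of this article; the function name `hot` and the value of `t` are hypothetical.

```python
from math import log10


def hot(ups, downs, t):
    """Vote part plus time part, with t in seconds since the reference time."""
    x = ups - downs
    y = (x > 0) - (x < 0)
    z = max(abs(x), 1)
    return log10(z) + y * t / 45000


t = 200_000_000  # hypothetical: both posts were published at the same moment

score_a = hot(1, 0, t)        # only the poster's own vote
score_b = hot(1000, 1000, t)  # heavily discussed, evenly split

# B's votes cancel out, so y = 0 and it gets no credit for time at all.
print(score_a, score_b)
```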

The conclusion is that Reddit's ranking is basically determined by posting time: hugely popular articles rank at the front, while moderately popular and controversial articles do not rank very high. This makes Reddit a community that reflects mainstream tastes rather than a place where radical or minority views get shown.

------------------------------------------------------------------------------------------------------------

Ranking algorithm based on user voting (iii): Stack Overflow

Ruan Yifeng

Date: March 11, 2012

In the previous article, I introduced Reddit's ranking algorithm.

Its distinguishing feature is that users can vote either for or against. That is, apart from the time factor, considering just two variables is enough.

However, some special-purpose sites must take more factors into account. Stack Overflow, the world's number one Q&A community for programmers, is such a site.

You post all kinds of programming questions and wait for others to answer. Visitors can vote on your question (for or against) to indicate whether it is valuable.

Once someone has answered your question, other people can also vote on the answer (for or against).

The job of the ranking algorithm is to find the hot questions in a given period, that is, the questions that get the most attention and the most discussion.

On Stack Overflow's pages, each question is preceded by three numbers: the question's score, the number of answers, and the number of times the question has been viewed. The algorithm can be designed on the basis of these variables.

Jeff Atwood, one of the founders, published the formula for the ranking score a few years ago.

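Expressed as a formula, it is:

$$\text{Score} = \frac{4 \log_{10}(\text{Qviews}) + \frac{\text{Qanswers} \times \text{Qscore}}{5} + \sum \text{Ascores}}{\left((\text{Qage} + 1) - \frac{\text{Qage} - \text{Qupdated}}{2}\right)^{1.5}}$$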

The variables in the algorithm mean the following:

(1) Qviews (number of times the question has been viewed)

$$4 \times \log_{10}(\text{Qviews})$$

The more a question is viewed, the more attention it is getting and the higher its score should be. A base-10 logarithm is used so that, as the view count grows, its effect on the score keeps shrinking.

(2) Qscore (question score) and Qanswers (number of answers)

$$\frac{\text{Qanswers} \times \text{Qscore}}{5}$$

First, Qscore (question score) = votes for - votes against. The more votes for a question gets, the higher it should naturally rank.

Qanswers is the number of answers, representing how many people took part in the question. The larger this value, the more the score is amplified. Note that if nobody has answered, Qanswers equals 0, and then even a high Qscore is useless: however good a question is, it must have at least one answer, otherwise it will not make the hot questions list.

(3) Ascores (answer scores)

$$\sum \text{Ascores}$$

Generally speaking, "answers" are more meaningful than "questions". The higher the answer scores, the higher the quality of the answers.

But I feel that a simple sum is not a thorough enough design. There are two problems. First, one correct answer is worth more than a hundred useless ones, yet simple summation gives 1 answer scoring 100 and 100 answers scoring 1 the same total. Second, because scores can be negative, particularly poor answers drag down the score contributed by correct answers.
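Both objections to the simple sum are easy to reproduce. A tiny sketch with made-up answer scores; the function name `answers_term` is hypothetical.

```python
def answers_term(answer_scores):
    """Simple sum of the scores of all answers to a question."""
    return sum(answer_scores)


one_great_answer = [100]          # a single answer scoring 100
many_weak_answers = [1] * 100     # a hundred answers scoring 1 each
great_plus_bad = [100, -30]       # the same great answer plus a heavily downvoted one

# First problem: the two very different questions get the same total.
print(answers_term(one_great_answer), answers_term(many_weak_answers))

# Second problem: a bad answer lowers the credit earned by the good one.
print(answers_term(great_plus_bad))
```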

(4) Qage (time since the question was asked) and Qupdated (time since the last answer)

$$\left((\text{Qage} + 1) - \frac{\text{Qage} - \text{Qupdated}}{2}\right)^{1.5}$$

Rewritten, it is easier to read:

$$\left(\frac{\text{Qage}}{2} + \frac{\text{Qupdated}}{2} + 1\right)^{1.5}$$

Qage and Qupdated are both measured in seconds. The longer a question has existed, or the longer since its last answer, the larger Qage and Qupdated become.

That is, as time goes by, both values grow, which makes the denominator larger, so the total score gets smaller and smaller.
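The time part can be sketched in two equivalent forms to confirm that the rewriting changes nothing. The exponent 1.5 and the exact shape of the expression are assumptions taken from the publicly posted version of the formula; the function names are hypothetical.

```python
def time_decay(qage, qupdated):
    """Denominator in its original form (exponent 1.5 is an assumption)."""
    return ((qage + 1) - (qage - qupdated) / 2) ** 1.5


def time_decay_rewritten(qage, qupdated):
    """The same denominator after simplifying: the average of the two times, plus 1."""
    return (qage / 2 + qupdated / 2 + 1) ** 1.5


# Both forms agree, and the value grows as either time grows.
print(time_decay(10, 4), time_decay_rewritten(10, 4))
print(time_decay(100, 50) > time_decay(10, 5))
```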

(5) Summary

The ranking of a hot question on Stack Overflow is directly proportional to participation (Qviews and Qanswers) and quality (Qscore and Ascores), and inversely related to time (Qage and Qupdated).
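The relationships in this summary can be sketched in one function. This is an illustration, not Stack Overflow's actual code. The constants 4, 5 and 1.5 are assumptions taken from the publicly posted version of the formula and should be checked against it. The function name `so_hot_score` and the guard against zero views are additions made for this example.

```python
from math import log10


def so_hot_score(qviews, qanswers, qscore, ascores, qage, qupdated):
    """Participation and quality in the numerator, elapsed time in the denominator."""
    views_term = 4 * log10(max(qviews, 1))   # guard added here: log10(0) is undefined
    question_term = (qanswers * qscore) / 5  # zero when nobody has answered
    answers_term = sum(ascores)
    decay = ((qage + 1) - (qage - qupdated) / 2) ** 1.5
    return (views_term + question_term + answers_term) / decay


# The same question scores lower once it is older and has gone quiet.
fresh = so_hot_score(100, 2, 10, [5, 3], qage=1, qupdated=1)
stale = so_hot_score(100, 2, 10, [5, 3], qage=100, qupdated=100)
print(fresh > stale)

# With no answers, the question's own score has no effect.
print(so_hot_score(100, 0, 50, [], 10, 10) == so_hot_score(100, 0, 0, [], 10, 10))
```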
