1. Overview
Most websites need a scoring system to gauge the quality of their entries. Time Network is one example.
In this article, we will learn how to build a usable scoring system.
2. Build a simple scoring system
In fact, creating a scoring system is very easy. First, we design the data table the scoring system needs, namely the user rating table (user_rating):
CREATE TABLE `user_rating` (
  `id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `user_id` INT(11) DEFAULT NULL,
  `item_id` INT(11) DEFAULT NULL,
  `rating` INT(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Next, the score of an item is just the simple average of its ratings.
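As a minimal sketch of that average, written in Python and assuming the rows of user_rating have already been loaded as (user_id, item_id, rating) tuples, the calculation could look like this. The function name average_scores and the sample rows are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict


def average_scores(ratings):
    """Return {item_id: mean rating} for (user_id, item_id, rating) rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user_id, item_id, rating in ratings:
        totals[item_id] += rating
        counts[item_id] += 1
    return {item: totals[item] / counts[item] for item in totals}


# Hypothetical sample rows, shaped like the user_rating table.
sample = [(1, 10, 4), (2, 10, 2), (1, 20, 5)]
print(average_scores(sample))  # {10: 3.0, 20: 5.0}
```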
3. Start with search engines
Let's set the scoring system aside for a moment and take a brief look at the history of search engines.
In 1994, Jerry Yang (Yang Zhiyuan) and David Filo created Yahoo!, which marked the beginning of modern search engines. But Yahoo! was not really a search engine at the time: most of its pages were added by hand, so we might call it a "website directory".
As the number of websites on the Internet grew, the workload became too large to handle by hand, and full-text retrieval emerged: the user enters keywords, and the search engine looks for those words in web pages and returns the matching pages.
This method actually works very well. Unfortunately, websites live on page views (PV) and clicks, so many of them stuffed their pages with keywords by unscrupulous means, the so-called SEO.
The Internet began to fall into chaos.
Around this time Google was founded (in 1998) and introduced the famous PageRank algorithm, which solved this problem to a large extent and became a milestone in the development of search engines. PageRank also brought collective intelligence into everyday life for the first time.
4. What is collective intelligence?
Let's look at the definition on Wikipedia:
Collective intelligence is a shared or group intelligence that emerges from the collaboration and competition of many individuals and appears in consensus decision making in bacteria, animals, humans and computer networks.
In short, collective intelligence is a shared or group intelligence that many individuals exhibit through collaboration and competition. That may still sound abstract, so take a real-life example: when you want to buy a laptop, you ask your friends what they think of it, and after weighing many people's opinions you make your own decision. That process is collective intelligence at work.
Now back to the entry scoring example. When a movie has just been released, how do you decide whether to watch it? The easiest way is to go online and check its rating. One or two people's taste may differ from yours, but a score that reflects many people's opinions is meaningful. The algorithm, as mentioned in section 2, is simply the average score.
5. PageRank algorithm
PageRank, often called the PR value in the SEO field, is also widely used as a measure of a website's authority.
The algorithm rests on the premise that an authoritative website is one that many sites link to, and that the sites an authoritative website links to are themselves authoritative. Put simply, the higher the weight of the sites linking to a page, the higher that page's weight. In addition, each page has a single vote to split among the pages it links to: if a page links to N websites, each of them receives only 1/N of a vote.
So we can come up with a simple formula:
PR(x) is the PageRank of web page x, L(x) is the number of outbound links on that page, and 1-q is the minimum rank value the system assigns to every page.
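Written out with these symbols, one common form of the formula is the following. It assumes that q acts as the damping factor and that B(x), a name introduced here for illustration, is the set of pages that link to x.

```latex
PR(x) = (1 - q) + q \sum_{y \in B(x)} \frac{PR(y)}{L(y)}
```

Each page y that links to x passes on its own rank divided by its number of outbound links, which is the 1/N vote described above.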
In matrix form, this becomes:
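A sketch of that matrix form, under the assumption that R is the column vector of all PR values, 1 is the all-ones vector, and M is a link matrix (R, 1 and M are names introduced here, not taken from the original):

```latex
\mathbf{R} = (1 - q)\,\mathbf{1} + q\,M\,\mathbf{R},
\qquad
M_{xy} =
\begin{cases}
  1 / L(y) & \text{if page } y \text{ links to page } x \\
  0        & \text{otherwise}
\end{cases}
```

Row x of this equation is the per-page formula: the entries of M in that row pick out exactly the pages that link to x, each weighted by 1/L(y).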
Next, the system iterates until the PR values of the whole system converge.
We can express this in code (using the R language):
Here we set the maximum number of iterations to 100.
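A rough Python equivalent of this iteration (not the original R listing) might look like the sketch below. It assumes the per-page form PR(x) = (1 - q) + q * sum(PR(y) / L(y)) over the pages y linking to x; the function name pagerank, the default q = 0.85, the tolerance tol, and the three-page example graph are all illustrative assumptions.

```python
def pagerank(links, q=0.85, max_iter=100, tol=1e-8):
    """Iterate PageRank over `links`, a dict mapping each page to the
    list of pages it links to. Returns {page: rank}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)

    rank = {p: 1.0 for p in pages}  # start with equal rank everywhere
    for _ in range(max_iter):
        # Every page receives the minimum rank 1 - q.
        new_rank = {p: 1.0 - q for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # simplification: pages with no outbound links cast no vote
            share = q * rank[page] / len(targets)  # the 1/N vote
            for target in targets:
                new_rank[target] += share
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop early once the values have settled
            break
    return rank


# Hypothetical three-page graph: a -> b, c; b -> c; c -> a.
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(example)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'a', 'b']
```

Page c ranks highest because it collects votes from both a and b, while b only receives half of a's vote.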
In addition, the PageRank algorithm introduces the concept of the Gini coefficient and makes significant adjustments for performance and to some parameter details. We will not go into those here.
5. PageRank vs. keyword matching
Recall from section 3 that keyword matching is actually a good method. It lost out to PageRank largely because of spam pages.
The same is true of scoring systems. Whether users are rating books, movies, or products, the scores carry great commercial value. So we can no longer rely on the plain average; following the idea of rank, we should assign a rank to each user.
6. Calculate UserRank
The point of PageRank is to measure the importance of a web page, so we can carry the idea over to UserRank: UserRank measures the importance of a user.
How, then, do we measure a user's importance? In a scoring system, I think rank should reflect the user's taste. For movie ratings, a user's weight depends on their taste in movies; for book ratings, it depends on their taste in books.
Therefore, we will divide the calculation of the entire scoring system into two steps:
A. Calculate UserRank
B. Calculate the entry score
Here we compute user weights by iterative approximation, much like the PageRank algorithm. We start by assuming that every user has the same weight, so the initial score of an item is simply the average of all users' ratings.
Every algorithm rests on some premise; this is true throughout the algorithms and data mining field. Our premise is that the closer a user's ratings are to public taste, the more influence those ratings should have, that is, the higher the user's weight.
Now for the first iteration. A simple approach is to compute the Pearson correlation between a user's ratings for all the entries they rated and the current scores of those entries.
After the iteration we have a Pearson correlation for each user, which we take as that user's weight for this round. Because the correlation can be negative, we treat negative values as 0, that is:
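One way to write this down, using notation introduced here for illustration: I_u is the set of entries user u has rated, r_{ui} is u's rating of entry i, s_i is the current score of entry i, and the bars denote means taken over I_u.

```latex
\rho_u =
\frac{\sum_{i \in I_u} (r_{ui} - \bar{r}_u)(s_i - \bar{s}_u)}
     {\sqrt{\sum_{i \in I_u} (r_{ui} - \bar{r}_u)^2}\;
      \sqrt{\sum_{i \in I_u} (s_i - \bar{s}_u)^2}},
\qquad
w_u = \max(0, \rho_u)
```

Here \rho_u is the Pearson correlation and w_u is the resulting weight of user u.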
Next, we recalculate each item's score after the iteration using these weights:
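A natural reading of this step is a weighted average. In the sketch below, U_i is the set of users who rated entry i, w_u is the weight of user u, r_{ui} is that user's rating, and s_i' is the new score; these symbols are assumptions introduced for illustration.

```latex
s_i' = \frac{\sum_{u \in U_i} w_u \, r_{ui}}{\sum_{u \in U_i} w_u}
```

When every weight equals 1, this reduces to the plain average used as the starting point.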
After each iteration, we can calculate the gap between the score after this iteration and the score after the previous iteration:
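The text does not say how the gap is measured. One possible choice, assumed here, is the largest absolute change in any entry's score, where s_i is the score before the iteration and s_i' the score after it:

```latex
\Delta = \max_i \left| s_i' - s_i \right|
```

A mean or sum of the absolute changes would serve the same purpose with a differently scaled threshold.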
We generally consider the result converged when the gap in entry scores between two successive iterations falls below a certain threshold.
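Putting steps A and B together, the whole loop can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the names pearson and user_rank, the threshold value, the use of the largest score change as the gap, and the handling of undefined correlations (treated as weight 0) are all choices made here.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: an undefined correlation earns no weight
    return cov / sqrt(var_x * var_y)


def user_rank(ratings, max_iter=100, threshold=1e-4):
    """ratings: iterable of (user, item, rating). Returns (weights, scores)."""
    by_user = defaultdict(dict)
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating
        by_item[item][user] = rating

    # Equal weights at the start, so initial scores are plain averages.
    weights = {user: 1.0 for user in by_user}
    scores = {item: sum(r.values()) / len(r) for item, r in by_item.items()}

    for _ in range(max_iter):
        # Step A: weight = Pearson correlation with current scores, floored at 0.
        weights = {
            user: max(0.0, pearson(list(r.values()), [scores[i] for i in r]))
            for user, r in by_user.items()
        }
        # Step B: weighted average score per item.
        new_scores = {}
        for item, r in by_item.items():
            total = sum(weights[u] for u in r)
            if total > 0:
                new_scores[item] = sum(weights[u] * x for u, x in r.items()) / total
            else:
                new_scores[item] = scores[item]  # no weighted raters: keep old score
        gap = max(abs(new_scores[i] - scores[i]) for i in scores)
        scores = new_scores
        if gap < threshold:
            break
    return weights, scores


# Hypothetical data: three users who broadly agree and one who rates in reverse.
sample = [
    ("u1", "m1", 5), ("u1", "m2", 3), ("u1", "m3", 1),
    ("u2", "m1", 4), ("u2", "m2", 3), ("u2", "m3", 2),
    ("u3", "m1", 5), ("u3", "m2", 4), ("u3", "m3", 1),
    ("u4", "m1", 1), ("u4", "m2", 3), ("u4", "m3", 5),
]
weights, scores = user_rank(sample)
print(weights)
print(scores)
```

In this sample the fourth user's ratings correlate negatively with the consensus, so that user's weight drops to 0 and the final scores are driven by the other three.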
In engineering practice, however, with, say, a million users and a million entries, each iteration takes a great deal of time, so we need some clever methods.
7. Engineering practices for the scoring system
In general, we can optimize in the following two ways:
A. Reduce the number of entries. The 80/20 rule applies to scoring systems in most cases: 80% of the ratings come from 20% of the entries. So if we select only those 20% of entries, or an even smaller representative set, we can greatly reduce the computation in each iteration.
B. Fix the number of iterations. As with the PageRank calculation in the R code, we can specify the number of iterations by hand instead of waiting for full convergence.
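Optimization A can be sketched in Python as below, assuming ratings are (user_id, item_id, rating) tuples and that "representative" means "most frequently rated". The helper names top_rated_items and restrict_to_items and the 0.2 default are illustrative assumptions.

```python
from collections import Counter


def top_rated_items(ratings, fraction=0.2):
    """Return the set of items that receive the most ratings.

    Keeps roughly `fraction` of all items, and always at least one.
    """
    counts = Counter(item for _user, item, _rating in ratings)
    keep = max(1, int(len(counts) * fraction))
    return {item for item, _count in counts.most_common(keep)}


def restrict_to_items(ratings, items):
    """Keep only the rating rows whose item is in `items`."""
    return [row for row in ratings if row[1] in items]


# Hypothetical rows: item "a" is rated three times, the other four items once each.
rows = [
    (1, "a", 5), (2, "a", 4), (3, "a", 3),
    (1, "b", 2), (1, "c", 1), (2, "d", 5), (3, "e", 4),
]
popular = top_rated_items(rows)
print(popular)                                # {'a'}
print(len(restrict_to_items(rows, popular)))  # 3
```

User weights would then be computed on the reduced rows only and applied to all entries afterwards. Optimization B needs no extra code beyond capping the loop, for example with a maximum-iterations parameter on the iteration routine.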
8. Significance of the UserRank algorithm
If the birth of the PageRank algorithm greatly reduced the number of spam pages, what is the significance of UserRank?
We divide spammers into two types:
A. Malicious raters, for example those who inflate software or movie ratings. They typically give a batch of entries uniformly high or uniformly low ratings, so they inevitably deviate from the average scores given by most users, and their rank ends up very low.
B. Troublemakers. They do not deliberately push scores up or down; they simply treat the site's ratings as a joke. The algorithm weakens the effect of this behavior as well.
9. Can rank replace anti-spam measures?
Earlier we said that PageRank was effective against spam pages. However, spammers have come up with many ways to game PageRank, such as building many junk sites and linking them to one another.
But Google is not defeated by such behavior, because PageRank is not its only algorithm (I even doubt whether it still uses PageRank); it also has excellent anti-spam algorithms. Similarly, in engineering practice, many people hope to rely on rank algorithms alone to resist spammers. However, I have always thought that rank should only reflect a user's taste, that is, how similar the user is to the public. In front of the rank layer there should be a separate anti-spam algorithm that identifies spam users.
Identification methods differ from site to site. A simple one identifies spam users by IP address; a more complex one applies machine learning to behavioral attributes. That is beyond the scope of this article.
10. Summary
Almost every website with entries needs a scoring system to help users make choices. It not only gives users a reference when they consume, it also provides good support for the site's internal data (for example, the top 10 entries best received by users).
This article described how to build a nearly complete scoring system by calculating UserRank. I hope it helps.