Recommendation system evaluation methods and indicators


First, a disclaimer: the following was written after reading Xiang Liang's book Recommendation System Practice. The content largely comes from the book, but I summarize it in my own words here (so as not to get flamed for simply copying it).


In recommendation systems, three experimental methods are commonly used to evaluate recommendation results:

1) Offline experiments. User behavior data is usually collected from the log system and split into a training set and a test set, for example 80% for training and 20% for testing (cross-validation can also be used). The user-interest model is then trained on the training set and evaluated on the test set. Advantages: only a dataset is required, not the live recommendation system, and because no human involvement is needed, a large number of different algorithms can be tested quickly and cheaply. The disadvantage is that many metrics of the live system cannot be obtained offline, such as click-through rate and conversion rate (precisely because no real users are involved). A rough sketch of such a split is shown below.
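
As a rough sketch of what such an offline split might look like (the train_model and evaluate helpers mentioned in the usage comment are hypothetical placeholders, not functions from the book):

```python
# Minimal sketch of an offline experiment: split logged (user, item) interactions
# into training and test sets, then train and evaluate a model on them.
import random

def split_data(interactions, test_ratio=0.2, seed=42):
    """Randomly split interaction records into train and test sets (e.g. 80/20)."""
    rng = random.Random(seed)
    train, test = [], []
    for record in interactions:
        (test if rng.random() < test_ratio else train).append(record)
    return train, test

# Usage (train_model and evaluate are hypothetical placeholders):
# train, test = split_data(logged_interactions)
# model = train_model(train)        # fit the user-interest model on training data
# metrics = evaluate(model, test)   # measure precision/recall etc. on the test set
```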


2) User surveys. Offline experiments mostly measure prediction accuracy, but accuracy is not the same as satisfaction. Therefore, before launching an algorithm, user satisfaction should be investigated and tested directly with real users.


3) A/B testing. Users are randomly divided into several groups according to certain rules, and different recommendation algorithms are applied to the users in each group. This makes it possible to compare the online performance of different algorithms fairly. The disadvantage is that the cycle is long: reliable results require running the experiment for an extended period. A sketch of how users might be bucketed follows.
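
A minimal sketch of how users might be assigned to groups deterministically, so the same user always sees the same algorithm (the group names and the use of an MD5 hash here are my own illustrative assumptions, not a prescribed scheme):

```python
# Deterministic A/B bucketing: hash the user id and map it to an experiment group.
import hashlib

def assign_group(user_id, groups=("control", "algorithm_A", "algorithm_B")):
    """Hash the user id and map it to one of the experiment groups."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(groups)
    return groups[bucket]

# Example: assign_group("user_123") always returns the same group for this user.
```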


The main measurement indicators are as follows:

1) User satisfaction. This is the most critical indicator: whatever the system recommends, we hope the recommended items satisfy users. Two measurement methods are possible: user questionnaire surveys and online satisfaction feedback. For example, Douban's recommendations carry buttons for "satisfied" and "dissatisfied", and Amazon can check whether recommended items were actually purchased. In practice, click-through rate, dwell time, conversion rate, and similar metrics are used as proxies for satisfaction.


2) Prediction accuracy. For rating prediction (e.g., movie scoring), the Root Mean Squared Error (RMSE, the square root of the mean of the squared errors) and the Mean Absolute Error (MAE, the mean of the absolute errors) are generally used. For Top-N recommendation, precision and recall are the main metrics. Precision is the fraction of the N recommended items that are actually relevant; recall is the fraction of all relevant items that appear in the recommendation list. The difference between the two is the denominator: precision divides by the number of recommended items, recall by the number of relevant items. A sketch of these metrics is given below.
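
A small sketch of these four metrics (my own illustrative helpers; predictions holds (predicted, actual) rating pairs, and recommended/relevant are item collections for a single user):

```python
# RMSE / MAE for rating prediction, precision / recall for Top-N recommendation.
import math

def rmse(predictions):
    """Root Mean Squared Error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in predictions) / len(predictions))

def mae(predictions):
    """Mean Absolute Error over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in predictions) / len(predictions)

def precision_at_n(recommended, relevant):
    """Hits divided by the number of recommended items (the list length N)."""
    return len(set(recommended) & set(relevant)) / len(recommended)

def recall_at_n(recommended, relevant):
    """Hits divided by the number of relevant (ground-truth) items."""
    return len(set(recommended) & set(relevant)) / len(relevant)

# Examples:
# rmse([(4.2, 5), (3.1, 3)])                      -> about 0.57
# precision_at_n(["a", "b", "c"], {"b", "d"})     -> 1/3
# recall_at_n(["a", "b", "c"], {"b", "d"})        -> 1/2
```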


3) Coverage. This measures whether the recommendation results cover the item catalog well, i.e., whether every item has a chance of being recommended. The simplest measure is the proportion of distinct recommended items to the total number of items. This is of course rough; the distribution of recommendations can be measured more precisely with information entropy or the Gini coefficient, as sketched below.
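
A rough sketch of these measures (assuming all_rec_lists holds one recommendation list per user and catalog_size is the total item count; the natural logarithm is used for the entropy):

```python
# Coverage, entropy, and Gini coefficient of the recommendation distribution.
import math
from collections import Counter

def coverage(all_rec_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = {item for rec in all_rec_lists for item in rec}
    return len(recommended) / catalog_size

def entropy(all_rec_lists):
    """Shannon entropy of item recommendation frequencies (higher = more even)."""
    counts = Counter(item for rec in all_rec_lists for item in rec)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def gini(all_rec_lists):
    """Gini coefficient of the frequency distribution (0 = perfectly even)."""
    counts = sorted(Counter(item for rec in all_rec_lists for item in rec).values())
    n, total = len(counts), sum(counts)
    return sum((2 * (i + 1) - n - 1) * c for i, c in enumerate(counts)) / (n * total)
```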


4) Diversity. The recommendation results should reflect the diversity of the user's interests. For example, if I like both action movies and art-house films, the recommendation list should contain both types, ideally in proportion to my preferences: if 80% of what I watch is action and 20% is art-house, the list should roughly follow that ratio. Diversity can be estimated from the similarity between items: if all items in a recommendation list are highly similar to one another, they are usually of the same category and the list lacks diversity. A sketch of this idea follows.
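
One common way to quantify this is one minus the average pairwise similarity within the list; a sketch (assuming some similarity(a, b) function is available, e.g. cosine similarity of item vectors, which is not defined here):

```python
# Intra-list diversity: 1 - average pairwise similarity of the recommended items.
from itertools import combinations

def diversity(rec_list, similarity):
    """Return 1 minus the average pairwise similarity of items in one list."""
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 1.0
    avg_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_sim
```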


5) Novelty. The system should not recommend items the user already knows; otherwise the recommendation loses its meaning. Generally we want to recommend items the user has not seen or bought. The first step is to filter out items the user has already viewed or purchased, but that alone is not enough. A common approach is to compute the average popularity of the recommended items, because the less popular an item is, the more novel it tends to feel. For example, if I am a fan of Stephen Chow, recommending "linqi" would feel very novel, because I did not know it featured Stephen Chow. A sketch of the popularity-based proxy follows.
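
A sketch of this popularity-based proxy (the log transform is my own assumption, a common way to dampen the long-tail skew of item popularity; interactions are assumed to be (user, item) pairs):

```python
# Novelty proxy: the lower the average popularity of the recommended items,
# the more novel the list is likely to feel to the user.
import math
from collections import Counter

def average_popularity(rec_list, interactions):
    """Mean log-popularity of recommended items, using interaction counts as popularity."""
    popularity = Counter(item for _, item in interactions)
    return sum(math.log(1 + popularity[item]) for item in rec_list) / len(rec_list)
```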


6) Surprise. This differs from novelty. Surprise means I cannot see why the system recommended an item, say a movie, yet after watching it I find it suits my taste very well: a pleasant surprise. In the example above, as soon as I know the movie features Stephen Chow, there may be no surprise left, because I can tell it was recommended to me based on the actor. Note: there are currently no well-established quantitative standards for measuring novelty and surprise.


7) Trust. If users trust the recommendation system, they will interact with it more, which in turn yields better personalized recommendations. A common way to increase trust is to provide recommendation explanations, i.e., to state why an item was recommended and make the reasoning credible. Trust can also be increased by leveraging friend relationships, as Facebook does: in general, people are more willing to accept recommendations from friends than from strangers.


8) Real-time performance. Items such as news are highly time-sensitive and must generally be recommended while they are still relevant. The recommendation system must also be able to handle newly added items, i.e., the item cold-start problem.


9) Robustness. The system should be able to resist attacks; for example, some sellers register many fake accounts to rate their own products highly and push up their rankings.


10) Business goals. Generally, recommendation systems are ultimately designed to increase profit... which, of course, is hard to test directly...
