Recommendation system evaluation methods and indicators


First, a disclaimer: the following was written after reading Xiang Liang's book Recommendation System Practice. The content largely comes from the book, but I summarize it in my own words here (so as not to get flamed for simply copying it).


In recommendation systems, three experimental methods are commonly used to evaluate recommendation results:

1) Offline experiments. User behavior data is usually collected from the log system and split into a training set and a test set, for example 80% for training and 20% for testing (cross-validation can also be used). The user-interest model is then trained on the training set and evaluated on the test set. Advantages: only a dataset is required, not the live recommendation system, and because no human involvement is needed, a large number of different algorithms can be tested quickly and cheaply. The disadvantage is that many metrics of the live system cannot be obtained offline, such as click-through rate and conversion rate (precisely because no real users are involved). A rough sketch of such a split is shown below.
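
As a rough sketch of what such an offline split might look like (the train_model and evaluate helpers mentioned in the usage comment are hypothetical placeholders, not functions from the book):

```python
# Minimal sketch of an offline experiment: split logged (user, item) interactions
# into training and test sets, then train and evaluate a model on them.
import random

def split_data(interactions, test_ratio=0.2, seed=42):
    """Randomly split interaction records into train and test sets (e.g. 80/20)."""
    rng = random.Random(seed)
    train, test = [], []
    for record in interactions:
        (test if rng.random() < test_ratio else train).append(record)
    return train, test

# Usage (train_model and evaluate are hypothetical placeholders):
# train, test = split_data(logged_interactions)
# model = train_model(train)        # fit the user-interest model on training data
# metrics = evaluate(model, test)   # measure precision/recall etc. on the test set
```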


2) User surveys. Offline experiments mostly measure prediction accuracy, but accuracy is not the same as satisfaction. Therefore, before launching an algorithm, user satisfaction should be investigated and tested directly with real users.


3) A/B testing. Users are randomly divided into several groups according to certain rules, and different recommendation algorithms are applied to the users in each group. This makes it possible to compare the online performance of different algorithms fairly. The disadvantage is that the cycle is long: reliable results require running the experiment for an extended period. A sketch of how users might be bucketed follows.
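
A minimal sketch of how users might be assigned to groups deterministically, so the same user always sees the same algorithm (the group names and the use of an MD5 hash here are my own illustrative assumptions, not a prescribed scheme):

```python
# Deterministic A/B bucketing: hash the user id and map it to an experiment group.
import hashlib

def assign_group(user_id, groups=("control", "algorithm_A", "algorithm_B")):
    """Hash the user id and map it to one of the experiment groups."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(groups)
    return groups[bucket]

# Example: assign_group("user_123") always returns the same group for this user.
```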


The main measurement indicators are as follows:

1) User satisfaction. This is the most critical indicator: whatever the system recommends, we hope the recommended items satisfy users. Two measurement methods are possible: user questionnaire surveys and online satisfaction feedback. For example, Douban's recommendations carry buttons for "satisfied" and "dissatisfied", and Amazon can check whether recommended items were actually purchased. In practice, click-through rate, dwell time, conversion rate, and similar metrics are used as proxies for satisfaction.


2) Prediction accuracy. For rating prediction (e.g., movie scoring), the Root Mean Squared Error (RMSE, the square root of the mean of the squared errors) and the Mean Absolute Error (MAE, the mean of the absolute errors) are generally used. For Top-N recommendation, precision and recall are the main metrics. Precision is the fraction of the N recommended items that are actually relevant; recall is the fraction of all relevant items that appear in the recommendation list. The difference between the two is the denominator: precision divides by the number of recommended items, recall by the number of relevant items. A sketch of these metrics is given below.
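
A small sketch of these four metrics (my own illustrative helpers; predictions holds (predicted, actual) rating pairs, and recommended/relevant are item collections for a single user):

```python
# RMSE / MAE for rating prediction, precision / recall for Top-N recommendation.
import math

def rmse(predictions):
    """Root Mean Squared Error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in predictions) / len(predictions))

def mae(predictions):
    """Mean Absolute Error over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in predictions) / len(predictions)

def precision_at_n(recommended, relevant):
    """Hits divided by the number of recommended items (the list length N)."""
    return len(set(recommended) & set(relevant)) / len(recommended)

def recall_at_n(recommended, relevant):
    """Hits divided by the number of relevant (ground-truth) items."""
    return len(set(recommended) & set(relevant)) / len(relevant)

# Examples:
# rmse([(4.2, 5), (3.1, 3)])                      -> about 0.57
# precision_at_n(["a", "b", "c"], {"b", "d"})     -> 1/3
# recall_at_n(["a", "b", "c"], {"b", "d"})        -> 1/2
```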


3) Coverage. This measures whether the recommendation results cover the item catalog well, i.e., whether every item has a chance of being recommended. The simplest measure is the proportion of distinct recommended items to the total number of items. This is of course rough; the distribution of recommendations can be measured more precisely with information entropy or the Gini coefficient, as sketched below.
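
A rough sketch of these measures (assuming all_rec_lists holds one recommendation list per user and catalog_size is the total item count; the natural logarithm is used for the entropy):

```python
# Coverage, entropy, and Gini coefficient of the recommendation distribution.
import math
from collections import Counter

def coverage(all_rec_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = {item for rec in all_rec_lists for item in rec}
    return len(recommended) / catalog_size

def entropy(all_rec_lists):
    """Shannon entropy of item recommendation frequencies (higher = more even)."""
    counts = Counter(item for rec in all_rec_lists for item in rec)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def gini(all_rec_lists):
    """Gini coefficient of the frequency distribution (0 = perfectly even)."""
    counts = sorted(Counter(item for rec in all_rec_lists for item in rec).values())
    n, total = len(counts), sum(counts)
    return sum((2 * (i + 1) - n - 1) * c for i, c in enumerate(counts)) / (n * total)
```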


4) Diversity. The recommendation results should reflect the diversity of the user's interests. For example, if I like both action movies and art-house films, the recommendation list should contain both types, ideally in proportion to my preferences: if 80% of what I watch is action and 20% is art-house, the list should roughly follow that ratio. Diversity can be estimated from the similarity between items: if all items in a recommendation list are highly similar to one another, they are usually of the same category and the list lacks diversity. A sketch of this idea follows.
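
One common way to quantify this is one minus the average pairwise similarity within the list; a sketch (assuming some similarity(a, b) function is available, e.g. cosine similarity of item vectors, which is not defined here):

```python
# Intra-list diversity: 1 - average pairwise similarity of the recommended items.
from itertools import combinations

def diversity(rec_list, similarity):
    """Return 1 minus the average pairwise similarity of items in one list."""
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 1.0
    avg_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_sim
```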


5) Novelty. The system should not recommend items the user already knows; otherwise the recommendation loses its meaning. Generally we want to recommend items the user has not seen or bought. The first step is to filter out items the user has already viewed or purchased, but that alone is not enough. A common approach is to compute the average popularity of the recommended items, because the less popular an item is, the more novel it tends to feel. For example, if I am a fan of Stephen Chow, recommending "linqi" would feel very novel, because I did not know it featured Stephen Chow. A sketch of the popularity-based proxy follows.
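
A sketch of this popularity-based proxy (the log transform is my own assumption, a common way to dampen the long-tail skew of item popularity; interactions are assumed to be (user, item) pairs):

```python
# Novelty proxy: the lower the average popularity of the recommended items,
# the more novel the list is likely to feel to the user.
import math
from collections import Counter

def average_popularity(rec_list, interactions):
    """Mean log-popularity of recommended items, using interaction counts as popularity."""
    popularity = Counter(item for _, item in interactions)
    return sum(math.log(1 + popularity[item]) for item in rec_list) / len(rec_list)
```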


6) Surprise. This differs from novelty. Surprise means I cannot see why the system recommended an item, say a movie, yet after watching it I find it suits my taste very well: a pleasant surprise. In the example above, as soon as I know the movie features Stephen Chow, there may be no surprise left, because I can tell it was recommended to me based on the actor. Note: there are currently no well-established quantitative standards for measuring novelty and surprise.


7) Trust. If users trust the recommendation system, they will interact with it more, which in turn yields better personalized recommendations. A common way to increase trust is to provide recommendation explanations, i.e., to state why an item was recommended and make the reasoning credible. Trust can also be increased by leveraging friend relationships, as Facebook does: in general, people are more willing to accept recommendations from friends than from strangers.


8) Real-time performance. Items such as news are highly time-sensitive and must generally be recommended while they are still relevant. The recommendation system must also be able to handle newly added items, i.e., the item cold-start problem.


9) Robustness. The system should be able to resist attacks; for example, some sellers register many fake accounts to rate their own products highly and push up their rankings.


10) Business goals. Generally, recommendation systems are ultimately designed to increase profit... which, of course, is hard to test directly...
