Zheng @ playpoly Sr 20091003
Xlvector of the Chinese Academy of Sciences (that is, Xiang Liang) and his team, The Ensemble, took first place on the public test set of the Netflix Prize in July, but on September 22 Netflix announced BPC as the winner, reportedly because The Ensemble submitted its results 20 minutes later. He recently released a small tool, grsuggest, which is a bit like the "personalized reading" feature that kuber built in feedzshare: both rest on the same algorithmic principle of starting from articles to recommend other articles that may interest the user.
Xiang Liang said in "thinking about grsuggest": "De-duplication is a very common problem in article recommendation. Many articles have been reposted n times, and I often find an old post of mine from years ago being reposted. In fact, my recommendation system recommends reposts too."
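As a rough illustration of the de-duplication problem he describes, near-identical reposts can be caught by comparing word shingles with Jaccard similarity. This is only a minimal sketch, not grsuggest's actual method: the function names, the 3-word shingle size, and the 0.8 threshold are all assumptions, and it assumes whitespace-delimited text (Chinese text would need character n-grams instead).

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(articles, threshold=0.8):
    """Keep the first copy of each article and drop later near-duplicates.

    `articles` is a list of (article_id, text) pairs in arrival order.
    The threshold is an assumed value, not a tuned one.
    """
    kept = []  # (article_id, shingle_set) of the articles retained so far
    for article_id, text in articles:
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((article_id, sig))
    return [article_id for article_id, _ in kept]
```

Comparing every new article against every kept one is quadratic, so a real system with many articles would likely need an index or hashing scheme on top of this idea.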
This repost problem leads on to three common problems that have no real solution.
1. Martian Phenomenon
I posted a tweet a while ago: "I wonder whether Digg can solve the problem of Martian posts (old content resurfacing as if it were new) being recommended over and over. This should be a problem common to all Digg-style communities: no matter how many times a post or joke has already been dugg recently, every now and then someone will post it again as if it were a treasure, and a crowd of Martians will recommend it."
Some people think that if a Martian post is excellent, it has every right to be dug up again. Note, however, that within a single community one can assume the users share a similar knowledge structure, so old posts can reasonably be dug up; the tianya community has done this many times. In a recommendation system, though, serving up many old jokes regardless of the user's knowledge structure really does drive users away.
The crux of the Martian phenomenon is something we have discussed many times before: the recommendation system cannot learn the user's prior knowledge structure. That is, because a single, newly launched personalized recommendation system does not know the user's knowledge structure (their previous reading history and experience), many recommended items will inevitably be ones the user already knows well or has already read. This is a bad experience for both the application's creators and its users, yet it cannot be avoided. A very simple example: if you have not been on Douban for long, Douban will keep recommending, based on your few actions, many things you have already read, heard, or watched, and for a long time you are forced to click through them one by one so that Douban learns your history.
The same problem exists in recommendation systems derived from Google Reader shared items. Shared items do not reflect the user's reading history: reading an article in greader does not mean you will share it, or even mark it as liked. This is problem 1: the data fundamentally cannot fully reflect the user's reading history.
My statistics on Chinese users of shared items show that for a considerable share of users (I estimate 50 to 60%), the shared articles come from no more than 5 sources (that is, distinct blogs). 10% of users even share articles from at most 2 sources. Most articles shared by Chinese users come from the sites listed in "rankings | leaderboard". This is problem 2: can a recommendation system do anything for the large number of users with narrow reading horizons?
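The kind of statistic quoted above (what share of users draw on at most 5, or at most 2, sources) could be computed along these lines. This is a minimal sketch under assumed inputs: the (user, source) pair format and both function names are hypothetical, not the author's actual pipeline.

```python
from collections import defaultdict

def source_counts(shares):
    """Map each user to the number of distinct sources they shared from.

    `shares` is an iterable of (user_id, source_id) pairs, one per shared article.
    """
    sources = defaultdict(set)
    for user_id, source_id in shares:
        sources[user_id].add(source_id)
    return {user: len(srcs) for user, srcs in sources.items()}

def fraction_at_most(counts, max_sources):
    """Fraction of users whose distinct-source count is <= max_sources."""
    if not counts:
        return 0.0
    narrow = sum(1 for n in counts.values() if n <= max_sources)
    return narrow / len(counts)
```

Calling `fraction_at_most(source_counts(shares), 5)` on real share logs would give the figure estimated above as 50 to 60%.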
2. Time-sensitive and timeless
Some time ago, Liu Weipeng made a good suggestion for playpoly SR: "Articles should be divided into 'time-sensitive' (news, political topics, and so on) and 'timeless' (reading notes, GTD methods, and so on). This would seem to require manual classification or advanced natural language processing, but I have noticed a good approach: in general, people share time-sensitive articles in greader and discuss them on Twitter, whereas timeless articles get added to delicious, because greader/Twitter stand for sharing and discussion while delicious stands for collecting and looking things up."
He also noticed a useful signal: "A timeless article will usually keep being added to delicious over a long period. This is an excellent basis for judgment; highly time-sensitive articles do not have this property." In other words, by looking at when delicious users bookmarked an article, you can tell which articles are time-sensitive.
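One way to turn that observation into a rule is to check how many of an article's bookmarks arrive long after the first one. The sketch below assumes the bookmark timestamps have already been collected somehow; the function name, the 30-day window, and the 20% threshold are illustrative assumptions, not values from Liu Weipeng's suggestion.

```python
from datetime import datetime, timedelta

def is_timeless(bookmark_times, window=timedelta(days=30), min_late_share=0.2):
    """Guess whether an article is timeless from its bookmark timestamps.

    An article counts as timeless if at least `min_late_share` of its
    bookmarks were added more than `window` after the first bookmark.
    Both thresholds are assumed values and would need tuning on real data.
    """
    if not bookmark_times:
        return False
    first = min(bookmark_times)
    late = sum(1 for t in bookmark_times if t - first > window)
    return late / len(bookmark_times) >= min_late_share
```

A news item bookmarked ten times within a single day would come out as time-sensitive, while a set of reading notes bookmarked every few weeks for half a year would come out as timeless.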
Xiang Liang also mentioned: "If you want to dig up old posts, you must first tell news apart from articles. Digging up old news makes no sense, but knowledge-oriented articles can still be dug up."
This raises another question for Google Reader-based recommendation systems: should highly time-sensitive articles be recommended at all?
If you really can tell how time-sensitive an article is, you can add a rule against the "Martian phenomenon": the recommendation system does not recommend highly time-sensitive articles. First, users have probably already seen them through other channels such as forums, Twitter, or IM. Second, even if a user has not seen them, a user who does not use the recommendation system often will still come away with the impression that it serves outdated articles. After all, reading is different from watching movies: you can recommend a very old movie, but you cannot recommend very old news.
Timeless articles offer an opportunity as well. Liu Weipeng believes that "being able to determine timeliness raises the signal-to-noise ratio. A separate tab could be created for the list of timeless articles, so that users who arrive later can still reach the best articles of past periods rather than a mass of gossip or political topics. The advantage of a list of excellent timeless articles is that it builds new readers' trust in the quality of playpoly SR." Although I later provided an archive page, I did not distinguish by timeliness.
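The rule and the separate tab discussed above both come down to splitting a ranked list by timeliness. A minimal sketch, assuming each article already has a score and that some upstream step has produced a set of ids judged timeless (both the input format and the function name are hypothetical):

```python
def split_by_timeliness(articles, timeless_ids):
    """Split scored articles into a timeless list and a time-sensitive list.

    `articles` is a list of (article_id, score) pairs; `timeless_ids` is a set
    of ids already judged timeless by some upstream classifier.
    Both output lists are sorted by score, highest first.
    """
    ranked = sorted(articles, key=lambda pair: pair[1], reverse=True)
    timeless = [a for a, _ in ranked if a in timeless_ids]
    time_sensitive = [a for a, _ in ranked if a not in timeless_ids]
    return timeless, time_sensitive
```

The first list could feed a "timeless" tab or the recommender itself, while the second could be shown only while it is still fresh, or withheld from recommendations entirely.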
3. Is it difficult to surprise?
Xiang Liang believes: "Recommended articles should not only match users' interests but also help users broaden their horizons. There has been a lot of research in this area over the past few years, that is, on finding what can surprise users. The main problem with such algorithms is that they cannot be evaluated, because nobody knows what counts as a surprise for a user."
Yes, it's hard to surprise.
What is a surprise? An item (article, movie, music, picture, video) that lies outside the user's knowledge structure and that the user currently likes. I say "currently" because a user's interests are dynamic.
Why can stumbleupon always surprise users?
Garrett Camp, the algorithm designer of StumbleUpon, once provided a flowchart describing what StumbleUpon does behind the scenes when you press the Stumble! button:
The figure shows three factors:
A. Your topics, that is, your actions on web pages, such as like, dislike, and quick stumbles (when you arrive at a page and, without voting on it, click the Stumble! button again to jump to another page, which is treated as a "soft not-for-me" or down-vote).
B. Socially endorsed pages, that is, items liked by your friends on the site.
C. Peer endorsed pages, that is, items that the system computes to be liked by people whose voting habits are similar to yours.
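To make the three factors concrete, here is a toy scorer that combines them. It is emphatically not StumbleUpon's algorithm, whose details the flowchart does not give: the vote encoding (+1 for like, -1 for dislike or quick stumble), the similarity measure, the equal weights, and all function and parameter names are assumptions made for illustration.

```python
def vote_similarity(votes_a, votes_b):
    """Agreement between two users' votes on the pages both have voted on.

    Each argument maps page_id -> +1 (like) or -1 (dislike or quick stumble).
    Returns a value in [-1, 1]; 0.0 when there is no overlap.
    """
    common = votes_a.keys() & votes_b.keys()
    if not common:
        return 0.0
    return sum(votes_a[p] * votes_b[p] for p in common) / len(common)

def score_candidates(user, votes, friends, page_topics, liked_topics,
                     w_topic=1.0, w_social=1.0, w_peer=1.0):
    """Rank pages the user has not seen, using topics, friends, and peers."""
    seen = votes[user].keys()
    scores = {}
    for other, other_votes in votes.items():
        if other == user:
            continue
        sim = vote_similarity(votes[user], other_votes)
        for page, vote in other_votes.items():
            if page in seen or vote <= 0:
                continue  # only unseen pages that the other user liked
            s = scores.get(page, 0.0)
            if other in friends:
                s += w_social       # B: liked by a friend
            if sim > 0:
                s += w_peer * sim   # C: liked by someone who votes like you
            scores[page] = s
    for page in scores:
        if page_topics.get(page) in liked_topics:
            scores[page] += w_topic  # A: matches the user's own topics
    return sorted(scores, key=scores.get, reverse=True)
```

Treating a quick stumble as a plain -1 is the crudest possible choice; a softer negative weight would be closer to the "soft not-for-me" idea.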
We can summarize the following points:
1. A recommendation system that can surprise users must capture enough detail about user behavior. A third-party recommendation system based on Google Reader is clearly starved of data: you cannot know which articles a user deliberately ignored, it is hard to obtain the friend list, and Google does not provide dislike/hide buttons as friendfeed does. All you know is when a user shared or liked an article and where it came from. (One detail worth noting: if a user subscribes to a blog, say "egg", and shares an article from it, "egg" is clearly more important to that user; if the user only shares an "egg" article picked up from someone else's shared items, without subscribing to "egg", it may not matter much to them. This detail is a bit like "quick stumbles".)
2. A recommendation system that can surprise users must handle massive data from a massive user base. Around February or March this year, StumbleUpon already had more than 7 million users. It is estimated to process more than 10 million votes and add at least 30 thousand newly recommended items every day. Google Reader still has too few Chinese users, their behavior is too concentrated, and shared items alone yield too few new articles.
Both of these limit a third party's ability to discover "surprises".
At present, it seems that only Twitter can provide all kinds of user behavior details and massive data without reservation.
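Even with data as sparse as Google Reader's, the subscription detail mentioned in point 1 above can be turned into a per-source weight. The sketch below is one hypothetical way to do it; the function name and the 1.0 / 0.3 weights are assumptions, and it presumes you can tell which sources a user subscribes to.

```python
from collections import Counter

def source_weights(shares, subscriptions, direct=1.0, indirect=0.3):
    """Estimate how much each source matters to one user.

    `shares` is a list of source ids, one per article the user shared;
    `subscriptions` is the set of source ids the user subscribes to.
    A share from a subscribed source counts `direct`; a share picked up
    via someone else's shared items (source not subscribed) counts `indirect`.
    """
    weights = Counter()
    for source in shares:
        weights[source] += direct if source in subscriptions else indirect
    return dict(weights)
```

For a user who subscribes to "egg" and shares two of its articles plus one article from an unsubscribed source, "egg" ends up with a much larger weight than the other source.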
Zheng @ playpoly Sr Beijing Report
Refer to my articles on similar topics:
1. How to measure the sharing activity of Google Reader users 20090918;
2. what's popular's cross-validation model 20090919.