If the past decade was the decade of search technology, personalized recommendation may well be one of the most important innovations of the next. Almost all large e-commerce systems, such as Amazon, CDNOW, and Netflix, already use recommender systems in one form or another. More recently, sites built around "discovery" have begun to emerge on the Internet, such as Eight Treasures Box (8box), which focuses on music recommendation, and Douban, which focuses on book recommendation. So what goals does a good recommender system need to meet?
A personalized recommender system must provide relevant, accurate recommendations based on the user's past tastes and preferences, and collecting those tastes and preferences should demand as little effort from the user as possible. Recommendations must be computed in real time, so that the user sees them before leaving the site and can give timely feedback on them. This real-time requirement is also what most clearly separates recommender systems from conventional data mining techniques.

A complete recommender system consists of three parts: a behavior recording module, a model analysis module, and a recommendation module. The behavior recording module records behaviors that reflect user preferences, such as buying, downloading, and rating. This part looks simple but needs careful design. Buying and rating, for example, signal different degrees of preference, and the module must be able to combine many kinds of behavior and handle how they accumulate. The model analysis module analyzes the recorded behavior and uses various algorithms to build a model that describes the user's preferences. Finally, the recommendation module filters the content set in real time for items the target user may be interested in and recommends them to the user.

So besides the recommender system itself, a content set to recommend from is required. For a music recommender system, the music library is that content set. Very little information about the content set itself is needed: in the classic collaborative filtering algorithm, an ID for each item is enough. Content-based recommender systems require more domain knowledge and more content attributes, because they usually have to extract features from the content and index it. In the music example, attributes such as artist and genre, and the audio information itself, become required content set information.
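To make the behavior recording module described above more concrete, here is a minimal Python sketch of how different behaviors might be folded into a single preference score per user and item. The weights, the score cap, the event format, and the rule that an explicit rating overrides implicit evidence are all assumptions made for illustration; the article does not prescribe any of them.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each implicit behavior is assumed
# to signal preference. Buying is assumed to count more than downloading.
BEHAVIOR_WEIGHTS = {"download": 1.0, "buy": 3.0}
MAX_SCORE = 5.0  # assumed upper bound for an accumulated score


def build_preferences(events):
    """Fold (user, item, behavior, value) events into user -> item -> score."""
    prefs = defaultdict(dict)
    for user, item, behavior, value in events:
        current = prefs[user].get(item, 0.0)
        if behavior == "rate":
            # Assumption: an explicit rating replaces the implicit evidence.
            score = float(value)
        else:
            # Implicit behaviors accumulate, so repeated actions raise the score.
            score = current + BEHAVIOR_WEIGHTS[behavior]
        prefs[user][item] = min(score, MAX_SCORE)
    return {user: dict(items) for user, items in prefs.items()}


events = [
    ("zhang_san", "song_1", "download", None),
    ("zhang_san", "song_1", "buy", None),
    ("zhang_san", "song_2", "rate", 4),
    ("harry", "song_1", "buy", None),
    ("harry", "song_1", "buy", None),  # second purchase hits the cap
]
preferences = build_preferences(events)
```

The resulting nested dictionary is one possible in-memory form of the user-content data that the model analysis module would consume.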
Collaborative filtering is by far the most successful technique in personalized recommender systems, and many large websites already use it to recommend content to their users more intelligently. Anyone studying collaborative filtering should not miss MovieLens (http://movielens.umn.edu/), one of the best-known research projects in the field.

The first generation of collaborative filtering is known as user-based collaborative filtering. Its basic principle is to exploit the correlation between users' behavioral choices. "Behavioral choices" here means downloads, purchases, ratings, and any other behavior that explicitly or implicitly reflects user preference. In a typical recommender system based on collaborative filtering, the input data can usually be expressed as an m × n user-content matrix R, where m is the number of users and n is the number of content items. The meaning of each entry depends on the type of content and is usually determined by the behavior recording module. If the content is books in an online bookstore, an entry can indicate whether the user bought the book (1 for purchased, 0 for not purchased), or how highly the user rated it, in which case it takes one of several levels, as in the familiar multi-level rating schemes.

User-based collaborative filtering compares the target user's behavioral choices with those of other users to identify a group of users with similar preferences, sometimes called "like-minded" users. Once the system has identified a user's like-minded users, it can recommend the content they are most interested in to that user. In other words, users whose past behavior resembled yours are likely to resemble you in the future as well, so the system uses them as a benchmark when recommending content to you.

The core problem of collaborative filtering is finding the group of users whose interests resemble the target user's. Such users are usually called nearest neighbors. The similarity between users is obtained by comparing the two users' behavior-choice vectors. There are many ways to calculate this similarity; the classic ones include the Pearson correlation coefficient and cosine similarity.

Once the nearest neighbors have been found, the system can compute the set of content the user is most likely to be interested in (also called the top-N recommendation set). To obtain it, the system tallies the nearest neighbors' interest in each content item and takes the highest-ranked items as the recommendation set. A simplified example: user Zhang San has two like-minded users, John Doe and Harry. Zhang San likes movie A; John Doe likes movies A, B, C, and D; Harry likes movies A, B, D, E, and F. The recommender system can then pick out movies B and D, which both similar users like, and recommend them as the movies Zhang San is most likely to enjoy.

User-based collaborative filtering has been very successful in personalized recommender systems, but it has its own limitations. Because of the way the recommendation set is generated, a content item can be recommended to other users only after some users have selected (purchased) it. In an online bookstore, a new book has not yet been bought or reviewed by a meaningful number of users, so it has little chance of being filtered into the recommendation set through a user's nearest neighbors.
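The Zhang San example can be sketched in a few lines of Python. This is an illustration only: it assumes binary "like" data stored as sets, uses cosine similarity (the Pearson correlation coefficient would be the more natural choice for graded ratings), and counts one vote per neighbor. The function names and the parameters `k` and `n` are hypothetical.

```python
import math
from collections import Counter

# Movies each user likes, taken from the example in the text.
likes = {
    "Zhang San": {"A"},
    "John Doe": {"A", "B", "C", "D"},
    "Harry": {"A", "B", "D", "E", "F"},
}


def cosine(a, b):
    """Cosine similarity between two binary like-vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def nearest_neighbors(target, likes, k):
    """Return up to k other users, most similar first, skipping zero similarity."""
    scored = [
        (cosine(likes[target], items), user)
        for user, items in likes.items()
        if user != target
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [user for sim, user in scored[:k] if sim > 0]


def top_n(target, likes, k=2, n=2):
    """Tally neighbors' likes that the target has not chosen; keep the top n."""
    votes = Counter()
    for neighbor in nearest_neighbors(target, likes, k):
        for item in likes[neighbor] - likes[target]:
            votes[item] += 1
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:n]]


print(top_n("Zhang San", likes))  # ['B', 'D']
```

Movies B and D each receive two votes, one from each neighbor, while C, E, and F receive one, which reproduces the recommendation described in the text.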
The difficulty of recommending new content that almost no one has selected yet is known as the "cold start" problem of collaborative filtering. In addition, because user similarity is calculated by comparing the target user's history with the history of every other user, scalability becomes a very serious problem for a real-world recommender system. For a website with millions of users, computing recommendations for each user involves millions of comparisons, not to mention the overhead of the many database I/O operations.

This led to the second generation of collaborative filtering, item-based collaborative filtering. Unlike the user-based technique, this approach compares content items with one another. The item-based method also needs three steps to produce a recommendation: 1) obtain the historical rating data for the content items; 2) calculate the similarity between content items and find the nearest neighbors of the target item; 3) generate the recommendation. The similarity between two content items is obtained by comparing the users' behavior-choice vectors for those two items. For example, suppose the users and content items are as follows:
| User | Movie A | Movie B | Movie C | Movie D |
|---|---|---|---|---|
| Zhang San | Like | | | |
| John Doe | Like | Like | Like | Like |
| Harry | Don't like | | Don't like | Don't like |
| Zhao Liu | Like | Like | | Like |
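As a rough check on this table, the following Python sketch computes cosine similarity between the movie columns. The numeric encoding (+1 for "Like", -1 for "Don't like", 0 for no record) is an assumption made for illustration, as are the function names; the first user in the table is the one the text calls Zhang San.

```python
import math

# Assumed encoding: +1 = like, -1 = don't like, missing = no record (0).
ratings = {
    "Zhang San": {"A": 1},
    "John Doe": {"A": 1, "B": 1, "C": 1, "D": 1},
    "Harry": {"A": -1, "C": -1, "D": -1},
    "Zhao Liu": {"A": 1, "B": 1, "D": 1},
}


def item_vector(item, ratings, users):
    """Column of the user-content matrix for one item."""
    return [ratings[user].get(item, 0) for user in users]


def cosine(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def item_similarities(item, ratings):
    """Similarity of one item to every other item in the data."""
    users = sorted(ratings)
    items = sorted({i for row in ratings.values() for i in row})
    target = item_vector(item, ratings, users)
    return {
        other: cosine(target, item_vector(other, ratings, users))
        for other in items
        if other != item
    }


sims = item_similarities("A", ratings)
best = max(sims, key=sims.get)
print(best)  # D
```

Under this encoding, movie A scores about 0.87 against movie D and about 0.71 against movies B and C, so D is the nearest neighbor of A.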
In this table, movies A and D are the most similar. Because Zhang San likes movie A, movie D can be recommended to Zhang San.

Compared with user-based recommender systems, the biggest improvement of item-based systems is scalability, because they calculate similarities between content items rather than between users. For a typical Internet application, the number of content items is relatively stable. A large online bookstore, for example, may sell up to a few hundred thousand books, while its users may number in the millions. Computing similarity between content items therefore takes far less work than computing it between users, which greatly reduces the amount of online computation and improves system performance. The most successful application of item-based recommendation is Amazon, which also applied for a patent titled "Collaborative recommendations using item-to-item similarity mappings" [1].

Of course, in reducing the amount of computation, a purely item-based technique also gives up a little recommendation accuracy. In most cases, user-based recommendation is slightly more accurate than the item-based method, because the item-based approach ignores the group characteristics shared by similar users.

Both the first-generation user-based method and the second-generation item-based method inevitably run into the problem of sparse data. On any site, a user's ratings or purchases cover only a small part of the whole content set. So in many recommender systems the data available for each user is quite limited; in some large systems such as Amazon, a user has rated at most 1% of the millions of books, which makes the rating data very sparse.
When users have no content items in common, it is hard to judge whether their tastes are similar and hard to find a set of similar users, so recommendation quality drops sharply. The most convenient way to address the sparsity of user data is to set the user's rating for every unselected content item to a fixed default value, such as the user's average rating. Many methods for predicting missing ratings have been proposed, but even this simplest one can generally improve the accuracy of a collaborative filtering recommender system.

On the other hand, even with the item-based approach, computational complexity is still a performance bottleneck when the volume of data is huge. To further improve the scalability of collaborative filtering, a more effective approach at present is to run a cluster analysis (clustering) on the user rating data. Clustering first assigns users with similar interests to the same cluster. Once the clusters exist, the system can either restrict the nearest-neighbor search to the cluster closest to the target user and predict the target user's ratings from the ratings of the other users in that cluster, or use the cluster center as an approximation from which to extract recommendations. Because cluster membership changes relatively little, clustering can usually be done offline rather than in real time, which greatly reduces the real-time computation load and speeds up the recommender system. Roughly speaking, dividing the users into k clusters speeds up the system by about a factor of k. The choice of clustering algorithm depends on the application domain and the distribution of the data; a poorly chosen algorithm will reduce recommendation accuracy.
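The clustering idea above can be sketched as follows. This is a toy illustration under several assumptions: the rating vectors are already dense (for example after filling gaps with a default value), the algorithm is plain k-means, and the initial centroids are picked by hand so that the run is deterministic. The data and the function names are hypothetical.

```python
import math

# Hypothetical dense rating vectors over four items, one per user.
users = {
    "u1": [5, 4, 1, 1],
    "u2": [4, 5, 1, 2],
    "u3": [1, 1, 5, 4],
    "u4": [2, 1, 4, 5],
    "u5": [5, 5, 2, 1],
    "u6": [1, 2, 5, 5],
}


def distance(x, y):
    """Euclidean distance between two rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(vectors, centroids, iterations=10):
    """Plain k-means; returns (cluster index per vector, final centroids)."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assign every vector to its closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def neighbor_candidates(target, names, assignment):
    """Restrict the nearest-neighbor search to the target's own cluster."""
    cluster = assignment[names.index(target)]
    return [n for n, a in zip(names, assignment) if a == cluster and n != target]


names = list(users)
vectors = [users[n] for n in names]
# Offline step: cluster once, seeding with two hand-picked users (an assumption).
assignment, centroids = kmeans(vectors, [vectors[0], vectors[2]])
# Online step: only users in the same cluster are compared with the target.
print(neighbor_candidates("u1", names, assignment))  # ['u2', 'u5']
```

With two clusters, the online search for `u1` compares against two users instead of five, which is the kind of saving the text describes; on real data the clusters would rarely be this cleanly separated.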
In recent years, recommendation algorithms have developed in some new directions, such as Slope One and SVD, which are not covered here. A particularly critical issue for a commercial recommender system is the processing of massive amounts of user data. A recommender system is data-driven: the more data accumulates, the more accurate the recommendations. But when user behavior data really grows to millions or even hundreds of billions of records, producing effective recommendations within a reasonable time becomes the toughest test of recommendation technology.

In addition, an excellent recommender system needs to combine content similarity with user-behavior similarity. Traditional collaborative filtering ignores the attributes of the content itself. The advantage is that it requires little data, but it also brings the unavoidable cold start problem. With tagging systems now widely used on the Internet, tags themselves are good content attributes, and how to use them is worth discussing. Making full use of the attributes of the content itself and combining the different similarities will give new impetus to recommendation technology based on collaborative filtering.

Finally, a well-designed recommender system should adapt to and learn from users' feedback on the recommended content, because users in fact want different things from recommendations: some prefer popular content, while others would rather discover obscure content. Learning each user's characteristics from feedback makes it possible to correct the algorithm's inherent bias and achieve better results.
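One possible way to picture the combination of content similarity and behavior similarity mentioned above is a weighted blend of a tag-based score and a behavior-based score. The sketch below is an assumption-laden illustration, not a method the article specifies: it uses Jaccard similarity on tag sets, cosine similarity on the sets of users who chose each item, and a hypothetical mixing weight `alpha`.

```python
import math


def jaccard(a, b):
    """Tag overlap: shared tags divided by all tags on either item."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Behavior overlap between the sets of users who chose each item."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def hybrid_similarity(x, y, tags, chosen_by, alpha=0.5):
    """Blend content and behavior similarity; alpha is a hypothetical weight."""
    content = jaccard(tags.get(x, set()), tags.get(y, set()))
    behavior = cosine(chosen_by.get(x, set()), chosen_by.get(y, set()))
    return alpha * content + (1 - alpha) * behavior


# Hypothetical data: "new_song" has tags but no user behavior yet.
tags = {
    "old_song": {"rock", "90s"},
    "new_song": {"rock", "90s", "live"},
    "other_song": {"jazz"},
}
chosen_by = {
    "old_song": {"u1", "u2"},
    "other_song": {"u3"},
}

score_new = hybrid_similarity("old_song", "new_song", tags, chosen_by)
score_other = hybrid_similarity("old_song", "other_song", tags, chosen_by)
```

Here `new_song` has no behavior data at all, so pure collaborative filtering would score it at zero, yet its shared tags still give it a nonzero similarity to `old_song`. That is one way tags could soften the cold start problem.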
This article was provided by a co-founder of the music site Eight Treasures Box (http://www.8box.cn/).
Discussion on personalized recommendation