The following is a reading note, in fact mostly excerpts. This doctoral dissertation is logically organized and deepens layer by layer, so I kept quite a lot of it.
See chapter two in particular. Honestly, for me this article was mostly popular-science-level material, I suppose...
I. Source of the paper
Personalized Web recommendation via collaborative Filtering (strange that "via" is lowercase; just noting it for now)
PhD candidate: Sun Hui
Advisor: Junliang (Academician)
Degree applied for: PhD in Engineering (Doctor of Philosophy)
II. Abstract
A new similarity algorithm (JacUOD) is proposed, which takes into account differences in vector length and in the dimensionality of different vector spaces.
A normalized-reduction CF algorithm is proposed. Experiments show that the method has high prediction accuracy.
Finally, for personalized open-interface recommendation in the cloud environment, the paper proposes a user-clustering collaborative filtering method. The approach is inspired by the idea that measuring the similarity of open interfaces should be based on users with similar preferences. It first divides users into groups according to their preferences, then performs collaborative filtering within each user group. It is highly scalable because the groups can be processed in parallel, and it also has good predictive accuracy. Experiments on real interface datasets demonstrate the effectiveness of the method.
III. Introduction
On the World Wide Web, search engines evolved from directory indexes to full-text indexes. A directory index needs manual maintenance (it feels like the card-catalog method in a library), which cannot keep up once there are too many sites. With full-text indexing, a web crawler builds an inverted index and the user queries it by keyword. But when the user's need is vague or the keywords are poorly chosen, the search results are unsatisfactory, and so the recommender system was born.
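My note: a minimal sketch of the inverted-index idea (the documents and code are my own, purely for illustration):

    # Minimal inverted-index sketch: map each keyword to the set of
    # documents containing it, then answer a keyword query by
    # intersecting the posting sets. Documents here are made up.
    from collections import defaultdict

    docs = {
        1: "collaborative filtering for web recommendation",
        2: "full text search with an inverted index",
        3: "web search engines and recommendation systems",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    def search(*keywords):
        """Return ids of documents containing all the keywords."""
        sets = [index.get(w, set()) for w in keywords]
        return set.intersection(*sets) if sets else set()

    print(search("web", "recommendation"))  # {1, 3}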
In essence, recommender systems and search engines are both powerful tools that help users obtain effective information in the age of information explosion. But unlike a search engine, which requires users to state their needs explicitly, a recommender system models a user's interests by analyzing the user's historical data, thereby helping the user discover potential needs they may not even be aware of, and recommends relevant information accordingly.
According to the long tail theory, as long as there are enough channels for product storage and circulation, the combined market share of products in weak demand or with poor sales can rival or even exceed that of the few hot products; that is, many small markets converge into an energy that rivals the mainstream market.
My note: the long tail theory is a new theory that rose with the Internet age. Because of cost and efficiency factors, when the venues and channels for storing, circulating, and displaying goods become broad enough, production costs plummet to the point that individuals can produce, and selling costs drop sharply, so that almost any product that previously seemed to have low demand will find a buyer as long as it is on sale. The combined market share of these low-demand, low-volume products can be comparable to, or even greater than, that of mainstream products.
I have to vent about my awful Western Economics teacher, who lectured while staring at the ceiling. On the exam across the whole major, 2 people scored above 90, 4 above 80, 6 above 70, and 10 failed; I got 61. Damn it, that year my overall ranking fell just 7 points short, and it cost me the scholarship.
For example, when selling products, manufacturers focus on a few so-called "VIP" customers and "have no time" for the far more numerous ordinary consumers. In the Internet age, because the cost of attention drops dramatically, people can attend to the "tail" of the normal-distribution curve at very low cost, and the aggregate benefit of attending to the "tail" may even exceed that of the "head".
The 80/20 rule focuses on the red head of the figure (the view being that these products account for only a small share of all products but contribute most of the sales volume), so one should keep only these products and discard the rest. The long tail theory focuses on the yellow long tail: these products can accumulate into a market share large enough to rival, or even exceed, that of the red-head products.

The high sales volume of red-head products shows that they cater to the common needs of the vast majority of customers, while the low sales of each long-tail product show that it meets the individual needs of only a small number of customers. If costs can be kept under reasonable control, selling the long-tail products yields a profit comparable to the red head or even greater. A product's profit is proportional to its sales volume, and because of storage and circulation costs, sales below a certain threshold cause losses, so physical stores tend to sell only red-head products. In the Internet age, the maintenance cost of an online store on a shopping site is far lower than that of a physical store, and products can be added at nearly zero cost, so the huge market share of the yellow long-tail products can ultimately be converted into huge profit; online shops therefore put the long-tail products on their pages for sale.

We call the products in the long-tail portion "long-tail products", and the profit from their sales the "long-tail profit". To realize long-tail profit, simply putting long-tail products on the page is obviously not enough: long-tail products are extremely numerous, and any particular one interests only a very few users, so it is almost impossible for a user to find, and users are unwilling to actively search for, their few target products among the vast sea of long-tail items.
An early piece of work in the recommender-systems field was the movie recommender system developed by a research team at the University of Minnesota in the United States. The system first lets users rate movies, then uses the rating information to model each user's interests, and on that basis recommends movies the user may be interested in but has not yet seen. The e-commerce company Amazon took the lead in commercial application of recommender systems, analyzing users' browsing and purchasing behavior to predict which products a user might be interested in and recommending them, which increased Amazon's sales.
The advantage of collaborative filtering is its universality: it does not depend on the specific content of items, only on users' preferences.
In recent years there has been much research on QoS-aware service methods, such as service selection and service fault tolerance, for which acquiring service QoS data is a precondition. To obtain such data, services must be monitored from different physical locations; this is a distributed task, and the network environment differs across locations. Moreover, in some cases obtaining data by directly monitoring services is infeasible (for example, when there are many candidate services, monitoring them one by one is very expensive in time and resources, or invocation of the service is charged). It is therefore very attractive to obtain service QoS data by prediction rather than by real invocation. Service data prediction aims to use a small amount of available information (e.g., user information, service information, user characteristics, service characteristics) to make personalized QoS-value predictions for service users. The predicted data can be used for service selection and recommendation, service composition, fault tolerance for service-oriented systems, and so on. Studying how to predict service QoS values as accurately as possible is therefore a very necessary problem.
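My note: a sketch of what such prediction looks like in the memory-based CF style: estimate a user's unobserved QoS value as a similarity-weighted average over users who have observed it. The data and the choice of Pearson similarity are my own toy assumptions, not the dissertation's method:

    # User-based CF prediction of a missing QoS value.
    import numpy as np

    # rows = users, columns = services; np.nan = not yet invoked/observed
    Q = np.array([[0.8, 0.9, np.nan],
                  [0.7, 0.8, 0.6],
                  [0.9, 1.0, 0.7]])

    def predict(Q, u, s):
        mask = ~np.isnan(Q[:, s])            # users who observed service s
        num, den = 0.0, 0.0
        for v in np.where(mask)[0]:
            common = ~np.isnan(Q[u]) & ~np.isnan(Q[v])
            if common.sum() < 2:
                continue
            sim = np.corrcoef(Q[u, common], Q[v, common])[0, 1]
            num += sim * Q[v, s]
            den += abs(sim)
        return num / den if den else np.nan

    print(predict(Q, 0, 2))  # estimate user 0's unobserved QoS for service 2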
Facing so many functionally equivalent open interfaces and such a huge third-party user base, how can we find a suitable open interface for each user? To solve this problem, personalized open-interface recommendation in the cloud environment has become a necessary research topic.
Based on the above analysis, to advance personalized recommendation we need more effective mechanisms for rating-based product recommendation, QoS-aware service recommendation, and open-interface recommendation in the cloud environment. This paper presents one method for each problem. The first achieves more accurate rating-based personalized product recommendation through a new similarity algorithm for collaborative filtering. The second achieves more targeted personalized service recommendation through a new collaborative filtering method. The third achieves effective personalized open-interface recommendation through user-clustering collaborative filtering. These three methods constitute the contribution of this paper.
To realize personalized open-interface recommendation in the cloud environment, we designed user-clustering collaborative filtering. The approach is inspired by this idea: measuring the similarity of open interfaces should be based on users with similar preferences. The method first divides users into groups according to their preferences, then runs item-based collaborative filtering within each group, as sketched below. The approach is highly scalable because the groups can be processed in parallel, which is attractive for cloud environments with a large number of users.
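My note: a rough sketch of this idea under my own assumptions (k-means on rating vectors as the clustering step, cosine as the item similarity; the dissertation's exact procedure may differ):

    # User-clustering CF sketch: group users by preference, then compute
    # item-item similarities independently inside each group. Groups are
    # independent, so this loop could be a parallel map in the cloud.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    R = rng.integers(0, 6, size=(100, 20)).astype(float)  # toy ratings, 0 = unrated

    groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(R)

    def item_similarity(Rg):
        """Cosine similarity between item columns within one user group."""
        norms = np.linalg.norm(Rg, axis=0) + 1e-9
        return (Rg.T @ Rg) / np.outer(norms, norms)

    sims = {g: item_similarity(R[groups == g]) for g in np.unique(groups)}
    print({g: s.shape for g, s in sims.items()})  # 4 groups, each 20x20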
IV. Background Review
4.1 Recommender Systems
In 1997, recommender systems were divided into three types: content-based recommendation, collaborative filtering, and hybrid recommendation combining content and collaborative filtering. Later, as technology developed, especially social networks, recommendation based on social relationships emerged.
User preferences can be obtained either explicitly (such as questionnaires) or implicitly (such as analyzing a user's historical behavior).
Content-based recommendation is rooted in research on information retrieval and information filtering. Many content-based recommender systems focus on recommending objects that contain textual information, such as news and Web pages, from whose content a set of features is extracted for recommendation. Content-based systems usually recommend text-based items, which are typically represented by a set of keywords. The TF-IDF algorithm is described below.
Besides heuristic methods from information retrieval, content-based recommendation also employs a number of machine-learning techniques, such as Bayesian classifiers, clustering, decision trees, and artificial neural networks. These differ from information-retrieval techniques in that their prediction of the utility function is based not on heuristics such as cosine similarity but on a model learned from the data by statistical or machine learning. For example, given a labeled collection of Web pages in which the user has marked each page as relevant or irrelevant, a naive Bayes classifier can classify pages that have never been labeled into the relevant or irrelevant category, as sketched below.
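My note: a small sketch of that naive-Bayes example; the pages, labels, and library choice are my own inventions:

    # Naive Bayes relevant/irrelevant page classification sketch.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    pages = [
        "python machine learning tutorial",
        "deep learning for recommender systems",
        "celebrity gossip and fashion news",
        "sports scores and match highlights",
    ]
    labels = [1, 1, 0, 0]  # 1 = relevant to this user, 0 = irrelevant

    vec = CountVectorizer()
    X = vec.fit_transform(pages)
    clf = MultinomialNB().fit(X, labels)

    new_page = ["a tutorial on learning to rank for recommender systems"]
    print(clf.predict(vec.transform(new_page)))  # likely [1]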
The assumption behind collaborative filtering is that users who give similar evaluations to some items will also give similar evaluations to other items. It mirrors the real-life practice of asking like-minded friends for recommendations, and it implies that people with similar interests in some things share interests in others.
Collaborative filtering is a popular and effective recommendation method; one of its outstanding advantages is the ability to handle unstructured, complex objects.
1. TF-IDF
For example, in a system that recommends Web pages to users, the content-based component uses the most important keywords to represent each page. The importance of a keyword can be expressed by a weight, and the best-known weighting method is term frequency - inverse document frequency (TF-IDF). Suppose there are N documents that can be recommended to the user, and keyword k_i appears f_(i,j) times in document d_j; then the term frequency of k_i in d_j is defined as:

    TF_(i,j) = f_(i,j) / max_z f_(z,j)

where the maximum is taken over the counts f_(z,j) of all keywords k_z appearing in document d_j.
My note: the TF of keyword i is the number of occurrences of keyword i divided by the number of occurrences of the most frequent keyword in the same document.
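My note: the definitions above translated directly into code (the documents are made up; the IDF and final weight follow the standard TF-IDF completion, log(N/n_i) times TF):

    # TF-IDF sketch: TF is the keyword count divided by the count of the
    # most frequent keyword in the same document; IDF is log(N / number
    # of documents containing the keyword); the weight is their product.
    import math
    from collections import Counter

    docs = [
        "web recommendation recommendation system",
        "collaborative filtering for the web",
        "search engine index",
    ]
    tokenized = [d.split() for d in docs]
    N = len(docs)

    def tf_idf(word, doc_tokens):
        counts = Counter(doc_tokens)
        tf = counts[word] / max(counts.values())
        n_i = sum(1 for t in tokenized if word in t)  # docs containing the word
        idf = math.log(N / n_i) if n_i else 0.0
        return tf * idf

    print(tf_idf("recommendation", tokenized[0]))  # 1.0 * log(3)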
2. Similarity algorithm
The best result is obtained by multiplying the Jaccard coefficient by the Pearson correlation coefficient: JacPCC outperforms JacCOS, COS, and PCC.
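My note: a sketch of that combination. Treating 0 as "unrated" and computing Pearson only over co-rated entries are my assumptions about the exact handling:

    # JacPCC sketch: Jaccard coefficient (overlap of rated items)
    # multiplied by the Pearson correlation over co-rated items.
    import numpy as np

    def jac_pcc(u, v):
        rated_u, rated_v = u > 0, v > 0
        common = rated_u & rated_v
        jaccard = common.sum() / (rated_u | rated_v).sum()
        a, b = u[common], v[common]
        if len(a) < 2 or a.std() == 0 or b.std() == 0:
            return 0.0
        pcc = np.corrcoef(a, b)[0, 1]
        return jaccard * pcc

    u = np.array([5, 3, 0, 4, 0])
    v = np.array([4, 2, 1, 5, 0])
    print(jac_pcc(u, v))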
Ahn et al. proposed the PIP similarity measure to alleviate the cold-start problem in collaborative filtering. The cold-start problem refers to the situation where a user or item newly joins the recommender system with extremely little or no rating data, so the user gets no recommendations or poor ones. According to Ahn et al.'s experimental results, this measure works significantly better in cold-start situations, but in ordinary (non-cold-start) settings it is not as good as PCC.
For collaborative filtering recommender systems, the similarity algorithm that determines the degree of correlation between users (or between products) is essential, and constructing a reliable and feasible similarity algorithm is a very important research point in the field. We analyzed the traditional similarity algorithms and found their drawbacks: they ignore differences in the lengths of user vectors (or product vectors) and in the number of co-rated products (or co-rating users). To overcome these shortcomings, we propose a new similarity algorithm, the Jaccard Uniform Operator Distance (JacUOD) similarity, which is based on Euclidean distance and accounts for measuring similarity across vector spaces of different dimensionality. Our approach takes into account differences in vector length, in the number of co-rated products, and in the number of rated products. Experimental results show that our method has better prediction accuracy than the traditional methods.
I note: I have seen a modified cosine where each value first has the average subtracted before applying cosine; others say it subtracts the median (the midpoint of the maximum and minimum, i.e., the theoretical center: for a 1-to-5 rating scale, subtract 3). The purpose of adjusted cosine is to fix the fact that plain cosine similarity only considers the direction of the vectors and ignores differences between dimensions, so each dimension is corrected by subtracting a mean before computing the similarity. Plain cosine also ignores the user rating-scale problem: for a rating range of [1, 5], user A may consider anything above 3 a favorite, while user B considers only ratings above 4 favorites. By subtracting the user's average rating from each item rating, adjusted cosine mitigates this problem (as the earlier text says, some users tend toward harsh reviews and give only 4 points even to something very good).
AdjCos and PCC are not the same; the difference lies in how they center the ratings. In the standard item-based formulations, PCC subtracts each item's mean rating, while adjusted cosine subtracts each user's mean rating.
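My note: the centering difference in code form, using the standard textbook versions; this may not match the dissertation's exact formulas:

    # Item-item similarity under two centerings. PCC subtracts each
    # item's mean rating; adjusted cosine subtracts each user's mean.
    import numpy as np

    R = np.array([[5.0, 3.0, 4.0],
                  [4.0, 2.0, 5.0],
                  [2.0, 5.0, 1.0]])  # rows = users, columns = items

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    i, j = 0, 2  # compare items 0 and 2

    # Pearson: center each item column by the item's mean.
    pcc = cos(R[:, i] - R[:, i].mean(), R[:, j] - R[:, j].mean())

    # Adjusted cosine: center each rating by its user's (row) mean.
    Rc = R - R.mean(axis=1, keepdims=True)
    adj = cos(Rc[:, i], Rc[:, j])

    print(pcc, adj)  # the two centerings give different values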
The dissertation then lists a series of counterexamples pointing out the shortcomings of the various similarity measures, and proposes a new method.
I am curious how these special cases were found, or how they could be generated. Also, some similarity calculations I have seen exclude unrated (0) entries, while this one does not, which would change the results.
V. The Proposed Algorithm: JacUOD (Jaccard Uniform Operator Distance Similarity)
My doubts about this are noted inline in the content above.
Similarly, product (item) similarity can be defined in the same way; a hedged sketch follows.
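My note: my notes do not reproduce the dissertation's exact JacUOD formula, so the sketch below is only my reconstruction of the description: a Jaccard factor multiplied by a similarity derived from a Euclidean distance normalized by the number of co-rated dimensions (my reading of the "uniform operator"). Every detail should be treated as an assumption:

    # JacUOD-style sketch (my reconstruction, not the dissertation's
    # exact formula): Jaccard coefficient times a similarity from a
    # dimension-normalized Euclidean distance over co-rated entries,
    # so vector pairs of different dimensionality become comparable.
    import numpy as np

    def jac_uod(u, v):
        rated_u, rated_v = u > 0, v > 0
        common = rated_u & rated_v
        n = common.sum()
        if n == 0:
            return 0.0
        jaccard = n / (rated_u | rated_v).sum()
        # Normalize squared distance by dimension n, then map to similarity.
        d = np.sqrt(((u[common] - v[common]) ** 2).sum() / n)
        return jaccard / (1.0 + d)

    # Works for user rows or item columns alike.
    u = np.array([5, 3, 0, 4, 0])
    v = np.array([4, 2, 1, 5, 0])
    print(jac_uod(u, v))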
Validation uses 10-fold cross-validation. The data is split into two datasets, data1 and data2 (my own naming), and then the user-based and item-based algorithms are compared on each dataset by their performance (MAE) under different neighbor counts K. The outline is as follows:
1. user-based
1.1 data1, different K
1.2 data2, different K
2. item-based
2.1 data1, different K
2.2 data2, different K
The K values differ between datasets, but for experiments 1 and 2 the K values are always the same for the same dataset: 1.1 and 2.1 use the same K values, while 1.1's K values are not the same as 1.2's.
Because 10-fold cross-validation is used, each configuration yields 10 rounds of testing.
I plan to verify it this way myself first, and then see how the original author analyzed the experimental results.
Then verify the effects of the normalization factor and the Jaccard factor (that is, compare JacUOD with UOD to verify the Jaccard factor, and UOD with ED (Euclidean distance) to verify the normalization factor); the experimental steps are the same as above.
Then verify the influence of K. The result is a hook-shaped (elbow) curve, and the author takes the K at the elbow as a good value to use directly in applications. A paper cannot be written that simply, so check the original author's conclusion.
Then verify the effect of the function that avoids a zero divisor.
For this part the author did not run experiments, or at least did not write them up; whether out of laziness or unwillingness, draw your own conclusion.
1. Verification with 10-fold cross-validation
Its English name is 10-fold cross-validation; it is a common method for testing algorithm accuracy. The data set is divided into 10 parts; in turn, 9 of them are used as training data and 1 as test data. Each round of testing yields an accuracy (or error rate), and the average of the 10 accuracies (or error rates) serves as the estimate of the algorithm's precision. It is usually necessary to perform multiple rounds of 10-fold cross-validation (e.g., ten 10-fold cross-validations) and take the mean as the estimate of the algorithm's accuracy.
The choice of dividing the dataset into 10 parts comes from extensive experiments with many datasets and different learning techniques, which suggest that 10 folds is the right choice for obtaining the best error estimate, and there is some theoretical evidence for this. But this is not the final verdict and the controversy persists; results with 5 folds or 20 folds appear comparable to those with 10.
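My note: a sketch of how this 10-fold protocol is typically wired up for a rating predictor; the predictor here is just a global-mean baseline, purely a placeholder for a real CF model:

    # 10-fold cross-validation sketch: split the ratings into 10 folds,
    # train on 9, test on 1, and average the per-fold MAE.
    import numpy as np
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 6, size=1000).astype(float)  # toy ratings

    maes = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(ratings):
        pred = ratings[train_idx].mean()                      # "train": global mean
        maes.append(np.abs(ratings[test_idx] - pred).mean())  # fold MAE

    print(np.mean(maes))  # accuracy estimate averaged over the 10 folds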
VI. Normalized-Reduction Collaborative Filtering for QoS-Aware Service Recommendation
This belongs to services computing, which I do not research, so it is skipped.
VII. User-Clustering Collaborative Filtering for Open-Interface Recommendation in the Cloud Environment
This also involves services computing; skipped.