Recommender systems: content-based recommendation

Source: Internet
Author: User

Notes on "Recommender Systems: An Introduction", Chapter 3, content-based recommendation.


Overview


If collaborative filtering can be summarized as "recommend items that similar users liked", content-based recommendation can be summarized as "recommend items similar to those the user liked in the past". The task of the recommender system is therefore to predict, from the user's record, whether the user will like items they have not yet seen.

Content-based recommendation must rely on additional information about items and user preferences, but it does not require a large user base or rating history. A recommendation list can be generated even when there is only a single user.

In practice, acquiring item features manually is costly.

In text document recommendation, such as for news or Web pages, the basic assumption of most approaches is that item features can be extracted automatically from the document content itself or from an unstructured text description. A typical content-based recommender therefore recommends new articles by comparing the main keywords of candidate articles with the keywords of other articles the user has rated highly in the past. Accordingly, the recommendable items are often referred to as "documents".

There is no clear boundary between content-based and knowledge-based recommender systems, and some authors even regard content-based approaches as a subset of knowledge-based ones. In the traditional classification, content-based recommender systems are characterized by their use of item descriptions, whereas knowledge-based recommender systems draw on additional means-end knowledge, such as a utility function, to generate recommendations.

This section discusses content-based recommendation, which focuses on recommending items that have text descriptions and on automatically "learning" a user profile (knowledge-based recommender systems, by contrast, usually ask the user for preferences explicitly).


Content representation and similarity


The simplest approach


Item features: Maintain a detailed list of features for each item (also called the attribute set, feature set, or item profile). For books, for example, these could be genre, author, publisher, and so on.

Preference profile: Obtain a record of the user's preferences by asking the user, from user ratings, or by analyzing documents.

Recommendation then comes down to matching item features against the user's preferences.

Content-based recommendation generally works by evaluating how similar an item the user has not yet seen is to the items the user is already interested in. A typical similarity measure is the Dice coefficient, which suits multi-valued feature sets. In principle, any similarity measure is feasible, depending on the problem at hand.
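As a rough illustration of the Dice coefficient mentioned above, here is a minimal Python sketch. It treats an item and a user profile as plain keyword sets and computes 2|A ∩ B| / (|A| + |B|). The function name, the example keywords, and the convention of returning 0 for two empty sets are assumptions made for this example, not something taken from the book.

```python
def dice_coefficient(a, b):
    """Dice coefficient of two feature sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention assumed here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical keyword sets for a book and for a user's preference profile.
book = {"fantasy", "magic", "adventure"}
profile = {"fantasy", "adventure", "detective", "crime"}

score = dice_coefficient(book, profile)  # 2 * 2 / (3 + 4)
print(round(score, 3))
```

The score lies between 0 (no shared keywords) and 1 (identical sets), so unseen items can be ranked by it directly.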


Vector space model and TF-IDF


Strictly speaking, features such as publisher and author in the approach above are not the "content" of a book but additional knowledge about it. Historically, content-based recommender systems were developed to filter and recommend text-based items such as e-mail messages or news.

The standard approach in content-based recommendation is therefore not to maintain a list of "meta-information" features, but to use a list of relevant keywords that appear in the document.

The main idea is that such lists can be generated automatically from the document content or from its free-text description.

Keyword-list methods for representing document content:

(1) Maintain a list of keywords for each document and a similar list for the user profile. Documents can then be recommended according to how much their keywords overlap with the user's interests. The drawback is that the longer a document is, the greater the chance of word overlap, so the system tends to favor long documents.

(2) TF-IDF weighting. TF-IDF, short for term frequency-inverse document frequency, is an established technique from information retrieval. (Formulas are skipped here for now and will be added in the algorithms section.) In the TF-IDF model, a document is represented not as a Boolean vector with one entry per keyword, but as a vector of computed TF-IDF weights.
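The notes above skip the formula, so the following Python sketch only shows one common textbook variant and should not be read as the exact definition the author had in mind. It assumes that term frequency is normalized by the count of the most frequent term in the document and that IDF is log(N / n_i), where N is the number of documents and n_i the number of documents containing term i. The function names and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        vectors.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "recommender systems recommend items".split(),
    "content based recommender systems use keywords".split(),
    "the weather is sunny today".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]) > cosine_similarity(vecs[0], vecs[2]))
```

Because cosine similarity normalizes by vector length, it avoids the bias toward long documents that plain keyword overlap has. A term that occurs in every document gets weight 0 under this IDF definition.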


Improvements and limitations of the vector space model


(1) Stop-word removal and stemming

(2) Size cutoffs (keeping only the most informative words)

(3) Phrases

(4) Limitations: The context of a keyword is not taken into account, so in some cases the meaning of the description is not captured correctly.
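The first three improvements can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the stop-word list is a tiny made-up sample, phrases are approximated by adjacent word pairs, and stemming is left out because it normally needs a dedicated stemmer. All helper names are hypothetical.

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(tokens, stop_words=STOP_WORDS):
    """Lowercase the tokens and drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in stop_words]


def top_n_terms(weights, n):
    """Size cutoff: keep only the n highest-weighted terms of a vector."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def add_bigrams(tokens):
    """Phrases: append adjacent word pairs as additional terms."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


tokens = preprocess("The history of the United Nations".split())
print(add_bigrams(tokens))
```

Adding "united_nations" as its own term partly addresses limitation (4), since the pair carries meaning that the two separate words do not.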


Similarity-based retrieval


The most common techniques that rely on the vector space document representation model:


Nearest neighbors


k-nearest neighbors (kNN). Advantages: it is relatively easy to implement, adapts quickly to recent changes, and can produce reasonable recommendations from relatively few ratings. Disadvantage: pure kNN is less accurate than other, more complex techniques.
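A minimal kNN sketch in Python, assuming documents are already represented as sparse weight vectors (dicts) and that each rated document carries a "like" or "dislike" label. The majority vote, the choice of k = 3, and the sample vectors are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(target, rated_docs, k=3):
    """Predict a label for target by majority vote of its k most similar rated documents."""
    neighbors = sorted(
        rated_docs, key=lambda pair: cosine(target, pair[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical rated documents: (term weights, user's label).
rated = [
    ({"python": 1.0, "code": 0.5}, "like"),
    ({"python": 0.8, "tutorial": 0.6}, "like"),
    ({"football": 1.0, "league": 0.7}, "dislike"),
    ({"football": 0.5, "transfer": 0.9}, "dislike"),
]
print(knn_predict({"python": 0.9, "tutorial": 0.2}, rated))
```

There is no training step: adding a newly rated document to the list is enough, which is why the method adapts quickly to recent ratings.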


Relevance feedback: Rocchio's method


The idea originated with SMART, a groundbreaking information retrieval system of the late 1960s. A distinguishing feature of SMART is that users can not only submit keyword-based queries but also give feedback on whether the retrieved results are relevant. With this feedback, the system can automatically expand the query and improve the results of the next round of retrieval.
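The query refinement described above is usually written as Q' = alpha * Q + beta * (mean of relevant documents) - gamma * (mean of non-relevant documents). The Python sketch below follows that form. The default weights are illustrative values rather than numbers from these notes, clipping negative weights to zero is a common convention assumed here, and the function name and sample vectors are hypothetical.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    updated = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = (
            sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
            if non_relevant
            else 0.0
        )
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumed convention: drop terms whose weight is not positive
            updated[term] = weight
    return updated


# Hypothetical query and feedback documents as sparse term-weight vectors.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 0.5, "car": 0.8}]
non_relevant = [{"jaguar": 0.4, "animal": 0.9}]

new_query = rocchio_update(query, relevant, non_relevant)
print(sorted(new_query))
```

In this example the query gains the term "car" from the relevant document, while "animal" ends up with a negative weight and is dropped.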

A general understanding is enough here; with no practical use for it yet, I will not study it in depth for now.


Other text classification methods


Another way to decide whether a user is interested in a document is to treat the problem as a classification task with the classes "like" and "dislike". Once content-based recommendation is framed as a classification problem, a variety of standard (supervised) machine learning techniques can in principle be applied, so that an intelligent system decides automatically whether a user will be interested in a document. Supervised learning means that the algorithm relies on existing labeled training data.


Probabilistic methods


The best-known classification methods in early text classification systems are probabilistic. They are based on the naive Bayes assumption of conditional independence (with respect to the occurrence of words in a document) and have been deployed successfully in content-based recommender systems.
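To make the conditional independence assumption concrete, here is a small naive Bayes sketch in Python. It assumes a Boolean document representation (a word is either present or absent), add-one smoothing, and two classes; the training documents, labels, and function names are invented for the example.

```python
import math


def train_naive_bayes(docs, labels, vocabulary):
    """docs: list of word sets; labels: parallel list of class labels."""
    model = {}
    for label in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == label]
        prior = len(class_docs) / len(docs)
        # P(word present | class) with add-one (Laplace) smoothing.
        word_prob = {
            w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocabulary
        }
        model[label] = (prior, word_prob)
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior for a word set."""
    best_label, best_score = None, -math.inf
    for label, (prior, word_prob) in model.items():
        score = math.log(prior)
        for w, p in word_prob.items():
            # Independence assumption: per-word probabilities are multiplied.
            score += math.log(p if w in doc else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    {"recommender", "intelligent", "learning"},
    {"learning", "school"},
    {"recommender", "learning"},
    {"school"},
]
labels = ["like", "dislike", "like", "dislike"]
vocabulary = {"recommender", "intelligent", "learning", "school"}

model = train_naive_bayes(docs, labels, vocabulary)
print(predict(model, {"recommender", "learning"}))
```

Working in log space avoids numerical underflow when the vocabulary is large, and smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.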


Other linear classifiers and machine learning


(1) Widrow-Hoff algorithm

(2) Support vector machines (SVM)


Explicit decision models


Two other families of learning techniques that have been used to build content-based recommender systems are decision trees and rule induction. What sets them apart is that they produce an explicit decision model during the training phase.


Feature Selection


All of the techniques above rely on representing documents as vectors with TF-IDF weights. Even after stop-word removal and stemming, document vectors used directly are still very long and sparse (each document contains only a few of all the words), which leads to performance and memory problems and to overfitting.

It is therefore advisable to use only a subset of all the words in the document collection for classification. The process of choosing this subset of the available words is called feature selection.


Comparison and limitations


Comparison


In small-scale laboratory experiments, the proportion of correctly classified documents was used as the accuracy measure. The algorithms compared as follows:

Decision tree learning did not perform well under the given conditions, and the nearest-neighbor method performed poorly in some domains. The Bayesian and Rocchio methods performed consistently well in all domains, with no significant difference between them. A neural network with a nonlinear activation function brought no noticeable improvement.

These are only laboratory evaluations. Assessing the effect in a specific production setting requires an objective evaluation plan and actual results to back it up, and the outcome will vary between scenarios.

The evaluation results seem to favor the Bayesian algorithm in particular: it performed well in all tested domains (even where the conditional independence assumption does not hold), and its learning and prediction are relatively fast. It also appears that using only a Boolean document representation (rather than TF-IDF weights) in the classifier does not significantly affect recommendation accuracy.


Limitations


(1) Shallow content analysis: For Web page recommendation, for example, many aspects besides the text content matter, such as aesthetics, timeliness, images, audio, and video.

(2) Novelty of recommendations: Learning-based methods quickly tend to recommend more of the same, suggesting items similar to those the current user has already rated positively, so the recommendations become run-of-the-mill. It has therefore been suggested to filter out not only items that are too different from the profile but also items that are too similar. The overall goal is to increase the serendipity of the recommendation list, because items the user already expects are of little value to them. The simplest way to avoid monotony is to insert some items at random.

(3) Acquiring ratings: The cold-start problem still exists. With all filtering techniques, recommendation accuracy increases with the number of ratings, and studies have shown that learning algorithms improve significantly when the number of ratings is between 20 and 50. In many domains, users can be asked at the outset to provide a list of keywords, either selected from a list of topics or typed freely into a text box.


Summary


Compared with collaborative filtering, content-based recommendation does not need information about a community of users, but it does need to acquire user preferences through explicit or implicit feedback, and it must also consider how to handle new users.

Most content-based recommendation methods originate in information retrieval (IR), because typical IR tasks such as information filtering and text classification can be viewed as recommendation applications. These methods learn a model of the user's interests from explicit or implicit feedback. With the help of various machine learning techniques, good recommendation accuracy can be achieved.

The typical difference between content-based and knowledge-based recommendation is that content-based approaches generally target text documents or other items whose features can be extracted automatically and apply some form of learning, whereas knowledge-based systems rely primarily on externally supplied information.
