Recommendation System Reading Notes (III): The Recommender System Cold Start Problem


3.1 Introduction to the Cold Start Problem

Cold start problems fall mainly into three categories:

1. User Cold start: How to make personalized recommendations for new users.

2. Item Cold start: How to recommend a new item to a user who may be interested in it.

3. System Cold start: How to design a personalized recommender system on a newly developed website.

Common solutions:

1. Provide non-personalized recommendations, such as a popularity leaderboard, and switch to personalized recommendations once enough user data has been collected.

2. Use the age, gender, and other data provided by the user to make coarse-grained personalized recommendations.

3. Let the user log in with a social network account, import the user's friends from the social networking site, and recommend items their friends like.

4. Ask new users for feedback on a selection of items when they first log in, collect their interest in those items, and then recommend items similar to the ones they liked.

5. For newly added items, use their content information to recommend them to users who have liked similar items.

6. For system cold start, introduce expert knowledge to quickly build an item correlation table in some efficient way.

3.2 Using User Registration Information

User registration information includes demographic information, descriptions of user interests, and off-site behavior data imported from other sites.

The personalized recommendation process based on registration information is basically as follows:

1. Get the registration information.

2. Classify users according to their registration information.

3. Recommend to each user the items liked by users in the category they belong to.
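As a minimal sketch, these three steps might look like the following in Python; the user data, the single-feature encoding, and all names are hypothetical:

```python
from collections import defaultdict

def recommend_by_registration(user_feature, likes, n=3):
    """Recommend the most-liked items among users sharing a registration feature.

    user_feature: dict mapping user -> feature class (e.g. "female-20s")
    likes: dict mapping user -> set of liked items
    """
    # Count, for each feature class, how often each item is liked
    popularity = defaultdict(lambda: defaultdict(int))
    for user, items in likes.items():
        f = user_feature[user]
        for item in items:
            popularity[f][item] += 1

    def recommend(user):
        # Rank the items popular within the user's own feature class
        ranked = sorted(popularity[user_feature[user]].items(),
                        key=lambda kv: -kv[1])
        return [item for item, _ in ranked[:n]]

    return recommend

# Hypothetical data: three users in two demographic classes
user_feature = {"u1": "f20", "u2": "f20", "u3": "m30"}
likes = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"d"}}
rec = recommend_by_registration(user_feature, likes)
```

Real systems would also exclude items the user has already consumed; that bookkeeping is omitted here.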

Features such as gender, age, and occupation can also be combined for recommendation.

The core problem of recommendation based on registration information is to compute, for each feature f, the users' preference for each item i, denoted p(f, i).

p(f, i) can first be defined as the popularity of item i among users who have feature f:

    p(f, i) = |N(i) ∩ U(f)|

where N(i) is the set of users who like item i, and U(f) is the set of users with feature f.

Under this definition, popular items receive high weight for users of every feature. In other words, items with a large |N(i)| have a high p(f, i) for every category of users, yet a recommender system should help users find items they would not easily discover on their own. We can therefore instead define p(f, i) as the proportion, among users who like item i, of those with feature f, with a smoothing term:

    p(f, i) = |N(i) ∩ U(f)| / (|N(i)| + α)

The α in the denominator addresses data sparsity. For example, suppose an item is liked by only one user, and that user happens to have feature f; then p(f, i) = 1 without smoothing. This is not statistically significant, so we add a fairly large number α to the denominator, which prevents such items from receiving large weights.
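The smoothed formula is a one-liner given the two user sets; the toy sets and the α value below are illustrative:

```python
def preference(f, i, U, N, alpha=100):
    """Smoothed preference p(f, i) = |N(i) & U(f)| / (|N(i)| + alpha).

    U: dict mapping feature -> set of users having that feature
    N: dict mapping item -> set of users who like that item
    """
    return len(N[i] & U[f]) / (len(N[i]) + alpha)

# An item liked by a single user with feature f no longer gets p = 1
U = {"f": {"u1"}}
N = {"i": {"u1"}}
p = preference("f", "i", U, N)  # 1 / (1 + 100), well below 1
```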

Two recommender-system datasets contain demographic information: the BookCrossing dataset and the Last.fm dataset.

3.3 Selecting the Right Items to Elicit New Users' Interest

Another way to address user cold start is not to show recommendations immediately when a new user first visits the system, but to present the user with some items, collect their feedback on those items, and then make personalized recommendations based on that feedback.

In general, items used to elicit user interest should have the following characteristics:

1. Fairly popular: users can only express opinions on items they have heard of.

2. Representative and discriminative: not items that are popular with everyone.

3. Diverse: the starting item set should cover a wide range of interests.

How to design a system that selects the starting item set: use a decision tree.

First, given a group of users, use the variance of their ratings on items to measure how consistent the group's interests are. If the variance is large, the group's interests are not very consistent, meaning the item has high discriminative power; conversely, a small variance means the group's interests are fairly consistent.

An item's discriminative power D(i) can be measured as follows:

    D(i) = σ(N+(i)) + σ(N-(i)) + σ(N̄(i))

where N+(i) is the set of users who like item i, N-(i) is the set of users who dislike item i, and N̄(i) is the set of users who have not rated item i; σ(·) denotes the variance of a user group's ratings on other items.

In other words, for item i, users are divided into three classes: those who like item i, those who dislike it, and those who do not know it. If the users within each of these three groups have very inconsistent interests in other items, item i has high discriminative power.

The algorithm first finds the item i with the highest discriminative power among all users and uses it to divide the users into three classes. Then, within each class, it finds the most discriminative item and splits that class into three again, dividing all users into nine classes, and so on, ultimately classifying users through their opinions on a series of items. During cold start, the system asks the user about items starting from the root node and routes the user down different branches according to their answers until a leaf node is reached. At that point the system has a fairly clear picture of the user's interests and can begin making more accurate personalized recommendations.
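One level of this split can be sketched as follows. The 1-5 rating scale, the "above 3 means like" threshold, and the data layout are assumptions for illustration, not from the text:

```python
import statistics

def group_variance(users, ratings, exclude):
    """Variance of all ratings (on items other than `exclude`) given by `users`."""
    vals = [r for u in users
            for item, r in ratings.get(u, {}).items() if item != exclude]
    return statistics.pvariance(vals) if len(vals) > 1 else 0.0

def discriminative_power(i, users, ratings):
    """D(i): sum of rating variances of the like / dislike / unknown groups."""
    liked = {u for u in users if ratings.get(u, {}).get(i, 0) > 3}
    disliked = {u for u in users if 0 < ratings.get(u, {}).get(i, 0) <= 3}
    unknown = users - liked - disliked
    return sum(group_variance(g, ratings, i) for g in (liked, disliked, unknown))

def best_split_item(users, items, ratings):
    """Pick the item with the highest discriminative power at this tree node."""
    return max(items, key=lambda i: discriminative_power(i, users, ratings))

# Hypothetical ratings: item "i" splits u1/u2 into a group that disagrees on "a"
ratings = {"u1": {"i": 5, "a": 5}, "u2": {"i": 5, "a": 1}, "u3": {"a": 3}}
users = {"u1", "u2", "u3"}
```

Recursing on each of the three groups with the remaining items yields the full tree.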

3.4 Using Item Content Information

UserCF

The "first push": how do the first users discover a new item?

The simplest way to provide this first push is to show the new item to users at random, but that is not personalized. A better option is to use the item's content information and push the new item first to users who have liked similar items.

ItemCF

ItemCF recomputes the item similarity table from user behavior at some interval and keeps the resulting item correlation matrix in memory. A newly added item is absent from this in-memory table, so it will never be recommended.

Therefore, for new items, the item correlation table can only be computed from item content information, and that table must be updated frequently.

In general, an item's content can be represented with the vector space model, which represents the item as a vector of keywords. If the item's content includes entities such as directors and actors, these entities can be used as keywords.

For Chinese, first segment the text into a stream of words, then detect named entities in the word stream; these entities together with other important words form a keyword set. Finally, rank the keywords and compute each keyword's weight, producing a keyword vector.

Text ----> Word segmentation ----> Entity detection ----> Keyword ranking ----> Keyword vector

For item d, its content is represented as the keyword vector:

d_i = {(e1, w1), (e2, w2), ...}

where e_i is a keyword and w_i is the weight of the corresponding keyword. If the item is text, the weights can be computed with TF-IDF from information retrieval; if the item is a movie, actors can be weighted by how important they are in the film.

Given the keyword vectors, the content similarity of two items can be computed as the cosine similarity between their vectors:

    sim(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)

Content-based filtering ignores user behavior, and with it item popularity and the patterns contained in user behavior, so its accuracy is relatively low, but its results have relatively high novelty.

If user behavior is strongly driven by some content attribute, a content-based algorithm can be more accurate than collaborative filtering. However, such strong content features do not exist for all items, and extracting them requires rich domain knowledge; in many cases content-based algorithms are less accurate than collaborative filtering.
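The cosine similarity between two keyword vectors can be sketched as follows, representing each vector as a keyword-to-weight dict; the example movie vectors are hypothetical:

```python
import math

def cosine_similarity(di, dj):
    """Cosine similarity between two keyword vectors.

    di, dj: dicts mapping keyword -> weight, e.g. {"e1": w1, "e2": w2}
    """
    dot = sum(w * dj.get(e, 0.0) for e, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

# Two hypothetical movie keyword vectors sharing one keyword
a = {"nolan": 0.8, "bale": 0.5}
b = {"nolan": 0.8, "hathaway": 0.5}
```

Because only shared keywords contribute to the dot product, very short keyword lists produce coarse similarities, which is exactly the weakness noted for short texts below.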

However, if you can combine the two algorithms, you will be able to get better results than using these two algorithms alone.

The vector space model gives good results when content data is rich; if the text is very short and the keywords few, it has difficulty computing accurate similarities.

How to establish the relationship between articles, topics and keywords is the focus of topic model research.

The basic idea of a topic model: when a person writes a document, they first decide which topics the article will discuss, then think about which words describe those topics, and finally write the article word by word. Articles and words are thus connected through topics.

There are three elements in LDA: documents, topics, and words. Each document is represented as a collection of words (the bag-of-words model). Each word in a document belongs to one topic. Let D be the document collection, with D[i] the i-th document, w[i][j] the j-th word of the i-th document, and z[i][j] the topic assigned to the j-th word of the i-th document.

The LDA computation has two parts: initialization and iteration. First initialize z, which is very simple: assuming there are K topics in total, assign a random topic to the j-th word of each document i. At the same time, use NWZ(w, z) to count how many times word w has been assigned topic z, and NZD(z, d) to count how many words in document d have been assigned topic z.

After initialization, iteration brings the topic distribution to convergence at a reasonable distribution.
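The initialization and iteration described above can be sketched as a minimal collapsed Gibbs sampler. The hyperparameters alpha and beta and the sampling formula are standard LDA machinery that the text does not spell out, so treat this as one plausible realization:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word lists.

    Returns z (topic of each word) and nzd (topic counts per document).
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size
    nwz = defaultdict(int)   # NWZ(w, z): times word w is assigned topic z
    nz = defaultdict(int)    # total words assigned topic z
    nzd = defaultdict(int)   # NZD(z, d): words in doc d assigned topic z
    # Initialization: give every word a random topic and record the counts
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]
            nwz[w, t] += 1; nz[t] += 1; nzd[t, d] += 1
    # Iteration: resample each word's topic from its conditional distribution
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]
                nwz[w, t] -= 1; nz[t] -= 1; nzd[t, d] -= 1
                weights = [(nwz[w, k] + beta) / (nz[k] + V * beta)
                           * (nzd[k, d] + alpha) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][j] = t
                nwz[w, t] += 1; nz[t] += 1; nzd[t, d] += 1
    return z, nzd

docs = [["a", "b", "a"], ["c", "d", "c"]]
z, nzd = lda_gibbs(docs, K=2, iters=50)
```

Normalizing nzd per document gives the topic distribution needed for the similarity computation below.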

When using LDA to compute the content similarity of items, first compute each item's distribution over topics, then compute item similarity from the two items' topic distributions. If two items' topic distributions are similar, the items are considered highly similar; otherwise they are considered dissimilar. The similarity of two distributions can be measured with the KL divergence:

    D_KL(p ‖ q) = Σ_i p(i) ln( p(i) / q(i) )

where p and q are two distributions; the larger the KL divergence, the lower the similarity between them.
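A small sketch of the KL divergence between two topic distributions. The epsilon guard against zero probabilities is an assumption of this sketch, since the text does not address zeros:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) = sum_i p_i * ln(p_i / q_i); eps guards against zeros."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Similar topic distributions give a small divergence, dissimilar a large one
near = kl_divergence([0.5, 0.5], [0.4, 0.6])
far = kl_divergence([0.9, 0.1], [0.1, 0.9])
```

Note that KL divergence is asymmetric; a symmetrized variant (averaging both directions) is a common choice when a proper similarity is needed.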
