Recommendation algorithms
At present, mainstream recommendation algorithms fall mainly into two families: content-based algorithms and collaborative filtering algorithms.
Content-based algorithm (CB)
The principle of the CB algorithm is to extract an item's basic properties, content, and other information into a taglist and to assign a weight to each tag.
The rest is very similar to a search engine: the taglists of all items are converted into an inverted index and stored on an inverted index server.
To make related recommendations for an item, turn that item's taglist into a query expression, as in a search system, and then rank the recalled results as the recommendation output.
To make personalized recommendations for a user, take the list of items the user has recently liked or interacted with, merge the taglists of those items into a user model, and send the model's taglist as a request to the index service. The recalled results are the candidates recommended to the user.
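The flow above can be sketched in a few lines of Python. This is a minimal sketch, assuming a tiny in-memory inverted index in place of a real index server; the catalogue `ITEM_TAGS`, the helper names, and the scoring rule (sum of products of tag weights) are illustrative assumptions rather than anything the text prescribes.

```python
from collections import defaultdict

# Hypothetical toy catalogue: item id -> {tag: weight}
ITEM_TAGS = {
    "a": {"python": 1.0, "tutorial": 0.5},
    "b": {"python": 0.8, "web": 0.6},
    "c": {"cooking": 1.0, "tutorial": 0.4},
}

def build_inverted_index(item_tags):
    """Map each tag to the list of (item, weight) pairs that carry it."""
    index = defaultdict(list)
    for item, tags in item_tags.items():
        for tag, weight in tags.items():
            index[tag].append((item, weight))
    return index

def recall(index, query_tags, exclude=()):
    """Score every item sharing a tag with the query (assumed rule: summed weight products)."""
    scores = defaultdict(float)
    for tag, q_weight in query_tags.items():
        for item, weight in index.get(tag, []):
            if item not in exclude:
                scores[item] += q_weight * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def user_model(item_tags, liked_items):
    """Merge the taglists of recently liked items into one weighted taglist."""
    model = defaultdict(float)
    for item in liked_items:
        for tag, weight in item_tags[item].items():
            model[tag] += weight
    return dict(model)

index = build_inverted_index(ITEM_TAGS)
# Related recommendation: the item's own taglist is the query.
related = recall(index, ITEM_TAGS["a"], exclude={"a"})
# Personalized recommendation: the merged user model is the query.
personal = recall(index, user_model(ITEM_TAGS, ["a"]), exclude={"a"})
```

Both calls go through the same `recall` function; only the query taglist differs, which mirrors the search-engine analogy in the text.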
The benefits of this algorithm are:
- It does not rely on user behavior, so it needs no cold-start phase and can recommend at any time
- It can give a plausible explanation for each recommendation
- Recommendations can be very timely, which is why products such as news need this algorithm
The disadvantages of this algorithm are:
- It needs to understand item content, so content that is hard to parse, such as audio and video, is difficult to handle
- Handling polysemy (one word with several meanings) and synonymy (several words with one meaning) is complex
- It is prone to serious homogenization, so recommendations lack surprise
Collaborative filtering algorithm (CF)
The principle of the CF algorithm is to aggregate all <user,item> behavior and use collective wisdom to make recommendations. It works like a recommendation from a friend. For example, analysis of the items users like shows that user A and user B are very similar (they like the same things); if user B likes an item that user A has not yet liked, that item is recommended to user A. (user-based CF)
Recommendation can also work along the other dimension. Comparing all the data shows that item a and item b are very similar (they are liked by similar people); then, for every item that user A likes, the items similar to it are pulled out as recommendation candidates for user A. (item-based CF)
Both of the above are personalized recommendations. For related recommendations, the intermediate result of item-based CF (the item-to-item similarity) can be used directly.
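As a rough illustration of the item-based variant, the sketch below computes item-to-item similarity from like records and then scores unseen items for one user. The data in `LIKES`, the function names, and the choice of cosine similarity over sets of users are assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical <user, item> "like" records.
LIKES = {
    "A": {"x", "y"},
    "B": {"x", "y", "z"},
    "C": {"y", "z"},
}

def item_similarity(likes):
    """Cosine similarity between items, computed over the sets of users who like them."""
    fans = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            fans[item].add(user)
    sim = defaultdict(dict)
    for a in fans:
        for b in fans:
            if a != b:
                common = len(fans[a] & fans[b])
                if common:
                    sim[a][b] = common / sqrt(len(fans[a]) * len(fans[b]))
    return sim

def recommend(likes, sim, user):
    """Item-based CF: score unseen items by their similarity to the user's liked items."""
    scores = defaultdict(float)
    for liked in likes[user]:
        for other, s in sim[liked].items():
            if other not in likes[user]:
                scores[other] += s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sim = item_similarity(LIKES)        # intermediate result, usable for related recommendations
for_user_a = recommend(LIKES, sim, "A")
```

Here `sim["x"]` alone already answers "which items are related to x", while `recommend` turns the same table into a personalized list.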
The benefits of this algorithm are:
- It can produce unexpected recommendations and often surfaces surprising results
- It makes effective use of long-tail items
- It relies only on user behavior and needs no deep understanding of the content, so it applies widely
The disadvantages of this algorithm are:
- A large amount of <user,item> behavior data is needed at the outset, that is, it requires a lot of cold-start data
- It is hard to give a reasonable explanation for a recommendation
Principle
In implementation, collaborative filtering algorithms fall into two typical categories:
Neighborhood-based collaborative filtering algorithm
The main idea of these algorithms is to use the <user,item> rating matrix and statistical information to compute the similarity between user and user or between item and item. The similarities are then used to rank candidates and produce the final recommendations.
The principles of the common algorithms are as follows:
- User-based CF
First look at the formula:
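The formula itself is not reproduced in this text. Judging from the symbols explained next, it is most likely the Pearson correlation computed over co-rated items, which under that assumption can be written as:

```latex
\mathrm{sim}(i,j) =
\frac{\sum_{x \in I_{ij}} \left(R_{i,x} - \bar{R}_i\right)\left(R_{j,x} - \bar{R}_j\right)}
     {\sqrt{\sum_{x \in I_{ij}} \left(R_{i,x} - \bar{R}_i\right)^2}\;
      \sqrt{\sum_{x \in I_{ij}} \left(R_{j,x} - \bar{R}_j\right)^2}}
```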
The formula computes the similarity between user i and user j. I(ij) is the set of items rated by both user i and user j, R(i,x) is user i's rating of item x, and R(i) with a bar is the average of all of user i's ratings. The average is subtracted because some users rate strictly and others leniently; normalizing each user's ratings keeps these habits from distorting the comparison.
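A similarity of this kind can be sketched as follows. The sketch assumes the measure is a Pearson-style correlation over co-rated items, with each user's mean taken over all of that user's ratings as described above; the `RATINGS` data and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}
RATINGS = {
    "i": {"x1": 5, "x2": 3, "x3": 4},
    "j": {"x1": 4, "x2": 2, "x3": 3, "x4": 5},
    "k": {"x1": 1, "x2": 5},
}

def mean_rating(ratings, user):
    """Average of all scores given by the user."""
    scores = ratings[user].values()
    return sum(scores) / len(scores)

def user_similarity(ratings, i, j):
    """Mean-centered correlation of two users over the items both have rated."""
    common = ratings[i].keys() & ratings[j].keys()
    if not common:
        return 0.0
    mean_i, mean_j = mean_rating(ratings, i), mean_rating(ratings, j)
    num = sum((ratings[i][x] - mean_i) * (ratings[j][x] - mean_j) for x in common)
    den_i = sqrt(sum((ratings[i][x] - mean_i) ** 2 for x in common))
    den_j = sqrt(sum((ratings[j][x] - mean_j) ** 2 for x in common))
    if den_i == 0 or den_j == 0:
        return 0.0  # no variation on the shared items, similarity undefined
    return num / (den_i * den_j)

sim_ij = user_similarity(RATINGS, "i", "j")  # similar tastes, positive value
sim_ik = user_similarity(RATINGS, "i", "k")  # opposite tastes, negative value
```

Subtracting each user's mean is what lets a strict rater and a lenient rater with the same preferences still come out as similar.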
This formula does not take into account that popular items may be liked by many users, so item weights can be added as a further optimization; that formula is not shown here.
In real production environments, a similar algorithm called Slope One is often used. For each pair of items that were rated together, it subtracts one rating from the other and averages the differences to obtain a rating deviation.
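The idea behind Slope One can be sketched as below. This is the weighted variant, where each deviation counts in proportion to the number of users behind it; the `RATINGS` data and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical ratings: user -> {item: score}
RATINGS = {
    "u1": {"a": 5, "b": 3, "c": 2},
    "u2": {"a": 3, "b": 4},
    "u3": {"b": 2, "c": 5},
}

def deviations(ratings):
    """dev[j][i] = average of (r_uj - r_ui) over users who rated both j and i."""
    diff = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for scores in ratings.values():
        for j, rj in scores.items():
            for i, ri in scores.items():
                if i != j:
                    diff[j][i] += rj - ri
                    count[j][i] += 1
    dev = {j: {i: diff[j][i] / count[j][i] for i in diff[j]} for j in diff}
    return dev, count

def predict(ratings, dev, count, user, target):
    """Weighted Slope One: shift each known rating by the deviation to the target item."""
    num = den = 0.0
    for i, ri in ratings[user].items():
        if i in dev.get(target, {}):
            n = count[target][i]
            num += (dev[target][i] + ri) * n
            den += n
    return num / den if den else None

dev, count = deviations(RATINGS)
guess = predict(RATINGS, dev, count, "u2", "c")  # u2 has not rated item c
```

The deviation table depends only on item pairs, so it can be precomputed in batch and reused for every user.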
- Item-based CF
The formula is analogous to that of user-based CF and is not repeated here.
Such algorithms face two typical problems:
- Matrix sparsity
- Scalability limits caused by finite computing resources
For this reason, researchers have proposed a series of model-based collaborative filtering algorithms.
Model-based collaborative filtering algorithm
The more commonly studied model-based approaches are:
- Based on matrix factorization and latent semantics
- Based on Bayesian networks
- Based on SVM
Here is a brief introduction to the latent semantic model based on matrix factorization. The algorithm first fills the gaps in the sparse matrix with a mean value and then uses matrix factorization to decompose it into the product of two matrices, such as:
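The original illustration is not available here. In a common notation, which is an assumption rather than the article's own, a filled rating matrix R with m users and n items is approximated by a user-factor matrix P and an item-factor matrix Q that share k latent dimensions:

```latex
R_{m \times n} \approx P_{m \times k} \, Q_{k \times n},
\qquad
\hat{r}_{ui} = \sum_{f=1}^{k} p_{uf} \, q_{fi}
```

Each predicted score is then the inner product of one row of P and one column of Q.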
Look at a practical example:
In this example, the original matrix holds the relationship between page titles and the terms obtained by word segmentation, which is analogous to the ratings in a recommendation system. SVD is then used to factorize the matrix, so that each term corresponds to a 3-dimensional vector and each title also corresponds to a 3-dimensional vector.
Much can be done from here. To compute the similarity between a term and a title, take the inner product of the two 3-dimensional vectors and use it as the score.
Terms and titles can also be projected into the same 3-dimensional space, where various clustering algorithms can find groupings between user and item, item and item, and user and user.
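A small stand-in for such an example is sketched below with NumPy. The term-by-title matrix `X` and the labels are invented for illustration and are not the data of the original example; only `numpy.linalg.svd` and basic array operations are used.

```python
import numpy as np

# Hypothetical term-by-title count matrix (rows: terms, columns: titles).
terms = ["python", "code", "recipe", "soup"]
titles = ["t1", "t2", "t3", "t4"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

k = 3  # number of latent dimensions, as in the text
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per term (scaled by singular values)
title_vecs = Vt[:k, :].T       # one k-dimensional vector per title

def score(term, title):
    """Inner product of the term vector and the title vector."""
    return float(term_vecs[terms.index(term)] @ title_vecs[titles.index(title)])

related = score("python", "t1")    # term that occurs in the title
unrelated = score("python", "t3")  # term that does not occur in the title
```

Folding the singular values into the term vectors is one of several possible conventions; splitting them evenly between both sides would give the same inner products.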
The core of the algorithm is the matrix factorization, which is extremely expensive to compute for large matrices. In real production environments, iterative gradient descent is commonly used to obtain an approximate solution.
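One way to picture the gradient descent approach is the sketch below, which fits two small factor matrices to the observed ratings only. The stochastic update rule, the L2 regularization term, and all hyperparameters (`lr`, `reg`, `epochs`, the number of factors) are assumptions chosen for the example, not values from the text.

```python
import random

# Hypothetical observed ratings as (user index, item index, score) triples.
OBSERVED = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
            (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

def init_factors(n_rows, k, rng):
    """Small random starting values for a factor matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]

def predict(P, Q, u, i):
    """Predicted score: inner product of the user vector and the item vector."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def rmse(observed, P, Q):
    """Root mean squared error over the observed ratings only."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in observed)
    return (total / len(observed)) ** 0.5

def train(observed, P, Q, lr=0.02, reg=0.02, epochs=500):
    """Stochastic gradient descent on the regularized squared error."""
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - predict(P, Q, u, i)
            for f in range(len(P[u])):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

rng = random.Random(0)
P = init_factors(3, 2, rng)  # 3 users, 2 latent factors
Q = init_factors(3, 2, rng)  # 3 items, 2 latent factors
before = rmse(OBSERVED, P, Q)
train(OBSERVED, P, Q)
after = rmse(OBSERVED, P, Q)
```

Because each update touches only one user vector and one item vector, the cost per pass grows with the number of observed ratings rather than with the full matrix size.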
Combined recommendation techniques
In practice, no single recommendation technique is free of drawbacks. A good recommendation system rarely relies on one technique alone; several are usually combined so that each compensates for the others' shortcomings. The common combinations are:
- Weighted hybrid: use several recommendation techniques and combine their results by weight;
- Switching: use different recommendation techniques depending on the user's scenario;
- Feature combination: feed the output of one recommendation technique into another recommendation technique;
- Cascade: one recommendation module obtains results from another recommendation module while producing its own output;
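The first two combinations are simple enough to sketch directly. The recommender names, the weights, and the `min_events` threshold below are hypothetical; real systems would tune them.

```python
def weighted_hybrid(score_lists, weights):
    """Blend per-recommender scores: final(item) = sum of weight * score over recommenders."""
    combined = {}
    for name, scores in score_lists.items():
        w = weights.get(name, 0.0)
        for item, s in scores.items():
            combined[item] = combined.get(item, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

def switching(user_history, cb_scores, cf_scores, min_events=5):
    """Use content-based scores for users with little history, CF scores otherwise."""
    return cf_scores if len(user_history) >= min_events else cb_scores

# Hypothetical outputs of a content-based and a collaborative filtering recommender.
cb = {"a": 0.9, "b": 0.2}
cf = {"b": 0.8, "c": 0.5}

ranked = weighted_hybrid({"cb": cb, "cf": cf}, {"cb": 0.4, "cf": 0.6})
new_user_scores = switching(["a"], cb, cf)  # one event only, so content-based scores are used
```

Item "b" ends up first because both recommenders contribute to its score, which is the kind of mutual compensation the text describes.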
Choosing between item-based CF and user-based CF
- Distribution of user and item counts and how often they change
  - If there are far more users than items, item-based CF works better, because each item accumulates more ratings and the amount of computation is relatively small
  - If there are far more items than users, user-based CF works better, for the same reason
  - In real production environments, users may not be logged in and cookie information is extremely unstable, so that only item-based CF can be used
  - If user behavior changes slowly (for example, fiction), results are more stable with user-based CF
  - If user behavior changes quickly (for example, news, music, or movies), results are more stable with item-based CF
- Trade-off between relevance and surprise
  - Item-based CF leans toward relevance, so its results may look rather similar to one another
  - User-based CF is more likely to surprise, because it looks at similarity between people, and the items it brings in may be less expected
- Data update frequency and timeliness requirements
  - Products whose items demand timely updates, such as news, cannot use item-based CF directly: CF requires batch computation, and a new item cannot be recommended until the computation finishes, so the data is not timely enough;
  - User-based CF can be used instead: record <user,item> behavior pairs online, and recommend time-sensitive items according to the recent behavior of similar users;
  - For products such as film, television, or music, item-based CF can be used;
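The online approach for time-sensitive items mentioned in the last group can be sketched as follows. It assumes user neighbours come from an earlier batch job and that recent events are available as an in-memory list; `NEIGHBOURS`, `EVENTS`, and the 60-second window are all hypothetical.

```python
from collections import defaultdict

# Hypothetical neighbours from a batch user-based CF job: user -> {neighbour: similarity}
NEIGHBOURS = {"A": {"B": 0.9, "C": 0.4}}

# Hypothetical online log of (user, item, unix_time) events, recorded as they happen.
EVENTS = [
    ("B", "news1", 100),
    ("C", "news2", 180),
    ("B", "news3", 190),
    ("A", "news3", 195),
]

def fresh_recommendations(user, neighbours, events, now, window=60):
    """Score items that similar users touched within the last `window` seconds."""
    seen = {item for u, item, _ in events if u == user}
    scores = defaultdict(float)
    for u, item, ts in events:
        sim = neighbours.get(user, {}).get(u)
        if sim is not None and now - ts <= window and item not in seen:
            scores[item] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

picks = fresh_recommendations("A", NEIGHBOURS, EVENTS, now=200)
```

A new item becomes recommendable as soon as one neighbour interacts with it, without waiting for the next batch computation.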