Source: http://www.cs.utexas.edu/~ML/papers/libra-sigir-wkshp-99.pdf
Introduction
This article explains in detail the notation, terminology, formula derivation, and core ideas of a content-based recommendation algorithm built on multinomial naive Bayes, and shows how item recommendation can be treated as a text classification problem. After working through the algorithm, you should be able to use the formulas to compute a user's profile at the word level, that is, the user's preference intensity for each word. New items are then recommended from these intensities: the word intensities are combined, with a weight per slot, into a preference intensity for each item, which yields a ranked list of the items the user is most likely to enjoy.
Symbols and terminology
Item: an object that the user rates and that can be recommended.
Slots: the attributes of an item. For example, if the item is a book, its slots may be the title, author, related-title, related-author, abstract, and user comments of the book. The initial value of each slot is usually natural-language text, which must be processed, for example by word segmentation and stemming. In the end, each slot yields a bag of words, called a document and abbreviated as d or D.
Words: the individual words in a document.
Core Ideas:
Text classification is reduced to two categories, like and dislike. From the items the user has rated, a matrix of the user's preference intensity for each word in each slot is computed, and this matrix is then used to calculate the user's preference intensity for an item.
Detailed algorithm steps
1. Item Feature Analysis and Text Processing
In this phase, you define the attributes (slots) of the item, merging identical or similar slots where appropriate. You then obtain the natural-language text for each slot and apply word segmentation, stemming, and similar operations to it, producing the documents that the algorithm uses.
Items are not limited to books as in the example above; an item could also be a URL. Its slots might then be the keywords, description, abstract, and website of the web page that the URL points to. Choose the slots according to the characteristics of your items.
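A minimal sketch of this preprocessing step, assuming an item is a plain dict that maps slot names to raw text. The function names `tokenize`, `crude_stem`, and `item_to_bags`, the regex tokenizer, and the suffix-stripping rule are illustrative assumptions; a real pipeline would use a proper word segmenter and stemmer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Very rough stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def item_to_bags(item):
    """Turn {slot name: raw text} into {slot name: bag of words}."""
    return {slot: Counter(crude_stem(w) for w in tokenize(text))
            for slot, text in item.items()}

# Hypothetical book item with two slots.
book = {
    "title": "Learning Machines",
    "abstract": "Machines learning from rated books",
}
bags = item_to_bags(book)
```

Each slot keeps its own bag, so the same word can later carry a different weight in the title than in the abstract.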
2. Learning a profile
In this step, the only items analyzed are those associated with the user, so the amount of computation is much smaller than traversing the entire dataset. The association can take several forms: the user may have rated the item, or may simply have visited the URL. Whatever the form, it is mapped to just two categories, like and dislike. For example, on a 1-10 rating scale, 6-10 means like and 1-5 means dislike. For a URL, a visit counts as like and no visit counts as dislike.
Why is it considered as only two categories?
The algorithm does not need to predict the exact rating of an item; it only needs an ordered list of items, with the highest-scoring ones at the top. The task is therefore converted into a probabilistic binary categorization problem: predicting whether or not an item is one the user will like.
Formula Derivation
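In this model, an item's features are not a single bag of words but a vector of bags, one per slot, so naive Bayes cannot be applied to the item directly. For a single slot, however, multinomial naive Bayes can still be used for classification. Let $V$ be the vocabulary of the documents, $w_k \in V$ a word in the vocabulary, $c_j$ category $j$, and $P(c_j)$ the probability of category $j$. Then, according to the naive Bayes formula, for a document $D$:

$$P(c_j \mid D) = \frac{P(c_j)}{P(D)} \prod_{i=1}^{|D|} P(a_i \mid c_j) \qquad (1)$$

where $a_i$ is the $i$-th word of $D$ and $|D|$ is the length of $D$ in words.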
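To fit our model, we replace $D$ in formula (1) with $d_m$, the document of slot $s_m$, and condition on the slot as well as on the given category $c_j$:

$$P(d_m \mid c_j, s_m) = \prod_{i=1}^{|d_m|} P(a_{mi} \mid c_j, s_m) \qquad (2)$$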
In addition, because an item is composed of slots, the probability of item $B$ is the product of the probabilities of all the slot documents that make up $B$. This also holds when conditioned on a given category $c_j$, which gives formula (3):

$$P(B \mid c_j) = \prod_{m=1}^{S} P(d_m \mid c_j, s_m) \qquad (3)$$
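Finally, apply Bayes' rule to the item and its slots, and substitute formulas (3) and (2) to obtain the final $P(c_j \mid B)$, the probability of a category given an item. The derivation is as follows:

$$P(c_j \mid B) = \frac{P(c_j)\, P(B \mid c_j)}{P(B)} = \frac{P(c_j)}{P(B)} \prod_{m=1}^{S} P(d_m \mid c_j, s_m) = \frac{P(c_j)}{P(B)} \prod_{m=1}^{S} \prod_{i=1}^{|d_m|} P(a_{mi} \mid c_j, s_m) \qquad (4)$$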
Here $S$ is the number of slots, $d_m$ is the bag of words of the $m$-th slot, and $a_{mi}$ is the $i$-th word in the $m$-th slot. Because the only items analyzed are those associated with the user, every such item $B$ has a corresponding user rating. We can therefore use maximum-likelihood estimation for naive Bayes to determine the parameters $P(c_j)$ and $P(w_k \mid c_j, s_m)$.
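The product over slots and words can be sketched in code using logarithms, which avoids numeric underflow. This is an illustrative sketch only: the nested-dict layout `word_prob[category][slot][word]`, the function names, and the choice to skip words missing from the profile are assumptions. The factor P(B) is dropped because it is the same for both categories and does not affect the ranking.

```python
import math

def log_score(item_bags, prior, word_prob, category):
    """Unnormalized log posterior of `category` for one item.

    Computes log P(c_j) plus, over every slot and word,
    count * log P(word | c_j, slot). P(B) is omitted.
    """
    score = math.log(prior[category])
    for slot, bag in item_bags.items():
        for word, count in bag.items():
            p = word_prob[category][slot].get(word)
            if p is not None:  # assumption: skip words missing from the profile
                score += count * math.log(p)
    return score

def log_odds(item_bags, prior, word_prob):
    """log P(like | B) - log P(dislike | B); positive means 'like' is more probable."""
    return (log_score(item_bags, prior, word_prob, 1)
            - log_score(item_bags, prior, word_prob, 0))

# Toy profile: category 1 = like, category 0 = dislike.
prior = {1: 0.5, 0: 0.5}
word_prob = {
    1: {"title": {"robot": 0.8, "cook": 0.2}},
    0: {"title": {"robot": 0.2, "cook": 0.8}},
}
item = {"title": {"robot": 2, "cook": 1}}
odds = log_odds(item, prior, word_prob)
```

Sorting items by `log_odds` gives the same order as sorting by the ratio of the two posteriors.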
Symbols and terminology
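Each item $B_e$ that the user has rated is given two weights, one per category, denoted $\alpha_{ej}$: $\alpha_{e1}$ for the like category and $\alpha_{e0}$ for the dislike category.
The specific values are as follows:

$$\alpha_{e1} = \frac{r - 1}{9} \qquad (5)$$

$$\alpha_{e0} = 1 - \alpha_{e1} \qquad (6)$$

Here the rating scale is 1-10 and $r$ is the user's rating of the item. The parameters can then be estimated from these weights. This requires some background on naive Bayes parameter estimation, so we recommend that you read the section on maximum-likelihood estimates for the naive Bayes model in this article. Only the resulting formulas are listed here:

$$P(c_j) = \frac{1}{N} \sum_{e=1}^{N} \alpha_{ej} \qquad (7)$$

$$P(w_k \mid c_j, s_m) = \frac{\sum_{e=1}^{N} \alpha_{ej}\, n_{kem}}{\sum_{e=1}^{N} \alpha_{ej}\, |d_{em}|} \qquad (8)$$

The denominator of formula (8) is the total weighted length of the slot $s_m$ documents in category $c_j$, where $|d_{em}|$ is the number of words in slot $s_m$ of item $B_e$.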
$N$ is the total number of rated items, $m$ is the index of the slot, $\alpha_{ej}$ is the like or dislike weight of item $B_e$, and $n_{kem}$ is the number of times word $w_k$ appears in slot $s_m$ of item $B_e$.
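The estimation step can be sketched as follows. The sketch assumes, following the LIBRA paper linked as the source, that a 1-10 rating r is rescaled to a like weight (r - 1) / 9 and a dislike weight equal to the remainder, and that a word seen n times in an item counts as weight * n occurrences in each category. The function names and the data layout are hypothetical, and no smoothing is applied here.

```python
from collections import defaultdict

def rating_weights(r):
    """Map a 1-10 rating to (dislike weight, like weight), each in [0, 1]."""
    like = (r - 1) / 9
    return 1 - like, like

def estimate(training):
    """training: list of (bags, rating); bags maps slot -> {word: count}.

    Returns the class priors and the unsmoothed P(word | category, slot),
    keyed by (category, slot, word) with category 0 = dislike, 1 = like.
    """
    n_items = len(training)
    prior = [0.0, 0.0]
    counts = defaultdict(float)   # (category, slot, word) -> weighted count
    length = defaultdict(float)   # (category, slot) -> weighted document length
    for bags, rating in training:
        for j, alpha in enumerate(rating_weights(rating)):
            prior[j] += alpha / n_items
            for slot, bag in bags.items():
                for word, n in bag.items():
                    counts[j, slot, word] += alpha * n
                    length[j, slot] += alpha * n
    # Entries whose weighted length is 0 are left out; smoothing would handle them.
    prob = {key: c / length[key[0], key[1]]
            for key, c in counts.items() if length[key[0], key[1]] > 0}
    return prior, prob

training = [
    ({"title": {"robot": 2}}, 10),   # strongly liked
    ({"title": {"cook": 1}}, 1),     # strongly disliked
]
prior, prob = estimate(training)
```

In this toy run the unsmoothed estimate for "robot" in the dislike category is exactly 0, which is the situation that smoothing is meant to repair.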
The denominator of formula (8) may be 0, so the estimate must be smoothed to avoid this. We recommend Laplace smoothing, that is, adding 1 to the numerator and N to the denominator. Finally, we obtain the list of preference intensities of the words under each slot.
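A minimal sketch of this smoothing step. The function name `laplace` and its parameter `k` (the constant added to the denominator) are illustrative. The text above adds N; textbook add-one smoothing for a multinomial model uses the vocabulary size |V| instead, which makes the smoothed probabilities within a slot sum to 1. The example assumes a toy vocabulary of 4 words and uses k = |V|.

```python
def laplace(weighted_count, weighted_length, k):
    """Smoothed estimate (count + 1) / (length + k).

    k is the constant added to the denominator; for k > 0 the result
    is never zero and the division never fails.
    """
    return (weighted_count + 1) / (weighted_length + k)

vocab_size = 4                      # assumed |V| for this toy example
counts = [2.0, 0.0, 0.0, 0.0]       # weighted counts of the 4 words in one slot
length = sum(counts)                # weighted document length of that slot

probs = [laplace(c, length, vocab_size) for c in counts]
empty = laplace(0.0, 0.0, vocab_size)   # slot with no data at all
```

Words that never occurred now receive a small positive probability, so the logarithm of their probability ratio is always defined.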
The formula for calculating the preference intensity is as follows:

$$\mathrm{Strength}(w_k, s_m) = \log \frac{P(w_k \mid c_1, s_m)}{P(w_k \mid c_0, s_m)}$$
The value of this formula reflects the effect of a word in a slot on the user's preference. If the value is greater than 0, the word indicates that the user likes the item, and a larger value means the word plays a more positive role. Traverse all slots and the vocabulary V, evaluating this formula for each pair, to obtain the word preference intensity matrix.
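Building the matrix can be sketched as below, assuming the intensity of a word is the logarithm of the ratio between its smoothed like and dislike probabilities, as in the LIBRA paper linked as the source. The function name `strength_matrix` and the `(slot, word)` key layout are hypothetical, and the probabilities are made-up toy values.

```python
import math

def strength_matrix(p_like, p_dislike):
    """Build {(slot, word): log(P_like / P_dislike)}.

    p_like and p_dislike map (slot, word) to the smoothed
    P(word | category, slot) and must share the same keys.
    """
    return {key: math.log(p_like[key] / p_dislike[key]) for key in p_like}

p_like = {("title", "robot"): 0.6, ("title", "cook"): 0.4}
p_dislike = {("title", "robot"): 0.3, ("title", "cook"): 0.8}
strength = strength_matrix(p_like, p_dislike)
```

Here "robot" in the title gets a positive intensity and "cook" a negative one, matching the sign rule described above.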
With this matrix, we can recommend new items. First, count the number of times each word appears in each slot of the new item and multiply each count by the corresponding intensity in the matrix; summing these products gives the slot's preference intensity for the new item. Then assign a weight to each slot according to the actual situation, and compute the weighted sum over slots to obtain the user's preference intensity for the new item. Finally, the items are shown to the user in descending order of this value, which completes the recommendation.
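The scoring and ranking step can be sketched as follows. The function names, the default slot weight of 1.0, and treating words absent from the matrix as intensity 0 are assumptions made for illustration; the intensities and slot weights are toy values.

```python
def item_score(bags, strength, slot_weight):
    """Weighted sum over slots of (word count x word intensity)."""
    total = 0.0
    for slot, bag in bags.items():
        # Words missing from the matrix contribute 0 (assumption).
        slot_sum = sum(n * strength.get((slot, word), 0.0)
                       for word, n in bag.items())
        total += slot_weight.get(slot, 1.0) * slot_sum
    return total

def recommend(candidates, strength, slot_weight):
    """Return candidate names sorted from highest to lowest score."""
    scored = {name: item_score(bags, strength, slot_weight)
              for name, bags in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

strength = {("title", "robot"): 0.7, ("title", "cook"): -0.7,
            ("abstract", "robot"): 0.5}
slot_weight = {"title": 2.0, "abstract": 1.0}
candidates = {
    "A": {"title": {"robot": 1}, "abstract": {"robot": 2}},
    "B": {"title": {"cook": 1}},
    "C": {"title": {"novel": 1}},
}
ranking = recommend(candidates, strength, slot_weight)
```

Item A ranks first because its words carry positive intensity in both slots, C has no known words and scores 0, and B ranks last.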