1: Ways to connect users' interests and items
2: Typical representatives of tagging systems
3: How users tag
4: Tag-based recommender systems
5: Improvements to the algorithm
Source code: see GitHub
One: Ways to connect users' interests and items
The purpose of a recommender system is to connect users' interests with items, and this connection relies on different mediums. Popular recommender systems basically use three ways of connecting users' interests with items:
1: Use the items a user already likes and recommend items similar to them, i.e., item-based collaborative filtering (for an analysis of the algorithm, see: click to read).
2: Use other users whose interests are similar to the user's and recommend the items those users like, i.e., user-based collaborative filtering (for an analysis of the algorithm, see: click to read).
3: Connect users and items through features and recommend items that carry the features the user likes. The features can take different forms, for example a set of item attributes or a semantic vector; below we discuss one important form of feature: the tag.
Two: Typical representatives of tagging systems
Leaving aside foreign websites, typical domestic examples include Douban Books (left) and NetEase Cloud Music (right).
The tagging system does help users find things they like and are interested in.
Three: How users tag
On the Internet, every individual's behavior looks random, but these surface behaviors hide many regularities. We therefore count users' tagging behavior and introduce the notion of tag popularity: every time a user applies a tag to an item, the tag's popularity increases by 1. This can be implemented with the following code:
# Count tag popularity: each (user, item, tag) record increases the tag's popularity by 1
def TagPopularity(records):
    tagfreq = dict()
    for user, item, tag in records:
        if tag not in tagfreq:
            tagfreq[tag] = 1
        else:
            tagfreq[tag] += 1
    return tagfreq
The figure below shows the tag popularity distribution, which follows the typical long-tail distribution: its log-log curve is almost a straight line.
When a user sees an item, we would like the tags he applies to be keywords that accurately describe the item's content attributes, but users often do not follow our expectations and may apply all kinds of bizarre tags, so we need to manually curate some specific tags for users to choose from. Scott A. Golder studied the tags on Delicious and divided them into the following categories:
Tags indicating what the item is: for example, a photo of a bird carries the tag "bird"; Douban's homepage carries the tag "douban"; Steve Jobs's homepage carries the tag "jobs".
Tags indicating the type of the item: for example, in Delicious bookmarks, tags indicating the category of a page include article, blog, book, and so on.
Tags indicating who owns the item: for example, many blog tags include the blog's author and similar information.
Tags expressing the user's opinion: for example, a user who finds a page interesting tags it "funny", and one who finds it boring tags it "boring".
User-related tags: for example, "my favorite", "my comment", and so on.
Tags related to the user's tasks: for example, "to read", "job search".
On Douban, for example, tags can be divided into similar categories.
Four: Tag-based recommender systems
Users describe their views of items with tags, so tags are the link connecting users and items and an important data source reflecting user interest. How to use users' tag data to improve the quality of personalized recommendations is an important topic in recommender system research.
Douban makes good use of tag data, integrating the tagging system into its entire product line.
First, on each book's page, Douban provides an application called "common tags of Douban members", which shows the user the tags most commonly applied to this book.
At the same time, when a user rates a book, Douban also lets the user tag the book.
Finally, in the personalized recommendation results, Douban uses tags to cluster the recommendations and shows the user the recommendations under different tags, thereby increasing the diversity and explainability of the recommendations.
A dataset of users' tagging behavior is typically represented by a set of triples, where a record (u, i, b) indicates that user u applied tag b to item i. Of course, real tagging data is far richer than triples; it also includes, for example, the time of tagging, the user's attribute data, and the item's attribute data. But in order to focus on the tag data, we consider only the triple form defined above, that is, each tagging action is represented by a triple (user, item, tag).
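As a minimal sketch (the variable name records and the sample values below are illustrative assumptions, not from the original dataset), such a dataset can be held in memory as a list of triples:

# A toy tagging dataset: each record is a (user, item, tag) triple.
records = [
    ("u1", "item_1", "python"),
    ("u1", "item_2", "machine-learning"),
    ("u2", "item_1", "python"),
]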
1: Test setup
In this section we randomly split the dataset into 10 parts. The split key is the (user, item) pair, not the tag; in other words, all of a user's tag records on a given item go either entirely into the training set or entirely into the test set, never partly into each. We then pick 1 part as the test set and the remaining 9 parts as the training set, learn from the user tag data in the training set, and predict which items each user will tag in the test set. For a user u, let R(u) be the recommendation list of length N generated for u, containing the items we think the user will tag, and let T(u) be the set of items the user actually tagged in the test set. We then use precision and recall to evaluate the accuracy of the personalized recommendation algorithm.
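The precision and recall formulas appeared as images in the original post; based on the definitions of R(u) and T(u) above, the usual forms (summed over all users u in the test set) are:

    \mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad \mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}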
We repeat the experiment 10 times, choosing a different test set each time, and use the average precision and average recall over the experiments as the final evaluation result. To evaluate the personalized recommendations more comprehensively, we also measure the coverage, diversity, and novelty of the recommendation results. Coverage is computed with the following formula:
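The coverage formula was an image in the original post; reconstructed from the standard definition, it is the fraction of all items I that appear in at least one user's recommendation list:

    \mathrm{Coverage} = \frac{\left| \bigcup_{u \in U} R(u) \right|}{|I|}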
Next, we use the cosine similarity of item tag vectors to measure the similarity between items. For each item i, item_tags[i] stores the tag vector of item i, where item_tags[i][b] is the number of times item i has been tagged with b. The cosine similarity of items i and j can then be computed with the following procedure:
import math

# Compute the cosine similarity between the tag vectors of items i and j
def CosineSim(item_tags, i, j):
    ret = 0
    for b, wib in item_tags[i].items():   # tags shared by items i and j
        if b in item_tags[j]:
            ret += wib * item_tags[j][b]
    ni = 0
    nj = 0
    for b, w in item_tags[i].items():     # squared norm of item i's tag vector
        ni += w * w
    for b, w in item_tags[j].items():     # squared norm of item j's tag vector
        nj += w * w
    if ret == 0:
        return 0
    return ret / math.sqrt(ni * nj)       # return the cosine value
After obtaining the similarity between items, we can compute the diversity of a recommendation list with the following formula:
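The formula was an image in the original post; matching the implementation below, it is the average cosine similarity over all ordered pairs of distinct items in the list R(u):

    \mathrm{Diversity}(R(u)) = \frac{\sum_{i \in R(u)} \sum_{j \in R(u),\, j \neq i} \mathrm{CosineSim}(item\_tags[i],\ item\_tags[j])}{|R(u)| \, (|R(u)| - 1)}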
The Python implementation is:
# Compute the diversity of a recommendation list as the average pairwise tag similarity
def Diversity(item_tags, recommend_items):
    ret = 0
    n = 0
    for i in recommend_items.keys():
        for j in recommend_items.keys():
            if i == j:
                continue
            ret += CosineSim(item_tags, i, j)
            n += 1
    return ret / (n * 1.0)
The diversity of the recommender system is then the average of the diversities of all users' recommendation lists.
As for the novelty of the recommendation results, we simply use the average popularity (AveragePopularity) of the recommended items to measure it. For an item i, define its popularity item_pop(i) as the number of users who have tagged the item. The average popularity of the recommender system is defined as follows:
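The formula was an image in the original post; one common formulation, averaging a log-damped popularity over all recommended items (the log damping is an assumption here), is:

    \mathrm{AveragePopularity} = \frac{\sum_{u} \sum_{i \in R(u)} \log\bigl(1 + item\_pop(i)\bigr)}{\sum_{u} \sum_{i \in R(u)} 1}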
2: A simple algorithm
Given the user tagging data, the simplest personalized recommendation algorithm is easy to come up with. It can be described as follows:
Count each user's most commonly used tags.
For each tag, count the items that have been tagged with it most often.
For a user, first find his most commonly used tags, then find the most popular items carrying those tags and recommend them to him.
For the above algorithm, user u's interest in item i is given by the following formula:
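The formula was an image in the original post; reconstructed from the notation explained below and the code later in this section, it is:

    p(u, i) = \sum_{b \in B(u) \cap B(i)} n_{u,b} \, n_{b,i}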
Here, B(u) is the set of tags user u has used, B(i) is the set of tags item i has been tagged with, n_{u,b} is the number of times user u has used tag b, and n_{b,i} is the number of times item i has been tagged with b. We refer to this algorithm as SimpleTagBased.
In the Python implementation, we follow these conventions:
Store the tag triples in records, where records[i] = [user, item, tag];
store n_{u,b} in user_tags, where user_tags[u][b] = n_{u,b};
store n_{b,i} in tag_items, where tag_items[b][i] = n_{b,i}.
The following program computes user_tags and tag_items from records:
# Build user_tags, tag_items, and user_items from records
# (addValueToMat is defined in the full program at the end of this post; it increments theMat[key][value] by incr)
def InitStat(records):
    user_tags = dict()
    tag_items = dict()
    user_items = dict()
    for user, item, tag in records:
        addValueToMat(user_tags, user, tag, 1)
        addValueToMat(tag_items, tag, item, 1)
        addValueToMat(user_items, user, item, 1)
    return user_tags, tag_items, user_items
After user_tags and tag_items have been computed, we can produce personalized recommendations for a user with the following program:
# Personalized recommendation for a single user
# (user_tags, tag_items, and user_items are the dictionaries built by InitStat)
def Recommend(user):
    recommend_items = dict()
    tagged_items = user_items[user]
    for tag, wut in user_tags[user].items():
        for item, wti in tag_items[tag].items():
            # if the item has already been tagged by the user, do not recommend it
            if item in tagged_items:
                continue
            if item not in recommend_items:
                recommend_items[item] = wut * wti
            else:
                recommend_items[item] += wut * wti
    return recommend_items
Five: Improvements to the algorithm
Looking back at the simple algorithm proposed in Section Four, it has many shortcomings, for example in how it handles popular items and sparse data, problems that are frequently encountered in recommender systems.
1: TF-IDF
The formula above tends to give popular items large weights, so it will recommend popular items to users and thus reduce the novelty of the recommendations. In addition, this formula models user interest with the user's tag vector, where each entry is a tag the user has used and its weight is the number of times the user used that tag. The drawback of this kind of modeling is that it gives popular tags too much weight and therefore fails to reflect the user's personalized interests. Here we can borrow the idea of TF-IDF and improve the formula:
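The improved formula was an image in the original post; reconstructed in the TF-IDF spirit described above (the exact damping form is an assumption based on the standard TagBasedTFIDF formulation), it divides each tag weight by the log of the number of distinct users n_b^{(u)} who have used tag b:

    p(u, i) = \sum_{b \in B(u) \cap B(i)} \frac{n_{u,b}}{\log\bigl(1 + n_b^{(u)}\bigr)} \, n_{b,i}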
Here, n_b^{(u)} records how many distinct users have used tag b. We refer to this algorithm as TagBasedTFIDF.
In the same spirit, we can also use the TF-IDF idea to penalize popular items, obtaining the following formula:
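Again the formula was an image in the original post; the corresponding reconstruction (under the same assumption about the damping form) also divides the item-side weight by the log of the number of distinct users n_i^{(u)} who have tagged item i:

    p(u, i) = \sum_{b \in B(u) \cap B(i)} \frac{n_{u,b}}{\log\bigl(1 + n_b^{(u)}\bigr)} \cdot \frac{n_{b,i}}{\log\bigl(1 + n_i^{(u)}\bigr)}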
Here, n_i^{(u)} records how many distinct users have tagged item i. We refer to this algorithm as TagBasedTFIDF++.
2: Data sparsity. In the algorithms above, user interest and items are connected through the intersection B(u) ∩ B(i), but for a new user this intersection is very small. To improve the reliability of the recommendations, we therefore expand the tags: for example, if a user has used the tag "recommender system", we can also add similar tags such as "personalization" and "collaborative filtering" to the user's tag set. There are many ways to expand tags, such as topic models (see the referenced blog posts); here we follow a simple principle and introduce a neighborhood-based approach. The essence of tag expansion is to find tags similar to a given tag, that is, to compute the similarity between tags. The simplest notion of similarity is synonymy: if we have a synonym dictionary, we can expand tags based on it. Without such a dictionary, we can estimate tag similarity from the data itself. If we assume that different tags applied to the same item have some similarity, then when two tags appear together in the tag sets of many items, we can consider them highly similar. For a tag b, let N(b) be the set of items tagged with b and n_{b,i} the number of users who applied tag b to item i; then the similarity of tags b and b' can be computed with the following cosine similarity formula:
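The formula was an image in the original post; reconstructed from the definitions just given, it is the cosine similarity of the two tags' item vectors:

    \mathrm{sim}(b, b') = \frac{\sum_{i \in N(b) \cap N(b')} n_{b,i} \, n_{b',i}}{\sqrt{\sum_{i \in N(b)} n_{b,i}^{2} \; \sum_{i \in N(b')} n_{b',i}^{2}}}

As a minimal sketch (not from the original post), this can be computed directly from the tag_items structure introduced earlier, where tag_items[b][i] = n_{b,i}:

import math

# Cosine similarity between two tags, based on the items they have been applied to
def TagSim(tag_items, b, b2):
    ret = 0
    for i, wbi in tag_items[b].items():      # items tagged with both b and b2
        if i in tag_items[b2]:
            ret += wbi * tag_items[b2][i]
    if ret == 0:
        return 0
    nb = sum(w * w for w in tag_items[b].values())    # squared norm of tag b's item vector
    nb2 = sum(w * w for w in tag_items[b2].values())  # squared norm of tag b2's item vector
    return ret / math.sqrt(nb * nb2)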
3: Tag cleaning. Not all tags reflect a user's interest. For example, on a video site a user may apply a mood tag such as "not funny" to a video, but we should not conclude that the user is interested in "not funny" videos and recommend other videos carrying that tag. Conversely, if the user has tagged a video "Jackie Chan", we can assume the user is interested in Jackie Chan's films and recommend his other movies. In addition, tagging systems often contain tags that differ in form but share the same meaning, such as "recommender system" and "recommendation engine", which are synonyms.
Another important reason for tag cleaning is to use tags as explanations for recommendations. If we want to show a tag to the user as the explanation for why an item is recommended, the quality requirements for the tags are very high: first, the tags must not contain meaningless stop words or words that merely express emotion; second, the explanations must not contain many words with the same meaning.
In general, the following tag-cleaning methods are used (a small normalization sketch follows the list):
Remove tags that are high-frequency stop words;
remove synonyms caused by different word roots, such as "recommender system" and "recommendation system";
remove synonyms caused by delimiters, such as collaborative_filtering and collaborative-filtering. To control tag quality, many websites also rely on user feedback, that is, they let users tell the system whether a tag is appropriate.
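As a minimal illustration (not from the original post; the normalization rules and the stop list are illustrative assumptions), delimiter and case differences can be cleaned up with a simple mapping step before any statistics are computed:

import re

# Illustrative stop-word tags; a real system would use a curated list
STOP_TAGS = {"todo", "stuff", "misc"}

# Lower-case a tag and unify spaces, underscores, and hyphens into a single delimiter
def normalize_tag(tag):
    tag = tag.strip().lower()
    tag = re.sub(r"[\s_\-]+", "-", tag)  # collaborative_filtering -> collaborative-filtering
    return tag

# Apply the normalization to a list of (user, item, tag) records and drop stop-word tags
def clean_records(records):
    cleaned = []
    for user, item, tag in records:
        tag = normalize_tag(tag)
        if tag and tag not in STOP_TAGS:
            cleaned.append((user, item, tag))
    return cleaned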
Full recommendation program (updated version available on GitHub):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random

# Increment theMat[key][value] by incr, creating the nested dict entries as needed
def addValueToMat(theMat, key, value, incr):
    if key not in theMat:               # key has not appeared in theMat yet
        theMat[key] = dict()
        theMat[key][value] = incr
    else:
        if value not in theMat[key]:
            theMat[key][value] = incr
        else:
            theMat[key][value] += incr  # already present, so increment

user_tags = dict()
tag_items = dict()
user_items = dict()
user_items_test = dict()  # dictionary holding the test-set data

# Initialization: read the data file and build all the statistics
def initStat():
    data_file = open('delicious.dat')
    for line in data_file:
        terms = line.strip().split("\t")  # each line has the form: user \t item \t tag
        user = terms[0]
        item = terms[1]
        tag = terms[2]
        if random.random() > 0.1:  # about 90% of the data as the training set
            addValueToMat(user_tags, user, tag, 1)
            addValueToMat(tag_items, tag, item, 1)
            addValueToMat(user_items, user, item, 1)
        else:                      # the remaining ~10% as the test set
            addValueToMat(user_items_test, user, item, 1)
    data_file.close()

# Recommendation algorithm (SimpleTagBased)
def recommend(usr):
    recommend_list = dict()
    tagged_item = user_items[usr]  # all items the user has already tagged
    for tag_, wut in user_tags[usr].items():        # tags used by the user and their counts
        for item_, wit in tag_items[tag_].items():  # items carrying the tag and their counts
            if item_ in tagged_item:                # do not recommend items already tagged
                continue
            if item_ not in recommend_list:
                recommend_list[item_] = wut * wit   # score according to the formula
            else:
                recommend_list[item_] += wut * wit
    return sorted(recommend_list.items(), key=lambda a: a[1], reverse=True)

initStat()
recommend_list = recommend("48411")
for rec in recommend_list[:10]:  # the ten item ids with the highest interest scores
    print(rec)
Run result:
('912', 610)
('3763', 394)
('52503', 238)
('39051', 154)
('45647', 147)
('21832', 144)
('1963', 143)
('1237', 140)
('33815', 140)
('5136', 138)