One:
Recommendation system task: connect users with information. On the one hand, it helps users find information that is valuable to them; on the other hand, it puts information in front of the users who are interested in it, so that information consumers and information producers both win.
Long tail theory: the traditional 80/20 principle (80% of sales come from the 20% most popular goods) has been challenged by the Internet. Total long-tail sales are a figure that cannot be ignored and may even exceed the sales of popular goods. Popular goods represent the needs shared by the vast majority of users, while long-tail goods represent the personalized needs of small groups of users. Therefore, to raise sales by tapping the long tail, we must study users' interests thoroughly.
Social recommendation: get recommendations through social relationships.
Content-based recommendation: for example, finding a movie through an actor.
Collaborative-filtering-based recommendation: for example, through ranking lists.
Two conditions for personalized recommendation to succeed: ① there is information overload; ② most of the time users do not have a specific need.
Recommendation system evaluation: what makes a good recommendation system? A recommendation system typically has three participants: the users, the providers of the items, and the website that provides the recommendation system. First, the system must meet users' needs by recommending items they are interested in. Second, it should let every item be recommended to the users interested in it, rather than recommending only a few popular items. Finally, a good system lets the website itself collect high-quality user feedback and keep improving recommendation quality. Evaluating a recommendation system therefore requires considering the interests of all three parties; a good system is one in which all three win.
Recommendation system experiment methods:
1: Offline experiment: extract data from the actual system logs, divide it into a training set and a test set, train the model on the training set, and evaluate it on the test set.
Advantages: needs no control over the actual system, needs no user participation, is fast, and can test a large number of algorithms.
Disadvantages: cannot compute the business metrics of interest, and there is a gap between offline metrics and business metrics.
2: User survey: ask users directly. Advantages: yields many metrics that reflect users' subjective feelings. Disadvantages: recruiting test users is hard and large-scale tests are difficult to organize, so the results often lack statistical significance.
3: Online experiment: put the recommendation system online and run an AB test that compares it with the old algorithm (users are split into groups, and different groups get different algorithms).
Advantages: fairly measures the actual online performance of different algorithms, including the business metrics of interest.
Disadvantages: the cycle is long; the experiment must run for a long time to give reliable results.
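The user grouping used in an AB test can be sketched as below. This is only an illustration, not the method of any particular AB testing platform: the function name assign_group, the experiment label and the group names are all hypothetical. Hashing the user ID together with the experiment name keeps each user in the same group on every visit.

```python
import hashlib

def assign_group(user_id, experiment="usercf_vs_old", groups=("old", "new")):
    """Deterministically map a user to one of the experiment groups.

    All names here are illustrative assumptions, not part of the notes above.
    """
    key = "{}:{}".format(experiment, user_id).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    bucket = int(digest, 16) % len(groups)  # stable bucket index for this user
    return groups[bucket]

if __name__ == "__main__":
    for uid in (1, 2, 3):
        print(uid, assign_group(uid))
```

Because the assignment depends only on the user ID and the experiment name, the online metrics of the two groups can be compared over the whole experiment period.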
Evaluation metrics:
1: User satisfaction
Obtained through user surveys or online experiments.
2: Prediction accuracy
The offline dataset is split into a training set and a test set. A model of user behavior and interest is built on the training set and used to predict user behavior on the test set; the degree of overlap between the predicted behavior and the actual behavior on the test set is the prediction accuracy.
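①: Rating prediction: accuracy is generally measured by the root mean square error (RMSE) and the mean absolute error (MAE). For a user u and an item i in the test set T, let $r_{ui}$ be the actual rating and $\hat{r}_{ui}$ the predicted rating.
RMSE:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{(u,i) \in T} (r_{ui} - \hat{r}_{ui})^2}{|T|}}$$
MAE:
$$\mathrm{MAE} = \frac{\sum_{(u,i) \in T} |r_{ui} - \hat{r}_{ui}|}{|T|}$$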
import math

def RMSE(records):
    # records: list of (u, i, rui, pui) = (user, item, actual rating, predicted rating)
    return math.sqrt(sum([(rui - pui) * (rui - pui) for u, i, rui, pui in records]) / float(len(records)))

def MAE(records):
    return sum([math.fabs(rui - pui) for u, i, rui, pui in records]) / float(len(records))
②: TopN recommendation: when a website provides a recommendation service, it generally gives the user a personalized recommendation list; this is called TopN recommendation. Its prediction accuracy is measured by precision and recall.
R(u) is the recommendation list produced for user u from the user's behavior on the training set (the predicted behaviors), T(u) is the list of the user's actual behaviors on the test set, and U is the set of users.
Recall is defined as:
$$\mathrm{Recall} = \frac{\sum_{u \in U} |R(u) \cap T(u)|}{\sum_{u \in U} |T(u)|}$$
Precision is defined as:
$$\mathrm{Precision} = \frac{\sum_{u \in U} |R(u) \cap T(u)|}{\sum_{u \in U} |R(u)|}$$
def precisionrecall(test, n):
    '''
    test.items(): pairs (user, items), where items is the set of items
    the user actually acted on in the test set.
    rank is the set of items predicted (recommended) for the user;
    recommend(user, n) is assumed to return a set of n items.
    '''
    hit = 0
    n_recall = 0
    n_precision = 0
    for user, items in test.items():
        rank = recommend(user, n)
        hit += len(rank & items)
        n_recall += len(items)
        n_precision += n
    return [hit / (1.0 * n_recall), hit / (1.0 * n_precision)]
3: Coverage
Coverage describes a recommendation system's ability to explore the long tail of items. The simplest definition is the proportion of all items that the system is able to recommend. With U the set of users, R(u) the recommendation list for user u, and I the set of all items:
$$\mathrm{Coverage} = \frac{\left|\bigcup_{u \in U} R(u)\right|}{|I|}$$
But this definition is too coarse. A system with 100% coverage can still have countless different item popularity distributions. To describe the ability to explore the long tail more precisely, we need to count how many times each item appears in the recommendation lists and examine that distribution. Two metrics can be used to define coverage in this way.
①: Information entropy:
$$H = -\sum_{i=1}^{n} p(i) \log p(i)$$
Here p(i) is the popularity of item i divided by the sum of the popularity of all items.
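As a minimal sketch of the entropy metric, assume we have already counted how many times each item appears in the recommendation lists. The function name recommendation_entropy and the example counts are hypothetical; p(i) is taken as an item's count divided by the total count.

```python
import math

def recommendation_entropy(rec_counts):
    """rec_counts: dict item -> number of times the item was recommended."""
    total = float(sum(rec_counts.values()))
    h = 0.0
    for count in rec_counts.values():
        if count > 0:  # a term with p(i) = 0 contributes nothing
            p = count / total
            h -= p * math.log(p)
    return h

if __name__ == "__main__":
    even = {"a": 25, "b": 25, "c": 25, "d": 25}   # made-up counts
    skewed = {"a": 97, "b": 1, "c": 1, "d": 1}    # made-up counts
    print(recommendation_entropy(even), recommendation_entropy(skewed))
```

The more evenly the recommendations are spread over the items, the higher the entropy; a list dominated by one item gives a value close to 0.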
②: Gini coefficient (Gini index):
$$G = \frac{1}{n-1} \sum_{j=1}^{n} (2j - n - 1)\, p(i_j)$$
Here $i_j$ is the j-th item in the item list sorted by popularity p() in ascending order.
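from operator import itemgetter

def gini_index(p):
    # p: dictionary item -> popularity
    j = 1
    n = len(p)
    g = 0
    for item, weight in sorted(p.items(), key=itemgetter(1)):
        g += (2 * j - n - 1) * weight
        j += 1  # advance the rank; without this every item is treated as rank 1
    return g / float(n - 1)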
Matthew effect: the strong get stronger and the weak get weaker. To judge whether a recommendation system has a Matthew effect: let G1 be the Gini coefficient of item popularity computed from the initial user behavior and G2 the Gini coefficient of item popularity computed from the recommendation lists. If G2 > G1, the recommendation algorithm has a Matthew effect.
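The G1 versus G2 comparison can be sketched as follows. This is an illustration under stated assumptions: the helper names gini and has_matthew_effect and the toy counts are hypothetical, raw counts are normalized so that the popularities sum to 1, and both dictionaries are assumed to cover the same items (an item that is never recommended gets a count of 0).

```python
from operator import itemgetter

def gini(counts):
    """Gini coefficient of item popularity; counts: dict item -> count."""
    total = float(sum(counts.values()))
    n = len(counts)
    g = 0.0
    ranked = sorted(counts.items(), key=itemgetter(1))  # ascending popularity
    for j, (item, count) in enumerate(ranked, start=1):
        g += (2 * j - n - 1) * (count / total)
    return g / (n - 1)

def has_matthew_effect(initial_counts, recommended_counts):
    """True when the recommendation lists are more concentrated than the input data."""
    return gini(recommended_counts) > gini(initial_counts)

if __name__ == "__main__":
    initial = {"a": 5, "b": 3, "c": 2}       # made-up behavior counts
    recommended = {"a": 8, "b": 1, "c": 1}   # made-up recommendation counts
    print(gini(initial), gini(recommended), has_matthew_effect(initial, recommended))
```

A perfectly even distribution gives 0 and a distribution concentrated on a single item gives 1, so a larger value after recommendation means popular items have become even more dominant.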
4: Diversity
Users have a wide range of interests. The more diverse the recommendation list and the more of a user's interests it covers, the more likely the user is to find an item of interest.
Diversity is the opposite of similarity. Assume $s(i, j) \in [0, 1]$ defines the similarity between item i and item j; then the diversity of the recommendation list R(u) of user u is defined as follows:
$$\mathrm{Diversity}(R(u)) = 1 - \frac{\sum_{i, j \in R(u),\ i < j} s(i, j)}{\frac{1}{2} |R(u)| (|R(u)| - 1)}$$
Note: R(u) is the recommendation list for user u.
The overall diversity of the recommendation system can be defined as the average of the diversity of all users' recommendation lists:
$$\mathrm{Diversity} = \frac{1}{|U|} \sum_{u \in U} \mathrm{Diversity}(R(u))$$
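The two diversity definitions can be sketched as below. The notes do not say how s(i, j) is obtained, so the genre table and the Jaccard similarity genre_sim are purely hypothetical stand-ins; returning 0.0 for lists with fewer than two items is also an assumption, since the definition has no pairs to average there.

```python
from itertools import combinations

def list_diversity(rec_list, sim):
    """1 minus the average pairwise similarity of the items in rec_list."""
    n = len(rec_list)
    if n < 2:
        return 0.0  # assumption: no item pairs, so diversity is treated as 0
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))

def overall_diversity(rec_lists, sim):
    """Average list diversity over all users; rec_lists: dict user -> list."""
    return sum(list_diversity(r, sim) for r in rec_lists.values()) / float(len(rec_lists))

# Hypothetical item metadata, used only to have some similarity to plug in.
GENRES = {"m1": {"action"}, "m2": {"action", "comedy"}, "m3": {"drama"}}

def genre_sim(i, j):
    a, b = GENRES[i], GENRES[j]
    return len(a & b) / float(len(a | b))  # Jaccard similarity in [0, 1]

if __name__ == "__main__":
    print(list_diversity(["m1", "m2", "m3"], genre_sim))
    print(overall_diversity({"u1": ["m1", "m2"], "u2": ["m1", "m3"]}, genre_sim))
```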
5: Novelty
A novel recommendation is one that recommends items the user has not heard of before. The simplest way to evaluate novelty is the average popularity of the recommended results, because less popular items are more likely to feel new to users.
6: Serendipity
If a recommended result is not similar to the user's historical interests but still satisfies the user, the result is a serendipitous (pleasantly surprising) one. Novelty, by contrast, depends only on whether the user has heard of the recommended result.
Two:
User behavior data:
Its simplest form is the log, which records the user's various behaviors.
Explicit feedback behavior: behavior in which the user clearly expresses a preference for an item.
Implicit feedback behavior: behavior that does not clearly reflect the user's preference. Compared with explicit feedback, implicit feedback data is far larger in volume.
Often we do not use one unified structure to represent all behaviors, but use different representations for different behaviors.
①: Implicit feedback dataset without context information: each record contains only the user ID and the item ID.
②: Explicit feedback dataset without context information: each record contains the user ID, the item ID, and the user's rating of the item.
③: Implicit feedback dataset with context information: each record contains the user ID, the item ID, and the timestamp of the user's behavior on the item.
④: Explicit feedback dataset with context information: each record contains the user ID, the item ID, the user's rating of the item, and the timestamp of the rating behavior.
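One way to picture the four dataset types is a single record shape whose optional fields are left empty. This is only a sketch: the names Record, make_record and dataset_kind are hypothetical, and the notes above say that in practice different behaviors often get different representations.

```python
from collections import namedtuple

# rating is None for implicit feedback; timestamp is None when there is no context.
Record = namedtuple("Record", ["user_id", "item_id", "rating", "timestamp"])

def make_record(user_id, item_id, rating=None, timestamp=None):
    return Record(user_id, item_id, rating, timestamp)

def dataset_kind(record):
    """Name which of the four dataset types a record belongs to."""
    feedback = "implicit" if record.rating is None else "explicit"
    context = "without context" if record.timestamp is None else "with context"
    return feedback + " feedback, " + context

if __name__ == "__main__":
    # The IDs, rating and timestamp below are made-up values.
    print(dataset_kind(make_record(1, 10)))
    print(dataset_kind(make_record(1, 10, rating=4, timestamp=1000000000)))
```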
User behavior analysis:
Long-tail distribution: arrange the words of a text in descending order of the number of times they appear, let r denote a word's ordinal (also called its rank) and g(r) the number of occurrences of the word with rank r. The product of g(r) and a constant power $r^\beta$ of r is then asymptotically a constant: $g(r) \cdot r^\beta \approx c$. That is, the frequency of each word is inversely proportional to a constant power of its rank.
User behavior data follows the same law: items with high popularity are a small minority of all items, and users with high activity are a small minority of all users.
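Whether a set of counts (item popularity or user activity) roughly follows g(r) · r^β ≈ c can be checked by fitting a straight line in log-log space, since log g(r) = log c − β log r. The sketch below uses ordinary least squares; this fitting method and the name fit_power_law are assumptions made for illustration, not something the notes prescribe, and the input is synthetic.

```python
import math

def fit_power_law(counts):
    """Estimate (beta, c) in g(r) * r**beta ~ c from a list of counts."""
    freqs = sorted((x for x in counts if x > 0), reverse=True)  # rank 1 = most frequent
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # slope of log g(r) against log r
    beta = -slope
    c = math.exp(mean_y - slope * mean_x)
    return beta, c

if __name__ == "__main__":
    synthetic = [1000.0 / r for r in range(1, 51)]  # exact power law with beta = 1
    print(fit_power_law(synthetic))
```

The function needs at least two positive counts. On real logs the points only approximately fall on a line, so the fitted β is a rough summary rather than an exact parameter.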
Relationship between user activity and item popularity: the more active a user is, the more the user tends to browse unpopular items.
Collaborative filtering algorithms: recommendation algorithms designed using user behavior data only.
User-based collaborative filtering (UserCF): recommend to a user the items liked by other users whose interests are similar to his.
Item-based collaborative filtering (ItemCF): recommend to a user items similar to the items he liked before.
The UserCF algorithm mainly consists of two steps:
①: Find the set of users whose interests are similar to the target user's.
②: Find the items that users in this set like and that the target user has not heard of, and recommend them to the target user.
The similarity between every pair of users is calculated first. Collaborative filtering mainly uses the similarity of behavior to compute the similarity of interest. Given two users u and v, let N(u) denote the set of items on which user u has given positive feedback and N(v) the set of items on which user v has given positive feedback.
The similarity can be computed by cosine similarity:
$$w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$$
The similarity above is too rough. For example, two users both having bought a popular item does not mean their interests are similar; having bought the same unpopular items says much more about the similarity of two users' interests. Therefore, there is an improved version of the similarity:
$$w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1 + |N(i)|)}}{\sqrt{|N(u)|\,|N(v)|}}$$
where N(u) is the set of items on which user u has acted, and N(i) is the set of users who have acted on item i.
You can see that the formula uses the factor $\frac{1}{\log(1 + |N(i)|)}$ to penalize the influence that popular items in the common interest list of user u and user v have on their similarity.
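A toy comparison may make the effect of the penalty concrete. The data below is made up: every user has the item "hot", while only u and v share the item "rare". The function names are hypothetical, and the natural logarithm is assumed.

```python
import math

train = {  # hypothetical user -> items data
    "u": {"hot", "rare"},
    "v": {"hot", "rare"},
    "w": {"hot", "x"},
    "z": {"hot", "y"},
}

def item_users(train):
    """Build N(i): item -> set of users who acted on it."""
    table = {}
    for user, items in train.items():
        for item in items:
            table.setdefault(item, set()).add(user)
    return table

def cosine(train, u, v):
    common = train[u] & train[v]
    return len(common) / math.sqrt(len(train[u]) * len(train[v]))

def penalized(train, u, v, n_i):
    """Cosine similarity where each common item is weighted by 1 / log(1 + |N(i)|)."""
    common = train[u] & train[v]
    score = sum(1.0 / math.log(1 + len(n_i[i])) for i in common)
    return score / math.sqrt(len(train[u]) * len(train[v]))

if __name__ == "__main__":
    n_i = item_users(train)
    for other in ("v", "w"):
        print(other, cosine(train, "u", other), penalized(train, "u", other, n_i))
```

Under plain cosine similarity w is half as similar to u as v is. With the penalty, the shared item "hot" (four users) contributes less than the shared item "rare" (two users), so w falls further behind v.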
How should the similarity be computed in practice?
First build an item-to-user inverted table: an item may have been acted on by several users, so for each item we store the list of users who acted on it. If user u and user v both appear in the user lists of k items in the inverted table, then C[u][v] = k. So we scan the user list of each item in the inverted table, add 1 to C[u][v] for every pair of users u and v in that list, and end up with the nonzero C[u][v] between all users.
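The inverted-table scan can be sketched with plain dictionaries as below. The toy data and the names co_occurrence and cosine_from_counts are hypothetical; only user pairs that share at least one item ever get an entry, which is the point of going through the inverted table instead of comparing every pair of users.

```python
import math
from collections import defaultdict
from itertools import permutations

def co_occurrence(train):
    """Return C with C[u][v] = number of items that both u and v acted on."""
    item_users = defaultdict(set)           # inverted table: item -> users
    for user, items in train.items():
        for item in items:
            item_users[item].add(user)
    C = defaultdict(lambda: defaultdict(int))
    for users in item_users.values():
        for u, v in permutations(users, 2):  # every ordered pair with u != v
            C[u][v] += 1
    return C

def cosine_from_counts(train, C):
    """Turn the counts into cosine similarities W[u][v]."""
    W = {}
    for u, related in C.items():
        W[u] = {}
        for v, cuv in related.items():
            W[u][v] = cuv / math.sqrt(len(train[u]) * len(train[v]))
    return W

if __name__ == "__main__":
    toy = {"A": {"a", "b", "d"}, "B": {"a", "c"}, "C": {"b", "e"}, "D": {"c", "d", "e"}}
    C = co_occurrence(toy)
    print({u: dict(vs) for u, vs in C.items()})
    print(cosine_from_counts(toy, C)["A"])
```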
After obtaining the interest similarity between users, the UserCF algorithm recommends to a user the items liked by the K users whose interests are most similar to his. The following formula measures the interest of user u in item i:
$$p(u, i) = \sum_{v \in S(u, K) \cap N(i)} w_{uv}\, r_{vi}$$
where S(u, K) is the set of the K users most similar to user u, N(i) is the set of users who have acted on item i, $w_{uv}$ is the similarity between user u and user v, and $r_{vi}$ is user v's interest in item i. Because we use implicit feedback data of a single behavior, all $r_{vi} = 1$.
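A small worked example of the formula, with r_vi fixed to 1, is sketched below. The similarity values in W are made-up numbers chosen for readability rather than computed from the data, and the name recommend_for is hypothetical.

```python
def recommend_for(user, train, W, K):
    """Score unseen items by summing w_uv over the K most similar users v."""
    seen = train[user]
    neighbours = sorted(W[user].items(), key=lambda kv: kv[1], reverse=True)[:K]
    rank = {}
    for v, wuv in neighbours:
        for item in train[v]:
            if item in seen:
                continue                               # the user already acted on it
            rank[item] = rank.get(item, 0.0) + wuv * 1.0   # r_vi = 1
    return rank

if __name__ == "__main__":
    train = {"A": {"a", "b", "d"}, "B": {"a", "c"}, "C": {"b", "e"}, "D": {"c", "d", "e"}}
    W = {"A": {"B": 0.5, "C": 0.4, "D": 0.3}}          # hypothetical similarities
    print(recommend_for("A", train, W, K=2))
    print(recommend_for("A", train, W, K=3))
```

With K = 2 only B and C contribute, so item c gets 0.5 and item e gets 0.4. With K = 3 user D adds 0.3 to both, giving about 0.8 and 0.7.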
Hands-on: UserCF algorithm implementation code:
Data source
# coding: utf-8
import random
import math
import datetime
import numpy as np

numofusers = 1000  # size of the similarity arrays; must exceed the largest user ID


def GetData(datafile='u.data'):
    '''
    Read the data in datafile and return it.
    :param datafile: name of the data source file
    :return: a list whose elements are tuples (userid, movieid)
    '''
    data = []
    try:
        file = open(datafile)
    except IOError:
        print("No such file name " + datafile)
        return data
    for line in file:
        line = line.split('\t')
        try:
            data.append((int(line[0]), int(line[1])))
        except (ValueError, IndexError):
            pass
    file.close()
    return data


def SplitData(data, M, k, seed):
    '''
    Divide the data into a training set and a test set.
    :param data: list of (user, item) tuples
    :param M: 1/M of the data goes to the test set
    :param k: any number in [0, M-1]; it selects which part becomes the test set
    :param seed: random seed; with the same seed the random sequence is the same
    :return: train, test: both dictionaries, key is user ID, value is a set of movie IDs
    '''
    test = dict()
    train = dict()
    # The M experiments need the same seed so that the random sequence is identical.
    random.seed(seed)
    for user, item in data:
        # The probability of equality is 1/M, so M decides the share of the test set.
        # Choosing a different k selects a different training/test split.
        if random.randint(0, M - 1) == k:
            if user not in test:
                test[user] = set()
            test[user].add(item)
        else:
            if user not in train:
                train[user] = set()
            train[user].add(item)
    return train, test


def Recall(train, test, N, k):
    '''
    :param train: training set
    :param test: test set
    :param N: number of recommendations in TopN
    :param k: number of similar users to use
    :return: recall
    '''
    hit = 0    # number of correct predictions
    total = 0  # total number of actual behaviors
    W, relatedusers = ImprovedCosineSimilarity(train)
    for user in train.keys():
        tu = test.get(user, set())  # a user may have no test records
        rank = GetRecommendation(user, train, N, k, W, relatedusers)
        for item in rank:
            if item in tu:
                hit += 1
        total += len(tu)
    return hit / (total * 1.0)


def Precision(train, test, N, k):
    '''
    :param train: training set
    :param test: test set
    :param N: number of recommendations in TopN
    :param k: number of similar users to use
    :return: precision
    '''
    hit = 0
    total = 0
    W, relatedusers = ImprovedCosineSimilarity(train)
    for user in train.keys():
        tu = test.get(user, set())
        rank = GetRecommendation(user, train, N, k, W, relatedusers)
        for item in rank:
            if item in tu:
                hit += 1
        total += N
    return hit / (total * 1.0)


def Coverage(train, test, N, k):
    '''
    Compute coverage.
    :param train: training set, dictionary user -> items
    :param test: test set, dictionary user -> items
    :param N: number of recommendations in TopN
    :param k: number of similar users to use
    :return: coverage
    '''
    recommend_items = set()
    all_items = set()
    W, relatedusers = ImprovedCosineSimilarity(train)
    for user in train.keys():
        for item in train[user]:
            all_items.add(item)
        rank = GetRecommendation(user, train, N, k, W, relatedusers)
        for item in rank:
            recommend_items.add(item)
    return len(recommend_items) / (len(all_items) * 1.0)


def Popularity(train, test, N, k):
    '''
    Compute the average popularity of the recommended items.
    :param train: training set, dictionary user -> items
    :param test: test set, dictionary user -> items
    :param N: number of recommendations in TopN
    :param k: number of similar users to use
    :return: average popularity
    '''
    item_popularity = dict()
    W, relatedusers = ImprovedCosineSimilarity(train)
    for user, items in train.items():
        for item in items:
            if item not in item_popularity:
                item_popularity[item] = 0
            item_popularity[item] += 1
    ret = 0
    n = 0
    for user in train.keys():
        rank = GetRecommendation(user, train, N, k, W, relatedusers)
        for item in rank:
            ret += math.log(1 + item_popularity[item])
            n += 1
    ret /= n * 1.0
    return ret


def CosineSimilarity(train):
    '''
    Compute the cosine similarity of every pair of users in the training set.
    This function is not practical: its complexity is high and it easily runs
    out of memory when the training set is large. It does follow the formula
    closely, so it is useful for understanding the formula.
    :param train: training set, dictionary user -> items
    :return: similarity matrix
    '''
    W = dict()
    print(len(train.keys()))
    for u in train.keys():
        for v in train.keys():
            if u == v:
                continue
            W[(u, v)] = len(train[u] & train[v])
            W[(u, v)] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)
            W[(v, u)] = W[(u, v)]
    return W


def ImprovedCosineSimilarity(train):
    '''
    Compute user similarity.
    :param train: training set, dictionary user -> items
    :return: user similarity matrix W, where W[u][v] is the similarity of u and v
    :return: dictionary user_relatedusers, key is a user ID, value is the set of
             users who share at least one movie with that user
    '''
    # Build the movie -> users inverted table
    item_user = dict()
    for u, items in train.items():
        for i in items:
            if i not in item_user:
                item_user[i] = set()
            item_user[i].add(u)
    # C[u][v] is the (penalized) count of movies liked by both user u and user v
    C = np.zeros([numofusers, numofusers], dtype=np.float16)
    # N[u] is the number of movies rated by u
    N = np.zeros([numofusers], dtype=np.int32)
    # user_relatedusers[u] is the set of users related to u (shared movies not zero)
    user_relatedusers = dict()
    # For each movie, update C[u][v] for every pair of users who acted on it
    for item, users in item_user.items():
        for u in users:
            N[u] += 1
            for v in users:
                if u == v:
                    continue
                if u not in user_relatedusers:
                    user_relatedusers[u] = set()
                user_relatedusers[u].add(v)
                C[u][v] += (1 / math.log(1 + len(users)))
    # User similarity matrix
    W = np.zeros([numofusers, numofusers], dtype=np.float16)
    for u in range(1, numofusers):
        if u in user_relatedusers:
            for v in user_relatedusers[u]:
                W[u][v] = C[u][v] / np.sqrt(N[u] * N[v])
    return W, user_relatedusers


def Recommend(user, train, W, relatedusers, k, N):
    '''
    Use the similarity matrix W to get the ranked items for a user.
    :param user: user ID
    :param train: training set
    :param W: similarity matrix
    :param relatedusers: dictionary user -> related users
    :param k: how many similar users to use
    :param N: number of recommendations (not used here)
    :return: list of (movie ID, interest) with nonzero interest, sorted from large to small
    '''
    rank = dict()
    k_users = dict()
    try:
        for v in relatedusers[user]:
            k_users[v] = W[user][v]
    except KeyError:
        print("User " + str(user) + " doesn't have any related users in train set")
    k_users = sorted(k_users.items(), key=lambda x: x[1], reverse=True)
    k_users = k_users[0:k]  # take the top k users
    seen = train.get(user, set())
    for v, wuv in k_users:
        for i in train[v]:
            # Movies that similar user v acted on and that user has not acted on
            if i not in seen:
                rank[i] = rank.get(i, 0) + wuv * 1
    return sorted(rank.items(), key=lambda d: d[1], reverse=True)


def GetRecommendation(user, train, N, k, W, relatedusers):
    '''
    Get N recommendations.
    :param user: user ID
    :param train: training set
    :param N: number of recommendations
    :param k: how many similar users to use
    :param W: similarity matrix
    :return: recommend dictionary, key is movie ID, value is interest level
    '''
    rank = Recommend(user, train, W, relatedusers, k, N)
    recommend = dict()
    for item, interest in rank[:N]:  # fewer than N items may be available
        recommend[item] = interest
    return recommend


def Evaluate(train, test, N, k):
    # Compute the set of evaluation metrics
    recall = Recall(train, test, N, k)
    precision = Precision(train, test, N, k)
    coverage = Coverage(train, test, N, k)
    popularity = Popularity(train, test, N, k)
    return recall, precision, coverage, popularity


def test1():
    data = GetData()
    train, test = SplitData(data, 2, 1, 1)
    del data
    user = int(input("Input the user id \n"))
    print("The train set contains the movies of the user:")
    print(train[user])
    N = int(input("Input the number of recommendations\n"))
    K = int(input("Input the number of related users\n"))
    starttime = datetime.datetime.now()
    W, relatedusers = ImprovedCosineSimilarity(train)
    endtime = datetime.datetime.now()
    print("It takes", (endtime - starttime).seconds, "seconds to get W")
    starttime = datetime.datetime.now()
    recommend = GetRecommendation(user, train, N, K, W, relatedusers)
    endtime = datetime.datetime.now()
    print("It takes", (endtime - starttime).seconds, "seconds to get recommend for one user")
    print(recommend)
    for item in recommend:
        print(item, end=" ")
        if item in test.get(user, set()):
            print("True")
        else:
            print("False")


def test2():
    N = int(input("Input the number of recommendations: \n"))
    K = int(input("Input the number of related users: \n"))
    data = GetData()
    train, test = SplitData(data, 2, 1, 1)
    del data
    recall, precision, coverage, popularity = Evaluate(train, test, N, K)
    print("Recall:", recall)
    print("Precision:", precision)
    print("Coverage:", coverage)
    print("Popularity:", popularity)


if __name__ == '__main__':
    test2()