Starting with this article, I plan to study data mining and write about it as I go, mainly to push myself to keep learning and to summarize what I pick up.
My first online purchase was a "Unix/Linux Programming Practice Tutorial" from China-Pub: the book was good, it was cheap, and I could pay on delivery, which suited a lazy student like me. One purchase led to many more; I later switched to Dangdang, then to Amazon, and these days I mostly buy from JD. Besides books, I also buy daily necessities from Yihaodian. Over time I noticed that many of these sites offer recommendations. Dangdang and China-Pub seem to have no idea what to recommend; JD's recommendations are far from my interests; but Amazon impressed me, and the emails it pushed went straight to my heart. So this time I am going to play with a toy recommendation system of my own.
Okay, suppose I'm bored tonight and want a book to entertain myself. There are so many books out there; which one should I buy?
1) Call a friend and ask for a recommendation;
2) Log on to a website, browse the best-seller list, and pick one that looks interesting.
Problems show up gradually. First, friends' recommendations and the best-seller lists are fairly stable and rarely offer anything new; once you have read through them, there is nothing left. Second, not many of those books match your taste: I am very tired of books by business tycoons and business schools, and just as tired of the child-rearing and health-care titles that fill the top 100.
Well, everyone is talking about big data now. How can big data help with recommendations? First, consider that I liked the Unix/Linux Programming Practice Tutorial, so I will probably like other books with a similar writing style and content. I liked one book from Packt Publishing, so I may like other books from the same publisher (in fact I like this publisher very much). I liked Thinking in C++, so I may also like other books written by Bruce Eckel (and vice versa). In another situation, I am a coder, and I have a coder friend whose taste is close to mine: his favorite books may well become my favorites, and mine may become his (and vice versa). To sum up, the first idea is item-based recommendation: I bought A, so I am likely to like B, which is similar to A. The second is user-based recommendation: C and I share the same interests; he likes D, so I may like D too.
Let's talk about ratings first. Two kinds come to mind: bought or not bought (1 and 0); or a star rating (usually out of five stars), sometimes with half stars allowed, giving (0, 0.5, 1, 1.5, 2, 2.5, ..., 5). In the examples below we use a decimal number between 0 and 5, where a higher number means a stronger liking. For example, if I love A, I give it 5 stars; if I somewhat like B, I give it 3.8 stars; if I really hate C, I give it 0 stars. That gives us the following variable, critics:
critics = {
    'user1': {'goods1': 2.5, 'goods2': 3.5, 'goods3': 3.0, 'goods4': 3.5, 'goods5': 2.5, 'goods6': 3.0},
    'user2': {'goods1': 3.0, 'goods2': 3.5, 'goods3': 1.5, 'goods4': 5.0, 'goods5': 3.5, 'goods6': 3.0},
    'user3': {'goods1': 2.5, 'goods2': 3.0, 'goods4': 3.5, 'goods6': 4.0},
    'user4': {'goods2': 3.5, 'goods3': 3.0, 'goods4': 4.0, 'goods5': 2.5, 'goods6': 4.5},
    'user5': {'goods1': 3.0, 'goods2': 4.0, 'goods3': 2.0, 'goods4': 3.0, 'goods5': 2.0, 'goods6': 3.0},
    'user6': {'goods1': 3.0, 'goods2': 4.0, 'goods4': 5.0, 'goods5': 3.5, 'goods6': 3.0},
    'user7': {'goods2': 4.5, 'goods4': 4.0, 'goods5': 1.0}}
Next we define "similarity": how alike two users' tastes are. How should we measure it? How do we decide which other users in the critics variable above are most similar to user1, so that we can then make recommendations? The first thing that comes to mind is the Euclidean distance score; it is the simplest formula for the distance between two points.
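Written out, with $x_i$ and $y_i$ the two users' ratings of the items they have both rated, the distance and the similarity score used by the code below are:

$$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}, \qquad sim(x, y) = \frac{1}{1 + d(x, y)}$$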
The "similarity" between any two users is calculated as follows ":
from math import sqrt

def sim_distance(prefs, person1, person2):
    # collect the items both people have rated
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # no items in common: no basis for comparison
    if len(si) == 0:
        return 0
    # sum of squared rating differences over the shared items
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])
    # turn the distance into a similarity score in (0, 1]
    return 1 / (1 + sqrt(sum_of_squares))
Note the last line: with the plain Euclidean distance, the closer two people's tastes are, the smaller the value, whereas a similarity score should conventionally grow as people get closer. The transformation 1/(1 + d) corrects for this: it always returns a value between 0 and 1, and the closer the value is to 1, the more alike the two users' tastes.
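As a quick check of the arithmetic: user1 and user2 share six rated items, and the sum of their squared rating differences is 0.25 + 0 + 2.25 + 2.25 + 1.0 + 0 = 5.75, so sim_distance(critics, 'user1', 'user2') returns 1/(1 + sqrt(5.75)) ≈ 0.294.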
Now, some people are always picky and some are not. Picky people tend to give low ratings across the board (say 1 and 3), while less picky people tend to give high ones across the board (say 3 and 5), even when the two have the same underlying preferences: both give goods1 the rating they personally consider high (3 and 5 respectively), and both give goods2 the rating they personally consider low (1 and 3 respectively). In this case the Euclidean distance score does not work so well. How can we correct for this bias? Let's take a look at the Pearson correlation score.
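For reference, this is the standard computational form of Pearson's r that the code below implements, where the sums run over the n items both users have rated:

$$r = \frac{\sum_i x_i y_i - \frac{(\sum_i x_i)(\sum_i y_i)}{n}}{\sqrt{\left(\sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n}\right)\left(\sum_i y_i^2 - \frac{(\sum_i y_i)^2}{n}\right)}}$$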
The Pearson correlation between users is calculated as follows:
def sim_pearson(prefs, p1, p2):
    # collect the items both people have rated
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    if len(si) == 0:
        return 0
    n = len(si)
    # sums and sums of squares of each person's ratings
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # sum of the products of the paired ratings
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Pearson's r: covariance divided by the product of standard deviations
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
There are many other ways to compute similarity, which you can study later; one alternative is sketched below.
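As an illustration (my own sketch, not part of the code above): the Jaccard coefficient ignores the rating values entirely and just compares which items the two users have rated. Since it has the same signature, it can be dropped into the functions that follow.

def sim_jaccard(prefs, p1, p2):
    # the sets of items each person has rated
    s1 = set(prefs[p1])
    s2 = set(prefs[p2])
    if not s1 or not s2:
        return 0
    # size of the overlap divided by the size of the union
    return len(s1 & s2) / float(len(s1 | s2))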
Now we can find the users most similar to any given user:
def topMatches(prefs, person, n=5, similarity=sim_pearson):
    # score every other user against this person
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    # sort from most to least similar and keep the top n
    scores.sort()
    scores.reverse()
    return scores[0:n]
Note that the last parameter of topMatches is a function: any of the similarity measures described above will do, as long as it has the same signature. This way we can swap in whichever similarity computation we want at any time, as in the example below.
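For instance, to rank by the Euclidean distance score instead of the default Pearson correlation (same data, different measure, so the scores will differ):

>>> test.topMatches(test.critics, 'user1', n=3, similarity=test.sim_distance)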
Let's take a look at the results calculated on my machine:
>>> import test
>>> test.topMatches(test.critics, 'user1')
[(0.9912407071619299, 'user7'), (0.7470178808339965, 'user6'), (0.5940885257860044, 'user5'), (0.5669467095138396, 'user4'), (0.40451991747794525, 'user3')]
>>> test.topMatches(test.critics, 'user2')
[(0.963795681875635, 'user6'), (0.41176470588235276, 'user5'), (0.39605901719066977, 'user1'), (0.38124642583151164, 'user7'), (0.31497039417435607, 'user4')]
>>> test.topMatches(test.critics, 'user3')
[(1.0, 'user4'), (0.40451991747794525, 'user1'), (0.20459830184114206, 'user2'), (0.13483997249264842, 'user6'), (-0.2581988897471611, 'user5')]
>>> test.topMatches(test.critics, 'user4')
[(1.0, 'user3'), (0.8934051474415647, 'user7'), (0.5669467095138411, 'user5'), (0.5669467095138396, 'user1'), (0.31497039417435607, 'user2')]
>>> test.topMatches(test.critics, 'user5')
[(0.9244734516419049, 'user7'), (0.5940885257860044, 'user1'), (0.5669467095138411, 'user4'), (0.41176470588235276, 'user2'), (0.21128856368212925, 'user6')]
>>> test.topMatches(test.critics, 'user6')
[(0.963795681875635, 'user2'), (0.7470178808339965, 'user1'), (0.66284898035987, 'user7'), (0.21128856368212925, 'user5'), (0.13483997249264842, 'user3')]
>>> test.topMatches(test.critics, 'user7')
[(0.9912407071619299, 'user1'), (0.9244734516419049, 'user5'), (0.8934051474415647, 'user4'), (0.66284898035987, 'user6'), (0.38124642583151164, 'user2')]
Now that we have finally found some kindred spirits who share our taste, how do we produce a recommendation?
(1) We can take the most similar user and pick an item he rates highly that I have not seen yet;
(2) Among all like-minded users, we can weight each user's ratings by his similarity to me and score every item, producing a ranking to recommend from.
Obviously, we adopt the second method: take each other reviewer's similarity, multiply it into his rating of each item, and sum these products to get the ranking we want.
There is one problem here: an item rated by many users would pile up a large total and rank high, while an item with few ratings would carry little weight in the final result, regardless of how well it was actually liked. To correct for this, we divide each item's weighted total by the sum of the similarities of all the users who rated it. That makes things fair, hahaha. A quick worked example, then the code:
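Say (with numbers invented for illustration) two like-minded users rate some item I haven't seen: one with similarity 0.9 rates it 3.0, another with similarity 0.5 rates it 1.5. The predicted score is (0.9×3.0 + 0.5×1.5) / (0.9 + 0.5) = 3.45 / 1.4 ≈ 2.46, a similarity-weighted average rather than a raw sum.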
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # ignore users with zero or negative similarity
        if sim <= 0:
            continue
        for item in prefs[other]:
            # only score items I haven't rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # accumulate similarity * rating per item
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                # accumulate the similarities of everyone who rated the item
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # normalize: weighted total divided by the sum of similarities
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()
    return rankings
Next, we can make recommendations. Here are the results on my machine; user1, user2, and user5 have already rated every item, so there is nothing new to recommend to them:
>>> import test
>>> test.getRecommendations(test.critics, 'user1')
[]
>>> test.getRecommendations(test.critics, 'user2')
[]
>>> test.getRecommendations(test.critics, 'user3')
[(2.8092760065251268, 'goods3'), (2.694636703980363, 'goods5')]
>>> test.getRecommendations(test.critics, 'user4')
[(2.683756272799255, 'goods1')]
>>> test.getRecommendations(test.critics, 'user5')
[]
>>> test.getRecommendations(test.critics, 'user6')
[(2.1505590044630245, 'goods3')]
>>> test.getRecommendations(test.critics, 'user7')
[(3.3477895267131013, 'goods6'), (2.832549918264162, 'goods1'), (2.5309807037655645, 'goods3')]
So we have not only computed similarity but also produced recommendations. It is worth mentioning that the same similarity computation applies not only between users but also between items: "customers who bought this also bought..." lists, built from the similarity between items users have purchased, are common on JD and Amazon.
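A minimal sketch of that idea (my own addition, reusing the critics data and topMatches above): transpose the preferences so that items map to user ratings, then feed the result back into topMatches to find the items most similar to a given one.

def transformPrefs(prefs):
    # flip {user: {item: rating}} into {item: {user: rating}}
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            result[item][person] = prefs[person][item]
    return result

# items most similar to goods2, scored by the same Pearson measure
itemsim = topMatches(transformPrefs(critics), 'goods2')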