Preamble:
Recently a friend of mine has been learning about recommendation systems and wants to implement a complete one, so every now and then we have some discussions and derivations; after thinking it over, I figured I might as well write it all up.
This post starts from the engineering pipeline of a recommendation system, intersperses some experience along the way, and introduces the latest research progress and the main schools of recommendation algorithms. Of course, since I am still young at doing recommendation systems, many of my views may be biased or even wrong; take them as one perspective, and corrections from the greats are welcome.
Reading list:
Although many people feel that, as one branch of AI, recommendation is not as difficult a problem as natural language processing, I think that before starting work you should at least read a few basic books like the following, or at least skim their tables of contents, to get a preliminary understanding of recommendation systems.
Chinese books:
1. "Recommendation System Practice" by Xiang Liang http://book.douban.com/subject/10769749/
Say this book is good: it is no match for the Handbook. Say it is not good: it really does explain many basic, simple questions in great detail, and in Chinese at that. It is a thin book you can flip through in minutes; in a word, a good book for getting started.
Foreign-language books:
1. "recommender Systems Handbook" Paul B. Kantor http://book.douban.com/subject/3695850/
In fact, all the books that dare to call themselves Handbook are God's books. This book is very thin, and very full, if you go to see some of the recommended system of some of the less popular topic paper, such as fighting spam, and even can find a lot of paper is directly quoted in the relevant chapters of the book content. It can be said that this book is to do the recommended classmate Pillow book, did not read the book out of the blow when you are embarrassed to say that you are to do the recommendation, but said really, really read not a few, the general is where to check where, I just went to the watercress verification, a few do recommend is read, a group of literary youth are want to The only drawback of this book is that there is no new version, and some places have become tears of the times.
2. "recommender systems-an Introduction" Dietmar Jannach
http://book.douban.com/subject/5410320/
Similar to the one above, the college-based book, a thick one, does not look when used as a pillow.
3. "Mahout in Action" Sean Owen http://book.douban.com/subject/4893547/
One of the books above is a theoretical basis, and the book is a little bit out of the works. If you want to use the Mahout framework to do the recommendation system, then you have to sweep it, if you do not mahout, look at the process and code organization also has some advantages.
Papers:
Because many parts of the "Recommender Systems Handbook" have become tears of the times, I recommend reading some survey papers as a supplement, to broaden your horizons.
One is the recommender-systems review in Physics Reports, arguably the newest and most comprehensive survey; after reading it, you will know in minutes what academia has been fiddling with lately. http://arxiv.org/pdf/1202.1112v1.pdf
The classic one is the "state of the art" survey of recommender systems that many older comrades like to recommend; precisely because of its age, it too has become tears of the times.
Getting started:
The reading list above does not mean you must finish everything before you can start; it just means: skim a table of contents first, and when you hit something you don't know, know where to look it up. In the walkthrough of building a recommendation system below, I only want to show how a common recommendation system is built, so I cut corners in many places; please bear with me. Also, since what I build is not an online recommendation system but the offline, next-day kind, the engineering narrative below covers offline recommendation only.
Data preparation:
Generally speaking, a recommendation system takes data in one of two ways: one reads from online sources, i.e., the system reacts as soon as the user produces a behavior (legend has it Douban FM does it this way?); the other reads from a database.
What I personally do: say hello to the anti-cheating folks and agree with them on how user-behavior data gets recorded on the various online servers, then write a PHP script that fetches the logs I need from each server and pulls in the latest day's data.
When storing this data, though, you should actually add some judgment, i.e., record it with classification (because many records are brushed by others, for example someone throwing a link into a QQ group to get people to help vote); I won't elaborate here, we will come back to it in the fighting-spam section.
Data filtering:
Once we have the data generated each day, frankly there is too much of it, and of course we don't need all of it, so we write a filtering module to filter out the data we won't use.
What I usually do: write a Python script that puts each filter into a separate module and registers the filters to be used on a chain of responsibility, which makes maintenance easier both for others and for myself. By the way, the filters generally come in a few kinds: one is for items that only a single user has ever rated, with no one else having scored them; if you feed such data to the recommendation model, Mahout will automatically ignore it, but since by the power law there is a lot of it, you might as well save the memory. Then there are blacklists, both item blacklists and user blacklists, which should be removed beforehand.
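To make the chain-of-responsibility registration concrete, here is a minimal sketch. My real filter script is in Python; this version is in Java to match the Mahout snippets later, and the names in it (RecordFilter, FilterChain) are hypothetical illustrations, not any library's API:

import java.util.ArrayList;
import java.util.List;

// Each filter decides whether a (user, item, score) record survives.
interface RecordFilter {
    boolean accept(long userId, long itemId, float score);
}

class FilterChain {
    private final List<RecordFilter> filters = new ArrayList<RecordFilter>();

    // Filter modules register themselves here and are applied in order.
    void register(RecordFilter f) {
        filters.add(f);
    }

    boolean accept(long userId, long itemId, float score) {
        for (RecordFilter f : filters) {
            if (!f.accept(userId, itemId, score)) {
                return false; // the first rejecting filter short-circuits the chain
            }
        }
        return true;
    }
}

The single-rater filter and the blacklist filters would then each be one small class registered on the chain.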
Data storage:
Here everyone sees things their own way, because in general the algorithm you choose, its concrete implementation, and your infrastructure determine how you store data; not much to say.
What I usually do: process incrementally in the usual way, compute once a day, and roll over once a week. Because of peculiarities in my algorithm implementation, every 40 item-user pairs are stored together, somewhat like a bitmap.
Recommendation system algorithms:
I have written small records and notes on this part before, so I am pasting them in directly _(:з」∠)_
The core algorithms of the recommendation system are mainly implemented with Mahout.
Every algorithm has its own specific assumptions about recommendation; as for when and which algorithm performs better, it depends on how well those assumptions hold, and you simply have to experiment repeatedly.
So which algorithm do we generally use? Look at what Mahout offers: all sorts, item-based, user-based, Slope One, SVD and so on, all the common ones are there; the question is which to use.
First, some notes on the user-based algorithm implementations in Mahout:
The first step is to compute the similarity matrix W between all users and then iterate over the items; Mahout has in fact already implemented this.
The similarity matrix can also be kept around, so that later it can be used for things like spectral clustering and for validation.
UserSimilarity encapsulates the similarity between users:
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood encapsulates the group of users most similar to a given user:
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
In a word: use DataModel to build the data model, UserSimilarity to generate the user-user similarity matrix, UserNeighborhood to define a user's neighborhood, and Recommender to implement the recommendation engine.
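Putting these pieces together, a minimal end-to-end sketch looks like the following (the file name ratings.csv and its userID,itemID,score line format are assumptions for illustration):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class UserBasedDemo {
    public static void main(String[] args) throws Exception {
        // each line of ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3); // top-3 for user 1
        for (RecommendedItem item : items) {
            System.out.println(item);
        }
    }
}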
Judging the quality of recommendations is done with an Evaluator:
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05);
This builds the model with 95% of the data and tests against the held-out rest (the second number, 0.05, asks the evaluator to run over only 5% of users, to keep evaluation fast).
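For reference, a sketch of the full evaluation call, continuing the demo above (the builder wraps whatever recommender you are testing):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;

RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
// train on 95% of each user's preferences, evaluate over 5% of users
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05);
System.out.println(score); // lower is better for this evaluator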
Fixed-size neighborhoods
For how many people to use as the circle around a user, we actually have no definite value; as with doing biology experiments, you have to keep running your particular data set to find out.
Threshold-based neighborhoods
Of course, we can also define a threshold to find the sufficiently similar group of users.
The threshold is defined between -1 and 1 (the similarity returned by the similarity matrix lies in this range).
new ThresholdUserNeighborhood(0.7, similarity, model)
Now let's give each algorithm a simple compari(gripe)son:
(Say we are doing product recommendation like Amazon, and in the end we produce top-K recommendations.)
Item-based
Generally speaking, item-based runs faster, because there are fewer items than users.
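The item-based call in Mahout mirrors the user-based one, minus the neighborhood; a minimal sketch reusing the model from the demo earlier (note that PearsonCorrelationSimilarity implements both the user and item similarity interfaces):

import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
Recommender recommender = new GenericItemBasedRecommender(model, similarity);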
Slope One
To be honest, for things that lean heavily on personal taste, I don't think this algorithm is very reliable.
But on the GroupLens million-rating data set its result is 0.65.
Of course, for the right kind of system this is still a desirable result.
The assumption of this algorithm is that there is a linear relationship between preferences for different items: for example, if users who rated both items on average scored item B half a point higher than item A, and you gave A a 3, Slope One predicts 3.5 for B.
The advantage of Slope One, though, is that its online computation is fast, and its performance is not determined by the number of users.
The way to call it in Mahout is new SlopeOneRecommender(model).
This method provides two kinds of weighting: weighting based on count, and weighting based on standard deviation.
Count: the more users rate a pair of items, the greater the weight that diff gets in the weighted average.
Standard deviation: the lower the deviation of the diffs, the higher the weight.
These two weightings are enabled by default; of course, disabling them makes the result only slightly worse, 0.67.
However, the obvious disadvantage of this algorithm is that it takes up a lot of memory.
Fortunately, we can put the data into a database instead: MySQLJDBCDataModel.
Singular value decomposition-based recommenders
In fact, although SVD loses some information, it can sometimes improve recommendation results.
The process smooths the input in a useful way.
new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10))
The first parameter, 10, is the number of target features (latent factors).
The second is lambda, the regularization parameter.
The last parameter is the number of training steps to run.
KnnItemBasedRecommender
Embarrassingly, this one actually uses KNN to do the item-based algorithm; it rather resembles the earlier approach of choosing a threshold and then circling in the users.
But the cost of KNN is very high, because it compares all the items against each other.
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
Optimizer optimizer = new NonNegativeQuadraticOptimizer();
return new KnnItemBasedRecommender(model, similarity, optimizer, 10);
The result is not bad: 0.76.
Cluster-based recommendation
Clustering-based recommendation can be seen as the best variant of the idea behind the user-based algorithm:
recommendations are made per cluster, for every user in that cluster.
This algorithm recommends very fast, because everything has been computed beforehand.
This algorithm is also pretty good for cold start.
Don't you feel the clustering algorithm used in Mahout rather resembles k-means?
TreeClusteringRecommender
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
ClusterSimilarity clusterSimilarity =
    new FarthestNeighborClusterSimilarity(similarity);
return new TreeClusteringRecommender(model, clusterSimilarity, 10);
Note that the similarity between two clusters is defined by the ClusterSimilarity;
besides FarthestNeighborClusterSimilarity, NearestNeighborClusterSimilarity is also available.
Gripes:
When selecting an algorithm, we should actually tie the choice to the goal we want to recommend for. Why has academia lately been so hot on the SVD family of algorithms, LDA, pLSA and whatnot? Really because Netflix's requirement was to optimize RMSE, which from a machine-learning perspective resembles a regression problem; from an industry perspective, however, our usual need is to make top-K recommendations, which is more like a classification problem. That is why, compared with the SVD family, the more ad hoc item-based algorithms perform better. Of course, the first-place team in the 2012 KDD Cup used an item-based + SVD algorithm, but that is another story.
So let's assume that for our top-K product recommendation problem item-based serves us well (fast, with decent results); the next thing is to determine the similarity measure.
Determining similarity:
Again, a simple compari(gripe)son of each similarity:
PearsonCorrelationSimilarity
Pearson correlation:
coeff = corr(X, Y);
function coeff = myPearson(X, Y)
% This function implements the calculation of the Pearson correlation coefficient.
%
% Input:
%   X: input numeric sequence
%   Y: input numeric sequence
%
% Output:
%   coeff: correlation coefficient of the two input sequences X, Y
%
if length(X) ~= length(Y)
    error('The dimensionality of the two numeric sequences is not equal');
    return;
end
% fenzi = numerator, fenmu = denominator
fenzi = sum(X .* Y) - (sum(X) * sum(Y)) / length(X);
fenmu = sqrt((sum(X .^ 2) - sum(X)^2 / length(X)) * (sum(Y .^ 2) - sum(Y)^2 / length(Y)));
coeff = fenzi / fenmu;
end % function myPearson ends
The correlation coefficient is only defined when the standard deviations of both variables are nonzero. The Pearson correlation coefficient applies when:
(1) the two variables are linearly related, and both are continuous data;
(2) both variables are roughly normally distributed, or at least close to a unimodal normal distribution;
(3) the observations of the two variables come in pairs, and each pair of observations is independent of the others.
Problems:
1. It does not consider the number of items on which two users' preferences overlap.
2. With only one overlapping item, the correlation cannot be computed, which is something to watch out for when comparing sparse or small data sets. (In general, though, two users overlapping on only one item are intuitively not very similar anyway.)
Pearson correlation appears frequently in early recommendation papers and books, but it is not always good.
Mahout adds a weighting parameter, Weighting.WEIGHTED, which can improve the recommendation results.
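Concretely, that is the two-argument constructor; a sketch:

import org.apache.mahout.cf.taste.common.Weighting;

UserSimilarity similarity = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);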
EuclideanDistanceSimilarity
It returns 1 / (1 + d), where d is the Euclidean distance between the two preference vectors.
Cosine measure similarity
When both input series are centered around a mean of 0, it gives the same result as Pearson correlation,
so in Mahout we can simply use PearsonCorrelationSimilarity instead.
Spearman correlation
This method works on ranks: although it loses the concrete scoring information, it preserves the order of the items.
The returned result is a value between -1 and 1, and like Pearson, it cannot handle the case of only one overlapping item.
Moreover, the algorithm is slow, because it computes and stores rank information; so it shows up more in papers and less in practice, though it is worth considering for small data sets.
CachingUserSimilarity
UserSimilarity similarity = new CachingUserSimilarity(
    new SpearmanCorrelationSimilarity(model), model);
Ignoring preference values in similarity: the Tanimoto coefficient
TanimotoCoefficientSimilarity
Even when preference values exist in the first place, this method can be used when the signal in the mere user-item association outweighs the noise in the values.
Generally, though, results are better when the preference information is used.
Log-likelihood
Log-likelihood similarity tries to assess how unlikely it is that the overlap between two users is mere coincidence;
the resulting value can be interpreted as the degree to which their overlap is not coincidental.
The results of this algorithm may be better than Tanimoto; it is a smarter metric.
Inferring preferences
For smaller amounts of data, Pearson struggles, for example with a user who has expressed only one preference,
so you have to estimate the similarity ...
AveragingPreferenceInferrer
setPreferenceInferrer()
In practice, however, this method is not useful; it is only mentioned in early papers.
Inferring adds no real information to the data at hand, and it greatly reduces computation speed.
Finally, we want to compare the above similarities by experiment; in general, we evaluate with precision, recall, and coverage.
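Mahout ships an evaluator for exactly these information-retrieval style metrics; here is a sketch of measuring precision and recall at 10, reusing a recommenderBuilder like the one above:

import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;

RecommenderIRStatsEvaluator irEvaluator = new GenericRecommenderIRStatsEvaluator();
IRStatistics stats = irEvaluator.evaluate(
    recommenderBuilder, null, model, null, 10,
    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision()); // precision at 10
System.out.println(stats.getRecall());    // recall at 10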
Here is a talk, "Building industrial-scale real-world recommender systems"
http://vdisk.weibo.com/s/rmsj-
about Netflix, and very good; I won't flog it any further, which is why above I only griped about the common algorithms and similarity measures.
In fact, the algorithms can be divided by school into the following categories; those interested can look into them, I won't introduce them further:
Similarity-based methods
Dimensionality reduction techniques
Dimensionality-based methods
Diffusion-based methods
Social filtering
Meta approaches
The similarities and recommendation algorithms I covered above count as only a very small fraction of similarity-based methods and dimensionality reduction techniques.
PS: I just asked on Douban, and they said they use the first two. Actually I also think collaborative filtering + SVD, with the occasional topic model, is enough; if you have nothing better to do, bolt on a bit of social/trust-based stuff.
Adding rules:
I remember Hulu saying in a presentation, "A recommendation system that cannot be customized is not a good recommendation system" (something like that...). In fact, the recommendation results need plenty of processing before being shown to users, and this is when we add rules.
1. Improving recommendation quality: some collaborative filtering algorithms inevitably produce ironic results; for example, one user who buys everything can end up causing the system to recommend swimwear to entirely the wrong audience (true story). There are two remedies: adjust the model, or add rules imposing certain restrictions. Another common example is occasionally recommending winter clothes in summer; for such seasonal products my usual approach is to apply a time decay, as sketched below.
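One way to hang such a time-decay rule off Mahout is an IDRescorer applied at recommend time; a minimal sketch, where isSeasonal and monthsOutOfSeason are hypothetical helpers you would wire to your own catalog data:

import org.apache.mahout.cf.taste.recommender.IDRescorer;

IDRescorer seasonalDecay = new IDRescorer() {
    public double rescore(long itemId, double originalScore) {
        // damp the score the further the item is out of season
        return isSeasonal(itemId)
            ? originalScore * Math.pow(0.5, monthsOutOfSeason(itemId))
            : originalScore;
    }
    public boolean isFiltered(long itemId) {
        return false; // decay only, never drop an item outright
    }
    // hypothetical helpers: replace with lookups into your own catalog
    private boolean isSeasonal(long itemId) { return false; }
    private double monthsOutOfSeason(long itemId) { return 0; }
};
List<RecommendedItem> items = recommender.recommend(1, 10, seasonalDecay);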
2. Adding advertising and steering positioning: inserting ads, things we favor — not much to say here, it is done by rules. Also, what users "like" is not necessarily good for the product: for example, users generally like cheap things and flashy Korean-wave bling, and if everything you push looks like that, the whole system turns into one big wash-cut-and-blowdry collection, which badly hurts positioning; this too is a job for rules.
3. Doing some data mining and fighting-spam work: this is covered in the fighting-spam section below.
Visual parameter tuning:
Once the work above is done, the infrastructure of the recommendation system is roughly there; but each algorithm, and your own rules, have a plethora of parameters to tune. At this point I generally write a test script and visualize the tuning results. I personally recommend Highcharts: watching the parameters and comparing the metrics with it is very refreshing, and you can also add your own customizations, such as log scales, which is very convenient. http://www.highcharts.com/
Tuning parameters and going live:
There are two things to do before going live, typically offline testing and A/B testing.
Offline testing means sampling part of the data, splitting it into train and test sets, then evaluating metrics such as precision, recall, and coverage, observing and comparing with the visualization tool above. When it feels about right, call the PM over and have her show it to the team members to judge look and feel. This is the rough version; some places do it more finely, with user research and recruiting people to actually use the system, but that is another story.
A/B testing is everyone's favorite part. Because, honestly, academia evaluates recommendation systems on precision and the like, while industry still looks at PV/UV conversion rates, the real benefits a system brings, and A/B testing is what evaluates those. I personally rather recommend this method, though, well, I am just getting started with it. The usual practice: first dry-run the system for a week, then set up an actual algorithm face-off, where each experimental user gets a 50% chance of seeing the original algorithm's recommendations and a 50% chance of seeing yours, and you compare the conversion rates day by day; along the way you can also tweak parameters and experiment. If the algorithm is stable and the comparison looks good, you are almost there.
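The 50/50 split is easy to make deterministic by hashing the user id, so that the same user always lands in the same bucket; a minimal sketch (my own illustration, not from any library):

class AbBucket {
    // Deterministic 50/50 assignment: the same user always sees the same variant.
    static boolean inNewAlgorithmBucket(long userId) {
        return Math.floorMod(Long.hashCode(userId), 2) == 0;
    }
}

Experimental users for whom inNewAlgorithmBucket returns true get your algorithm's recommendations, the rest get the original's, and you compare conversion day by day.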
Fighting spam:
As the saying goes, wherever there are rivers and lakes, wherever there are recommendations, there will be people brushing them. Attacks generally come in three types: average, random, and nuke. Generally speaking, average and random attacks are easier to deal with; as long as your system is reasonably robust, their impact is small. Nuke attacks, however, are very annoying, and there are generally two ways to cope: improve the robustness of the system, or use rules. Let's look at how the two approaches divide the problem (graph not reproduced here):
In effect, improving system robustness lowers the curve of an efficient attack, but honestly the effect is not great.
Rules are about early detection, killing the danger in the cradle; they cover the detectable (blue) part of the graph.
Fighting spam is a profound problem; as the saying goes, fighting with people is endless fun, and that is exactly what this means.
On the rules side, generally speaking, rules can be placed in the first stage, data collection and filtering: for example, checking at collection time whether many accounts behind a single IP are really one person, whether a batch of usernames or registration emails share a suspiciously similar pattern, or whether there are abnormal PV patterns. Sometimes we can also run rule checks backwards from the recommendation results: for example, if some item has been brushed so much that the recommendations show problems, we can iteratively use the prior proportion of brushing to produce a list of brushers, and so on. These are rules of thumb born of experience, nothing presentable; you can also explore your own. One concrete example appears in the sketch below.
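As one concrete example of such a rule, here is a sketch that flags an IP when too many accounts act from it (the threshold is a made-up number; tune it on your own data):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SharedIpRule {
    private static final int MAX_ACCOUNTS_PER_IP = 5; // hypothetical threshold

    // ip -> set of user ids seen acting from that ip
    private final Map<String, Set<Long>> accountsByIp = new HashMap<String, Set<Long>>();

    void record(String ip, long userId) {
        Set<Long> users = accountsByIp.get(ip);
        if (users == null) {
            users = new HashSet<Long>();
            accountsByIp.put(ip, users);
        }
        users.add(userId);
    }

    boolean suspicious(String ip) {
        Set<Long> users = accountsByIp.get(ip);
        return users != null && users.size() > MAX_ACCOUNTS_PER_IP;
    }
}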
The end:
All the rambling above roughly covers the steps involved in making a complete, simple recommendation system. Some places are not detailed enough: some because I am lazy, some because they are inconvenient to talk about. If you think something I said is wrong or inappropriate, you can flame me directly in the thread below, and guidance from the greats is also very welcome; let's exchange more experience. My email is [email protected] and my Douban is http://www.douban.com/people/45119625/; for any problem you can also reach me via Douban or email.
In fact, I think it would be even better if some great one said, "You silly noob, what a dog of a write-up; let me teach you how I do recommendation."
"Turn" to write a recommendation system