Recommendation systems were first applied on Amazon's website: based on users' past buying behavior, the site recommends other products that can be bought together with the product being purchased. In China, Dangdang does this well; when I buy books there, it can always recommend other books that I am interested in. It is a technology that greatly promotes sales.
A typical collaborative filtering algorithm first collects users' scores for items (products). There are two kinds of score. One is an explicit score, where the user rates a book or a song directly. The other is an implicit score: in an e-commerce system, for example, a purchase counts as 2 points, browsing as 1 point, and anything else as 0 points. I favor implicit scores, because explicit scores require a high degree of user participation. Many websites put a rating control on the content page and ask you to pick a value from 1 to 5. I may like the article, but how would I know exactly how much I like it? I would have to think about it. A very important principle of website design is "Don't make me think!", so I may or may not bother to give a score. Implicit scores are different: you only buy the books you like, and you only listen repeatedly to the songs you like.
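The implicit scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the action names ("purchase", "browse") and the rule of keeping the highest score per user and item are assumptions, not something the scheme prescribes.

```python
# Weights follow the text: purchase = 2, browse = 1, anything else = 0.
# The action names themselves are hypothetical.
IMPLICIT_WEIGHTS = {"purchase": 2, "browse": 1}


def implicit_scores(events):
    """Turn (user, item, action) events into {user: {item: score}}.

    Assumption: when a user has several events for the same item,
    the highest score wins.
    """
    scores = {}
    for user, item, action in events:
        weight = IMPLICIT_WEIGHTS.get(action, 0)
        user_scores = scores.setdefault(user, {})
        user_scores[item] = max(user_scores.get(item, 0), weight)
    return scores


events = [
    ("alice", "book1", "browse"),
    ("alice", "book1", "purchase"),
    ("bob", "book1", "browse"),
    ("bob", "book2", "search"),
]
scores = implicit_scores(events)
print(scores)
```

No rating button is involved: the scores fall out of behavior the site already logs.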
After collecting users' scores, you can use nearest-neighbor search to find other items or people with similar features or interests. The similarity measures generally used for nearest-neighbor search are the Pearson correlation coefficient, cosine-based similarity, and adjusted cosine similarity. The application of the law of cosines in data mining has been introduced on the Google China Blog (Google 黑板报); see part 12 of the "Beauty of Mathematics" series, "The law of cosines and news classification".
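For reference, the three similarity measures can be sketched as follows. This is a minimal illustration, assuming the two score vectors have equal length and are aligned position by position (the same user or item at each index); the function names are chosen here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity after centering each vector on its own mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def adjusted_cosine_similarity(item_a, item_b, user_means):
    """Adjusted cosine between two items.

    item_a[i] and item_b[i] are user i's scores for the two items, and
    user_means[i] is that user's average score over everything they rated.
    """
    return cosine_similarity(
        [x - m for x, m in zip(item_a, user_means)],
        [y - m for y, m in zip(item_b, user_means)],
    )


print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(pearson_correlation([1, 2, 3], [3, 2, 1]))
```

The adjusted variant subtracts each user's own average, which compensates for users who habitually score high or low.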
The rest of the work is to make recommendations based on the nearest neighbor set.
Computing the nearest-neighbor set is relatively costly, especially when there is a large amount of data. Today I will share a simple and efficient collaborative filtering algorithm: Slope One.
Basic Principles
User | Score for item A | Score for item B
X | 3 | 4
Y | 2 | 4
Z | 4 | ?
What score is user Z likely to give item B? There is a saying in the stock market that the average smooths out all abnormal fluctuations, which is why so many technical indicators for stocks are moving-average lines or bar charts over different time periods. Similarly, the Slope One algorithm assumes that the average can stand in for the scoring difference between two items for an unknown individual. The average difference between the scores for item A and item B is ((3 - 4) + (2 - 4)) / 2 = -1.5; that is, people generally score item B 1.5 points higher than item A. Slope One therefore predicts that Z will score item B at 4 + 1.5 = 5.5.
Simple, isn't it?
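The calculation for users X, Y, and Z can be reproduced with a short Python sketch. The dictionary layout and the function name are illustrative assumptions; only the numbers come from the table.

```python
def average_difference(ratings, item_a, item_b):
    """Mean of (score for item_a - score for item_b) over users who scored both."""
    diffs = [
        user_scores[item_a] - user_scores[item_b]
        for user_scores in ratings.values()
        if item_a in user_scores and item_b in user_scores
    ]
    if not diffs:
        return None  # no user has scored both items
    return sum(diffs) / len(diffs)


# Scores from the table; Z has not scored item B yet.
ratings = {
    "X": {"A": 3, "B": 4},
    "Y": {"A": 2, "B": 4},
    "Z": {"A": 4},
}

dev = average_difference(ratings, "A", "B")   # ((3 - 4) + (2 - 4)) / 2
prediction = ratings["Z"]["A"] - dev          # 4 - (-1.5)
print(dev, prediction)
```

Z is skipped when the average is computed because Z has no score for item B.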
Weighted Algorithm
Suppose n people have scored both item A and item B, and R(A->B) denotes the average difference (A - B) between those n people's scores for A and B. Suppose m people have scored both item B and item C, and R(C->B) denotes the average difference (C - B) between those m people's scores for C and B. Note that these are mean differences, not squared differences. Now, if a user scores A as rA and C as rC, that user's likely score for B is:
rB = (n * (rA - R(A->B)) + m * (rC - R(C->B))) / (m + n)
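The weighted formula generalizes to any number of items the user has already scored. Below is a minimal Python sketch under that reading. The sample data (users u1 to u3 and their scores) is made up for illustration, and the function name is hypothetical.

```python
def predict_weighted(ratings, user, target):
    """Weighted Slope One prediction of `user`'s score for `target`.

    For every item the user has scored, take the mean difference
    (item - target) over the people who scored both, and weight the
    resulting estimate by the number of such people.
    """
    numerator = 0.0
    denominator = 0
    for item, score in ratings[user].items():
        if item == target:
            continue
        diffs = [
            r[item] - r[target]
            for r in ratings.values()
            if item in r and target in r
        ]
        if not diffs:
            continue  # nobody has scored both items
        count = len(diffs)                 # n or m in the formula
        mean_diff = sum(diffs) / count     # R(item -> target)
        numerator += count * (score - mean_diff)
        denominator += count
    if denominator == 0:
        return None
    return numerator / denominator


# Made-up data: two people scored A and B (n = 2), one scored B and C (m = 1).
ratings = {
    "u1": {"A": 3, "B": 4},
    "u2": {"A": 2, "B": 4},
    "u3": {"B": 2, "C": 5},
    "Z": {"A": 4, "C": 3},
}
predicted_b = predict_weighted(ratings, "Z", "B")
print(predicted_b)
```

With this data R(A->B) = -1.5 and R(C->B) = 3, so the prediction is (2 * (4 + 1.5) + 1 * (3 - 3)) / 3 = 11 / 3. The estimate backed by more co-ratings carries more weight.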
Open-source Slope One packages
- Python
http://www.serpentine.com/blog/2006/12/12/collaborative-filtering-made-easy/
- Java
http://taste.sourceforge.net/
http://www.daniel-lemire.com/fr/documents/publications/SlopeOne.java
http://www.nongnu.org/cofi/
- PHP
http://sourceforge.net/projects/vogoo
http://www.drupal.org/project/cre
http://www.daniel-lemire.com/fr/documents/publications/webpaper.txt (a Slope One implementation written by the algorithm's author; simple and clear, strongly recommended)
- Erlang
http://chlorophil.blogspot.com/2007/06/collaborative-filtering-weighted-slope.html
- C#
http://www.cnblogs.com/kuber/articles/SlopeOne_CSharp.html (a C# version written by a Chinese developer)
- T-SQL
http://blog.charliezhu.com/2008/07/21/implementing-slope-one-in-t-sql/
See http://en.wikipedia.org/wiki/slope_one for versions in other languages. A Slope One implementation for PHP and MySQL will be available at http://code.google.com/p/openslopeone/ and is optimized mainly for massive data and distributed processing. Currently, on my laptop (with GB memory and GB memory), a single-threaded run over 4.4 million rating records finishes in 3 hours and 47 minutes, which is quite good. I have been too busy at work recently; I will release the source code at that address, and in a few days I will write a detailed introduction to my algorithm. I welcome your criticism and corrections, so that we can learn and make progress together.