Implementing a recommendation engine with data-mining algorithms is the most common approach on e-commerce websites and SNS communities. The algorithms most often used are content-based recommendation and collaborative filtering (item-based and user-based), which were covered in the earlier article "E-commerce Recommendation System Entry v2.0", an introduction to e-commerce recommendation systems. In practice, however, most small and medium-sized enterprises find it very difficult to adopt these algorithms in their e-commerce systems.
1, Problems with the common recommendation engine algorithms
1), Mature, complete, off-the-shelf open source solutions are scarce
Roughly speaking, the open source projects related to data mining and recommendation engines fall into several types:
Data mining: mainly Weka, R-project, KNIME, RapidMiner, Orange, etc.
Text mining: mainly OpenNLP, LingPipe, FreeLing, GATE, etc.; for others, see LingPipe's competition list
Recommendation engines: mainly Apache Mahout, Duine framework, Singular Value Decomposition (SVD); for other packages, see Open Source Collaborative Written in Java
Search engines: Lucene, Solr, Sphinx, Hibernate Search, etc.
2), The commonly used recommendation engine algorithms are relatively complex, so the entry threshold is high
3), The common recommendation engine algorithms perform poorly and are not suitable for mining massive data
Apart from Lucene/Solr, which are relatively mature, most of these packages and algorithms are still used mainly for academic research and cannot be applied directly to large-scale Internet data mining and recommendation engines.
2, Advantages of using Lucene to implement a recommendation engine
Many small and medium-sized websites have limited development capacity, so a solution that integrates search and recommendation is bound to be popular with them. Using Lucene to implement the recommendation engine has the following advantages:
1), Lucene's entry threshold is low; most sites already use Lucene for site search
2), Compared with collaborative filtering algorithms, Lucene performs well
3), Lucene has many ready-made solutions for text mining, similarity calculation, and related algorithms
Among open source projects, Mahout and the Duine framework are relatively complete recommendation engine solutions. Mahout's core in particular makes use of Lucene, so its architecture is well worth studying. However, Mahout's functionality is not yet very complete, and it is not mature enough to serve directly as the recommendation engine of an e-commerce website. Still, the Mahout implementation shows that using Lucene to implement a recommendation engine is a feasible approach.
3, The core problem to solve when using Lucene to implement a recommendation engine
Lucene is good at text mining, and the MoreLikeThis feature in its contrib package makes content-based recommendation fairly easy to implement. However, Lucene currently has no good solution for results that involve users' collaborative filtering behavior (so-called relevance feedback). We need to add the users' collaborative filtering behavior to Lucene's content similarity algorithm, that is, convert the results of that behavior into a model that Lucene supports.
4, Recommendation engine data sources
The recommendations typically shown on e-commerce websites are:
Customers who bought this product also bought
Customers who viewed this product also viewed
More products similar to this one
Users who like this product also like
Average user rating of the product
Therefore, a Lucene-based recommendation engine mainly deals with the following two kinds of data:
1), Content similarity
For example: product name, author/translator/manufacturer, product category, description, comments, user tags, system tags
2), User collaborative behavior similarity
For example: tagging, purchases, click stream, searches, recommendations, favorites, ratings, written reviews, questions and answers, time spent on a page, groups, etc.
5, Implementation scheme
5.1, Content similarity
Based on Lucene's MoreLikeThis implementation.
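To make the idea concrete: MoreLikeThis works by picking the most distinctive terms of a source document and turning them into a query against the index. The following is a minimal sketch of that idea in plain Python, not Lucene's API. The function names, the TF-IDF formula, and the sample products are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer would do much more.
    return text.lower().split()


def interesting_terms(doc_id, docs, max_terms=5):
    """Pick the highest TF-IDF terms of one document (hypothetical helper)."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))  # document frequency per term
    tf = Counter(tokenize(docs[doc_id]))
    scored = {t: f * (math.log(n / df[t]) + 1.0) for t, f in tf.items()}
    # Sort by weight, then alphabetically so that ties are deterministic.
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_terms])


def more_like_this(doc_id, docs, max_terms=5):
    """Rank the other documents by the summed weight of shared terms."""
    terms = interesting_terms(doc_id, docs, max_terms)
    results = []
    for other_id, text in docs.items():
        if other_id == doc_id:
            continue
        tokens = set(tokenize(text))
        score = sum(w for t, w in terms.items() if t in tokens)
        if score > 0:
            results.append((other_id, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Sample catalogue, invented for this sketch.
PRODUCTS = {
    "p1": "lucene in action search engine book",
    "p2": "solr search engine cookbook",
    "p3": "stainless steel kitchen knife",
    "p4": "lucene search book second edition",
}

similar = more_like_this("p1", PRODUCTS)
```

With this sample data, `p4` ranks ahead of `p2` for `p1`, and the unrelated `p3` is not returned at all. In a real deployment the index, analysis, and term selection would be handled by Lucene itself.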
5.2, Handling users' collaborative behavior
1), Index each collaborative behavior of a user with Lucene, one record per behavior
2), Each index record contains the following important information:
Product name, product ID, product category, product description, tags and other important features; the other products involved in the user's related behavior; product thumbnail URL; collaborative behavior type (purchase, click, favorite, rating, etc.); boost value (the weight set with setBoost for the behavior at index time)
3), Characterize ratings, favorites, clicks and other collaborative behaviors by product features (tags, title, summary information)
4), Set a different setBoost value for each type of collaborative behavior (such as purchase, rating, click)
5), At search time, use Lucene's MoreLikeThis algorithm to transform the users' collaborative behavior into content similarity
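The five steps above can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, with dictionaries standing in for the Lucene index and a simple weighted overlap standing in for the MoreLikeThis query. The boost values, record fields, and sample products are hypothetical; the scheme only says that different behavior types get different boosts.

```python
from collections import defaultdict

# Hypothetical boost per behavior type (assumed values, not from the article).
BEHAVIOR_BOOST = {"purchase": 3.0, "rating": 2.0, "click": 1.0}

# Product features used to characterize a behavior (tags and title words).
PRODUCT_FEATURES = {
    "p1": {"lucene", "search", "book"},
    "p2": {"solr", "search", "book"},
    "p3": {"kitchen", "knife"},
    "p4": {"hadoop", "book"},
}


def make_record(user_id, product_id, behavior):
    """One record per behavior (step 1) with features and boost (steps 2-4)."""
    return {
        "user": user_id,
        "product": product_id,
        "behavior": behavior,
        "features": PRODUCT_FEATURES[product_id],
        "boost": BEHAVIOR_BOOST[behavior],
    }


def user_profile(records, user_id):
    """Sum boosts per feature: behavior expressed as weighted content terms."""
    profile = defaultdict(float)
    for rec in records:
        if rec["user"] == user_id:
            for feature in rec["features"]:
                profile[feature] += rec["boost"]
    return dict(profile)


def recommend(records, user_id):
    """Score unseen products by overlap with the weighted profile (step 5)."""
    profile = user_profile(records, user_id)
    seen = {r["product"] for r in records if r["user"] == user_id}
    scores = {}
    for pid, features in PRODUCT_FEATURES.items():
        if pid in seen:
            continue
        score = sum(profile.get(f, 0.0) for f in features)
        if score > 0:
            scores[pid] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Sample behavior log, invented for this sketch.
RECORDS = [
    make_record("u1", "p1", "purchase"),
    make_record("u1", "p3", "click"),
]

recommendations = recommend(RECORDS, "u1")
```

Here the purchase of `p1` contributes three times as much weight as the click on `p3`, so `p2` (which shares "search" and "book" with the purchased product) ranks above `p4`. Tuning the boost values is the main lever in this design, and they would need to be calibrated against real behavior data.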
The scheme above is only the simplest Lucene-based recommendation engine implementation; its accuracy and a more detailed design remain to be elaborated.
For a more detailed implementation, refer to Mahout's algorithm implementations for optimization.
Source: http://www.yeeach.com