Reposted from: http://www.yeeach.com/2010/10/01/%E5%9F%BA%E4%BA%8Elucene%E5%AE%9E%E7%8E%B0%E8%87%AA%E5%B7%B1%E7%9A%84%E6%8E%A8%E8%8D%90%E5%BC%95%E6%93%8E/
Building a recommendation engine on data mining algorithms is the most common approach among major e-commerce websites and SNS communities. The main techniques are content-based recommendation and collaborative filtering (item-based and user-based), both of which were already described in the earlier posts "Introduction to E-Commerce Recommendation Systems V2.0" and "E-Commerce Recommendation Systems". In practice, however, most small and medium-sized enterprises find it very difficult to fully adopt these algorithms in their e-commerce systems.
1. Problems with common recommendation engine algorithms
1) There are few relatively mature, complete, and readily available open-source solutions.
Currently, open-source projects related to data mining and recommendation engines mainly fall into the following types:
Data mining: mainly Weka, R Project, KNIME, RapidMiner, Orange, etc.
Text mining: mainly OpenNLP, LingPipe, FreeLing, GATE, and Carrot2. For details, refer to LingPipe's list of competing products.
Recommendation engines: mainly Apache Mahout, Duine Framework, and Singular Value Decomposition (SVD). For other packages, see "Open Source Collaborative Filtering Written in Java".
Search engines: Lucene, Solr, Sphinx, Hibernate Search, etc.
2) Common recommendation engine algorithms are relatively complex and have a high entry threshold.
3) Common recommendation engine algorithms perform poorly and are not suited to mining massive data sets.
Apart from the relatively mature Lucene/Solr, most of these packages and algorithms are still used mainly in academic research and cannot be applied directly to large-scale Internet data mining and recommendation engines.
2. Advantages of using Lucene for a recommendation engine
Many small and medium-sized websites have limited development capabilities, so a solution that integrates search and recommendation would certainly be welcome. Using Lucene to implement the recommendation engine has the following advantages:
1) Lucene has a low entry threshold. Most websites already use Lucene for on-site search.
2) Compared with collaborative filtering algorithms, Lucene performs well.
3) Lucene has many ready-made solutions for text mining, similarity calculation, and related algorithms.
Among open-source projects, Mahout and the Duine Framework are relatively complete recommendation engine solutions. Mahout's core in particular makes use of Lucene, so its architecture is worth learning from. However, Mahout's feature set is not yet complete, and using it directly to build the recommendation engine of an e-commerce website is still not a very mature option. Even so, the Mahout implementation suggests that building a recommendation engine on Lucene is feasible.
3. Core issues to address when building a recommendation engine with Lucene
Lucene is good at text mining. Its contrib package provides the MoreLikeThis feature, which makes content-based recommendation easy to implement. However, Lucene has no good built-in solution for results that depend on users' collaborative filtering behavior (so-called relevance feedback). The user-behavior factor therefore has to be added to Lucene's content similarity algorithm, so that the results of collaborative filtering behavior are converted into a model that Lucene supports.
4. Recommendation engine data sources
Typical recommendation-related features on e-commerce websites:
- Customers who bought this product also bought
- Customers who viewed this product also viewed
- Browse more similar products
- People who like this product also like
- Average user rating for this item
Therefore, a Lucene-based recommendation engine mainly needs to process the following two types of data:
1) Content similarity
Examples: product name, author/translator/manufacturer, product category, description, comments, user tags, system tags
2) User collaborative behavior similarity
Examples: tagging, purchases, click stream, searches, recommendations, favorites, ratings, written comments, Q&A, time spent on a page, groups, etc.
5. Implementation scheme
5.1. Content similarity
This can be implemented with Lucene's MoreLikeThis.
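To make the idea behind this kind of content similarity concrete, here is a minimal sketch in plain Python. It is not Lucene's MoreLikeThis implementation or API: it only illustrates the general technique of picking the most distinctive terms of one document (by a simple TF-IDF weight) and scoring other documents by those terms. The function names, the weighting formula, and the sample product data are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive whitespace tokenizer (a real analyzer would do much more)."""
    return text.lower().split()


def interesting_terms(doc_id, docs, max_terms=5):
    """Return the highest-weighted terms of one document as {term: weight}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    tf = Counter(tokenize(docs[doc_id]))
    # Simple TF-IDF style weight (an assumption, not Lucene's exact formula).
    weights = {t: tf[t] * (1.0 + math.log(n_docs / df[t])) for t in tf}
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return dict(top)


def more_like_this(doc_id, docs, max_terms=5, top_k=3):
    """Rank the other documents by how strongly they contain the source terms."""
    terms = interesting_terms(doc_id, docs, max_terms)
    results = []
    for other_id, text in docs.items():
        if other_id == doc_id:
            continue
        tf = Counter(tokenize(text))
        score = sum(weight * tf[term] for term, weight in terms.items())
        if score > 0:
            results.append((other_id, score))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:top_k]


# Hypothetical product descriptions.
products = {
    "p1": "lucene in action search engine book",
    "p2": "solr search engine cookbook",
    "p3": "garden tools steel shovel",
    "p4": "lucene search book second edition",
}
similar = more_like_this("p1", products)
```

With this sample data, `similar` ranks `p4` ahead of `p2` and leaves out `p3`, which shares no terms with `p1`. In a real system the same role would be played by Lucene's index statistics and analyzers rather than this in-memory dictionary.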
5.2. Handling of user collaborative behavior
1) Index each user collaborative behavior in Lucene, with one index record per behavior.
2) Each index record contains the following important information:
Product name, product ID, product category, product description, tags, and other important feature values; feature elements of other products associated with the user behavior; product thumbnail address; collaborative behavior type (purchase, click, add to favorites, rating, etc.); and boost value (the weight of each collaborative behavior, applied through setBoost).
3) Collaborative behaviors such as ratings, favorites, and clicks are represented by the feature values of the product involved (tags, title, and summary information).
4) Set different boost values for different collaborative behavior types (such as purchase, rating, and click).
5) At search time, use Lucene's MoreLikeThis algorithm to convert user collaborative behavior into content similarity.
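The steps above can be sketched roughly as follows, again in plain Python rather than with Lucene itself. Each behavior is one record, each behavior type carries a weight, and the weighted product features are folded into a term profile that is then matched against other products. This is only one possible reading of the scheme: the boost values, record layout, and function names are hypothetical, and in an actual Lucene index the weights would be applied through boosts and the matching through MoreLikeThis.

```python
from collections import defaultdict

# Hypothetical weights per behavior type (step 4); tune for a real site.
BEHAVIOR_BOOST = {"purchase": 3.0, "rating": 2.0, "favorite": 1.5, "click": 1.0}

# Hypothetical product feature terms (tags, title words, and so on).
PRODUCT_TERMS = {
    "p1": {"lucene", "search", "book"},
    "p2": {"solr", "search", "book"},
    "p3": {"garden", "shovel"},
    "p4": {"hadoop", "book"},
}

# One record per behavior (steps 1 and 2): (user, product ID, behavior type).
BEHAVIORS = [
    ("alice", "p1", "purchase"),
    ("alice", "p3", "click"),
]


def build_profile(user, behaviors):
    """Fold a user's behaviors into a weighted term profile (step 3)."""
    profile = defaultdict(float)
    for who, product_id, kind in behaviors:
        if who != user:
            continue
        for term in PRODUCT_TERMS[product_id]:
            profile[term] += BEHAVIOR_BOOST[kind]
    return dict(profile)


def recommend(user, behaviors, top_k=3):
    """Score products the user has not interacted with against the profile (step 5)."""
    profile = build_profile(user, behaviors)
    seen = {pid for who, pid, _ in behaviors if who == user}
    scored = []
    for pid, terms in PRODUCT_TERMS.items():
        if pid in seen:
            continue
        score = sum(profile.get(term, 0.0) for term in terms)
        if score > 0:
            scored.append((pid, score))
    scored.sort(key=lambda r: (-r[1], r[0]))
    return scored[:top_k]


recommendations = recommend("alice", BEHAVIORS)
```

Here the purchase of `p1` outweighs the click on `p3`, so `p2` (which shares "search" and "book" with `p1`) is ranked above `p4` (which shares only "book"). Keeping the weights in a single table makes it easy to experiment with how much each behavior type should count.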
The above is the simplest way to implement a recommendation engine on Lucene. Improving the accuracy of the solution and refining it will be covered in detail later.
For a more detailed implementation, you can refer to Mahout's algorithm implementations for optimization ideas.