E-commerce website search architecture solution


This e-commerce search architecture solution is essentially an application of Lucene.net. Even if your company is small and short-staffed, it is still worth a read. I had built some examples before, which is what made writing this architecture plan possible. Being a lazy person, I spent a lot of time searching the web for existing search architectures and reading blogs, and after about two weeks I had a rough draft. Now, on the last day of the Tomb-Sweeping holiday, I can only share that draft; I hope it was worth the wait. Being stuck at home is depressing, and my efficiency has been low.

A search solution based on Lucene

1. Lucene Introduction

Lucene is a top-level Apache open-source project: a full-text search engine implemented in Java. It supports full-text indexing and retrieval over many document formats, including Word and PDF (text only, not graphics).

Lucene.net is the C# version of Lucene, ported from the Java version and released by Apache as an open-source project. Its functionality is basically the same as the Java version's, but for lack of solid technical support and community activity it has been moved into the Apache incubator.

Lucene write path: source documents are processed by an analyzer (word segmentation, weight handling), turned into Document records, and written to the index, which lives on disk or in memory.

Lucene read path: search keywords are processed by the analyzer as well (word segmentation, weighting, range matching) and matched against the index. The source code structure is as follows:

 

The specific process is as follows:

 

The data flow diagram is as follows:

 
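The write and read paths above can be illustrated with a toy inverted index in Python. This is only a conceptual sketch; the analyzer, document handling, and matching here are simplified stand-ins for what Lucene actually does:

```python
from collections import defaultdict

def analyze(text):
    # Stand-in for Lucene's Analyzer: lowercase + whitespace tokenization.
    return text.lower().split()

class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}                     # doc id -> stored source document

    def write(self, doc_id, text):
        # Write path: analyze the source, record postings, store the doc.
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Read path: analyze the query the same way, intersect postings.
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

idx = ToyIndex()
idx.write(1, "red running shoes")
idx.write(2, "blue running jacket")
idx.write(3, "red rain jacket")
print(idx.search("red jacket"))   # -> [3]
```

The essential point is that queries are analyzed with the same pipeline as documents, so the segmented query terms match the segmented index terms.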

2. Problems with common recommendation engine algorithms

Implementing a recommendation engine with data-mining algorithms is the most common approach on e-commerce websites and SNS communities. Recommendation engines typically use content-based recommendation algorithms and collaborative filtering algorithms (item-based and user-based). In practice, however, it is still very difficult for most small and medium-sized enterprises to adopt these algorithms fully in an e-commerce system.

1) Relatively mature, complete, ready-to-use open-source solutions are scarce.

Currently, open-source projects related to data mining and recommendation engines fall mainly into the following categories:

Data mining: mainly WEKA, R (the R Project), KNIME, RapidMiner, Orange, etc.

Text mining: mainly OpenNLP, LingPipe, FreeLing, GATE, Carrot2, etc. For details, refer to LingPipe's list of competing tools.

Recommendation engines: mainly Apache Mahout, the Duine framework, and Singular Value Decomposition (SVD) implementations. For other packages, see "Open Source Collaborative Filtering Written in Java".

Search engines: Lucene, Solr, Sphinx, Hibernate Search, etc.

2) Common recommendation engine algorithms are relatively complex, with a high barrier to entry.

3) Common recommendation engine algorithms perform poorly and are not suitable for mining massive data sets.

Apart from the relatively mature Lucene/Solr, most of these packages and algorithms are still at the academic research stage and cannot be applied directly to Internet-scale data mining and recommendation engines.

(All of the above are Java-based and require you to research and implement them yourself, which is quite difficult.)

Note: besides category browsing and active search, the recommendation system is an important way for users to discover products. It helps users find similar and interesting products, increases product visits, converts visitors into buyers, and guides purchases. Its ultimate value is improving the shopping experience and user stickiness, and increasing order volume. At Amazon, for example, about 30% of orders come from the recommendation system.

Advantages of using Lucene to implement a recommendation engine

For many small and medium-sized websites with limited development capacity, a solution that integrates search and recommendation is certainly very attractive. Using Lucene to implement the recommendation engine has the following advantages:

1) Lucene has a low barrier to entry; most websites already use it for on-site search.

2) Compared with collaborative filtering algorithms, Lucene offers high performance.

3) Lucene has many ready-made solutions for text mining, similarity calculation, and related algorithms.

Among open-source projects, Mahout and the Duine framework offer relatively complete recommendation engine solutions. Mahout in particular builds its core on Lucene, so its architecture is worth learning from. However, Mahout's feature set is not yet complete, and using it directly to implement an e-commerce recommendation engine is still immature. Even so, the Mahout implementation shows that building a recommendation engine on Lucene is a feasible approach.

3. Core issues in implementing a recommendation engine with Lucene

Lucene is good at text mining. Its contrib package provides the MoreLikeThis feature, which makes content-based recommendation easy to implement. However, Lucene has no ready-made solution for results that involve users' collaborative filtering behavior (so-called relevance feedback). You need to add a collaborative-behavior factor to Lucene's content similarity scoring, that is, convert the results of user collaborative filtering into a model that Lucene supports.
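The MoreLikeThis idea can be approximated as follows: pick a source document's most significant terms by TF-IDF and rank other documents by how many of those terms they contain. This is a minimal Python sketch of the principle only, not Lucene's actual implementation; the tokenizer and the term count `k` are arbitrary choices:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def top_terms(doc, corpus, k=3):
    # Pick the doc's k highest tf-idf terms, as MoreLikeThis does conceptually.
    n = len(corpus)
    tf = Counter(tokenize(doc))
    def idf(term):
        df = sum(1 for d in corpus if term in tokenize(d))
        return math.log((n + 1) / (df + 1)) + 1
    ranked = sorted(tf.items(), key=lambda kv: kv[1] * idf(kv[0]), reverse=True)
    return [term for term, _ in ranked[:k]]

def more_like_this(doc, corpus, k=3):
    # Rank the other documents by overlap with the source doc's top terms.
    query = set(top_terms(doc, corpus, k))
    scored = []
    for i, other in enumerate(corpus):
        if other == doc:
            continue
        score = len(query & set(tokenize(other)))
        if score:
            scored.append((score, i))
    return [i for score, i in sorted(scored, reverse=True)]

docs = ["java lucene search engine",
        "lucene index search",
        "cooking pasta recipe"]
print(more_like_this(docs[0], docs))   # -> [1]
```

The open problem described above is precisely that this scoring knows nothing about user behavior; the behavior factor has to be folded in separately, for example as a per-record boost.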

4. Recommendation engine data sources

Typical e-commerce behaviors related to a recommendation engine:

Customers who have purchased this product have also bought

Customers who browse this product have also seen

Browse more similar products

People who like this product also like it.

Average user score for this item

Therefore, the recommendation engine based on Lucene mainly needs to process the following two types of data:

1) Content similarity

Examples: product name, author/translator/manufacturer, product category, description, comments, user tags, system tags.

2) User collaborative behavior similarity

Examples: tags, purchased products, click stream, searches, recommendations, favorites, ratings, reviews, Q&A, page dwell time, groups, etc.

5. Implementation Scheme

5.1 Content similarity: implemented with Lucene's MoreLikeThis.

5.2 Handling of user collaborative behavior:

1) Index each user collaborative behavior with Lucene, one record per behavior.

2) Each index record contains the following important information:

product name, product ID, product category, product description, tags, and other important feature values; the feature elements of other products associated with the behavior; the product thumbnail URL; the collaborative behavior type (purchase, click, favorite, rating, etc.); and a boost value (the weight of each collaborative behavior, set with setBoost).

3) Collaborative behaviors such as rating, favoriting, and clicking are represented by the product's feature values (tags, title, and summary information).

4) Different collaborative behavior types (purchase, rating, click, etc.) get different setBoost values.

5) At search time, Lucene's MoreLikeThis algorithm converts the collaborative behavior records into content similarity.
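Steps 1) through 5) can be sketched as follows: each behavior record carries the product's feature values and a boost for its behavior type, and recommendation reduces to boosted feature overlap. The boost values and all names here are illustrative assumptions, not taken from the article:

```python
from collections import defaultdict

# Illustrative per-behavior boosts (assumed values): a purchase should
# count more than a click, mirroring step 4's setBoost idea.
BOOSTS = {"purchase": 4.0, "favorite": 3.0, "rating": 2.0, "click": 1.0}

def index_behavior(index, product_id, features, behavior):
    # Steps 1-3: one record per behavior, carrying the product's feature
    # values (tags/title terms) plus a boost chosen by behavior type.
    index.append({"product": product_id,
                  "features": set(features),
                  "boost": BOOSTS[behavior]})

def recommend(index, features, exclude=None):
    # Step 5: "more like this" over behavior records: score each product by
    # boosted feature overlap, turning collaboration into content similarity.
    query = set(features)
    scores = defaultdict(float)
    for rec in index:
        if rec["product"] == exclude:
            continue
        scores[rec["product"]] += rec["boost"] * len(query & rec["features"])
    return sorted((p for p in scores if scores[p] > 0),
                  key=lambda p: -scores[p])

index = []
index_behavior(index, "p1", ["phone", "android", "5g"], "purchase")
index_behavior(index, "p2", ["phone", "case"], "click")
index_behavior(index, "p3", ["laptop", "gaming"], "purchase")
print(recommend(index, ["phone", "5g"]))   # -> ['p1', 'p2']
```

In the real scheme the overlap scoring would be Lucene's MoreLikeThis query and the boost would be stored on the indexed document via setBoost; this sketch only shows how the two factors combine.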

The above is the simplest implementation of a Lucene-based recommendation engine; the accuracy and refinement of the solution will be detailed later.

For a more detailed implementation, refer to the Mahout algorithm implementations for optimization ideas.

Another recommended open-source search engine tool: Sphinx, an open-source full-text search engine from Russia. A single index can contain up to 100 million records; with 10 million records, queries run in 0.x seconds (millisecond level). Sphinx's indexing speed: an index of 1 million records takes about 3 to 4 minutes to build, an index of 10 million records takes under 50 minutes, and rebuilding an incremental index containing only the latest 100,000 records takes just tens of seconds.

Sphinx is a free, open-source full-text search engine released under the GPL v2 license. It is designed to integrate well with scripting languages and SQL databases. Its built-in data sources currently support reading data directly from a connected MySQL or PostgreSQL database, or through the xmlpipe mechanism (an indexing channel based on a special XML format that Sphinx can recognize).

Sphinx is widely used in LAMP architectures; known commercial applications include Comsenz's Discuz! Enterprise Edition.

3. The "Mobile Phone Home" search solution (for reference)

Mobile Phone Home currently runs Lucene 2.4.1 with JDK 1.6 (64-bit) on machines with 8 CPUs and 32 GB of memory, indexing more than 33 million records; the raw data files exceed 14 GB, and the system serves over 350,000 queries per day with peak QPS above 20. The numbers alone may not look impressive, but index rebuilds and updates are both automated and can run simultaneously, and data is updated as quickly as possible without hurting service reliability (when the two goals conflict, availability wins and updates are delayed). The engineering workload is still considerable.

 

Slides: http://www.slideshare.net/tangfl/lucene-1752150

In large-scale applications, Lucene is better suited to narrow search than to data storage. A look at Lucene's source code shows that the storage efficiency of Document and Field is not great. The Mobile Phone Home team found this too; their solution is to use Lucene only for the index and to store the data in memcached + Berkeley DB (Java Edition). This has two advantages: it reduces Lucene's data size and improves program efficiency, and the storage layer can also provide some SQL-like query capability. Lucene itself seems to have noticed the problem: it added a DB option to its store, which in fact uses Berkeley DB.

In large-scale applications, caching is very important. As the slides mention, you can run several warm-up searches before the program starts serving traffic, to fill the Searcher cache. In our (Ginkgo search) experience, an application-level document cache also greatly improves performance. Lucene seems to have noticed this too: version 2.4 provides a cache, including an LRU cache implementation. In our tests, however, that cache can break through its size limit under extreme conditions and eventually eat up all memory, and many LRU cache implementations found online have the same problem in extreme cases. In the end I wrote my own LRU cache and revised it several times; it is now stable.
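A size-bounded LRU cache of the kind described above can be sketched with Python's OrderedDict; the point is that the eviction loop runs on every insert, so the size limit cannot be broken. This is a minimal single-threaded sketch, not the author's actual (Java, concurrent) implementation:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion order = recency order

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)   # hard bound: evict LRU entries

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # "a" becomes most recently used
cache.put("c", 3)        # evicts "b", the least recently used
print(list(cache.data))  # -> ['a', 'c']
```

A production version would also need synchronization, since multiple searcher threads hit the cache concurrently; that is exactly where the buggy implementations mentioned above tend to fail.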

When writing a Java service program, make it a habit to register an exit hook (Runtime.getRuntime().addShutdownHook()). Many Java programmers lack this awareness, or only write a finalize method, so when the program exits unexpectedly, external resources can be left in an unstable state. Take Lucene as an example: older versions of IndexWriter defaulted to autoCommit=true, committing after every record. The advantage is that if indexing is interrupted, the records added so far are still usable; the disadvantage is that indexing is very slow. In newer versions autoCommit defaults to false, which is much faster (about 8x in my tests), but if the process exits abnormally midway, all the work is lost. If you register an exit hook that catches the exit signal and automatically calls writer.close(), the problem is avoided.
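The advice above is Java-specific (a shutdown hook calling writer.close()). The same pattern can be sketched in Python with atexit and a stand-in writer class; FakeIndexWriter is a made-up illustration of autoCommit=false behavior, not Lucene's API:

```python
import atexit

class FakeIndexWriter:
    # Stand-in for an IndexWriter with autoCommit=false: nothing is
    # durable until close() runs.
    def __init__(self):
        self.pending, self.committed, self.closed = [], [], False

    def add_document(self, doc):
        self.pending.append(doc)

    def close(self):
        if not self.closed:          # close() commits pending work exactly once
            self.committed.extend(self.pending)
            self.pending, self.closed = [], True

writer = FakeIndexWriter()
atexit.register(writer.close)   # exit hook: also runs at interpreter exit
writer.add_document({"id": 1})
writer.add_document({"id": 2})
writer.close()                  # explicit close is fine; the hook is then a no-op
print(len(writer.committed))    # -> 2
```

Note that, like Java shutdown hooks, atexit handlers run on normal termination but not on a hard kill (e.g. SIGKILL), so they reduce, rather than eliminate, the risk of losing uncommitted work.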

Currently, Lucene remains compatible with JDK 1.4, and its binary release is compiled with that JDK. If you have high performance requirements, download the Lucene source code and compile the jar with the latest JDK; in my tests this improved speed by about 30%.

4. XX net search solution

4.1 Preliminary solution:

This stage implements word-segmentation search, keyword suggestions, and simple sorting for on-site products, with automatic scheduled index updates and index read/write splitting.

Database-driven search puts heavy pressure on the server, and the user search experience is lacking. The initial solution aims to relieve the search pressure on the server, deliver basic word-segmentation search, and maintain the indexes automatically on a schedule.

4.1.1 database product table analysis:

• Large base table

• Product category extension base table

• Brand base table and brand series base table

• Product base table (master table)

• Color base table

The product base tables contain about 80,000 records and occupy about 40 MB of space; the data volume of a single table is relatively small.

4.1.2 Lucene indexing program:

The Lucene indexing program reads data from the database and writes it into a custom Lucene index file. This index file is not searched directly; it is swapped in to replace the search index. Word segmentation is performed at indexing time using the Pan Gu segmentation component developed by eaglet (it is already open source under the Apache license; extending its functionality requires secondary development).

4.1.3 Lucene index Library:

The base-table index file is only megabytes in size. The index is split into a write library and a search library: data is first written to the write library and then merged into the search library. To guard against indexing-program errors or index-file corruption when a new index overwrites the old one, the search program can, under program control, read from and write to the index at the same time.

• Search processing service: searches against the product library by brand, category, price range, and so on. The search program depends on an interface so that database-based search and file-based (index) search can be switched at any time as needed. Queries must be run through the word-segmentation component and executed on the segmented terms; the database side does not do word segmentation for now.

• Query processing: the front end uses MVC to display products with segmented-keyword highlighting and supports querying by category, by brand, and by price range.
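The write-index/search-index split of 4.1.3 can be sketched as a double-buffer swap: rebuilds happen in a write copy, and the search side switches over only after the build succeeds, so a failed build leaves the old search index serving. All names here are illustrative:

```python
class IndexStore:
    # Double-buffered index: 'search' serves queries, 'write' absorbs rebuilds.
    def __init__(self):
        self.search = {}   # term -> doc ids currently served to users
        self.write = {}

    def rebuild(self, rows, fail=False):
        # Build into the write index only; the search index stays untouched.
        self.write = {}
        for doc_id, text in rows:
            for term in text.lower().split():
                self.write.setdefault(term, set()).add(doc_id)
        if fail:
            raise RuntimeError("simulated index build error")
        self.swap()

    def swap(self):
        # The merge step reduced to a pointer swap: promote write to search.
        self.search, self.write = self.write, {}

store = IndexStore()
store.rebuild([(1, "red shoes"), (2, "blue shoes")])
try:
    store.rebuild([(3, "corrupt data")], fail=True)
except RuntimeError:
    pass   # old search index keeps serving
print(sorted(store.search["shoes"]))   # -> [1, 2]
```

With on-disk Lucene indexes the swap would be a directory switch rather than a pointer swap, but the safety argument is the same: the search copy is only replaced after a complete, successful build.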

 

4.2 Step 2: keyword statistics

Collect search keywords and process them jointly with search to implement a simple search recommendation (suggestion) feature. This step mainly performs statistical analysis of the keywords searched on the front end and links them to search ranking. Once the preliminary keyword processing and the scheme for associating keywords with the master-table index are complete, the solution is complete.

 

4.3 Step 3: optimization and improvement

This step adds message-driven automatic incremental updates, weight calculation, recommended-product computation, and real-time search over the index files. Weight calculation requires developing our own vector-based scoring engine.

Real-time search is still under study.

4.3.1 Weight Calculation

Weight calculation links the statistics collected from front-end users with the product database to produce a weight-based ranking method for the site's products. The algorithm flowchart below is only an initial idea.
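Since the article gives only a flowchart idea, here is one plausible sketch of such a weight calculation: a linear combination of per-product statistics. Both the formula and the coefficients are my assumptions, not the article's:

```python
def product_weight(stats, w_click=0.2, w_sale=0.5, w_search=0.3):
    # Hypothetical linear weight: coefficients are illustrative only.
    return (w_click * stats.get("clicks", 0)
            + w_sale * stats.get("sales", 0)
            + w_search * stats.get("search_hits", 0))

def rank(products):
    # Sort product ids by descending computed weight.
    return sorted(products, key=lambda p: -product_weight(products[p]))

products = {
    "p1": {"clicks": 100, "sales": 10, "search_hits": 50},
    "p2": {"clicks": 300, "sales": 2, "search_hits": 10},
    "p3": {"clicks": 10, "sales": 40, "search_hits": 5},
}
print(rank(products))   # -> ['p2', 'p1', 'p3']
```

In practice the weights would be tuned (or learned) and the computed score stored per product so the search index can sort by it directly.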

Weight Calculation Design

 


 

4.3.2 Automatic Index Update

Establish an index update and maintenance mechanism based on message queues.

Message Queue-based index Generation Program

 
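The message-based incremental update of 4.3.2 can be sketched as a consumer that drains an update queue and applies upserts and deletes to the index. The message format and queue here are illustrative assumptions:

```python
from queue import Queue

def apply_update(index, msg):
    # Consume one message: upsert or delete a product document in the index.
    if msg["op"] == "upsert":
        index[msg["id"]] = msg["doc"]
    elif msg["op"] == "delete":
        index.pop(msg["id"], None)

def drain(queue, index):
    # Incremental update loop: the indexer drains whatever messages the
    # product system has published since the last run.
    while not queue.empty():
        apply_update(index, queue.get())

q = Queue()
q.put({"op": "upsert", "id": "p1", "doc": "red shoes"})
q.put({"op": "upsert", "id": "p2", "doc": "blue jacket"})
q.put({"op": "delete", "id": "p1"})

index = {}
drain(q, index)
print(sorted(index))   # -> ['p2']
```

The benefit over scheduled full rebuilds is that the product system decides when updates happen (by publishing messages) while the indexer stays decoupled and only applies deltas.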
