Good Search Engine Practice (Algorithm Article)

Source: Internet
Author: User

## 1. Overall architecture of the search algorithm

In the previous article (the engineering article), we introduced the basic framework of our search engine. It consists of three main components: first, a Hadoop cluster, used to build the large-scale offline index and support real-time indexing; second, an Elasticsearch cluster, which provides distributed search; third, an advanced search (AS) cluster, which provides the special features needed for commercial search.
Because commercial e-commerce search has its own particularities, a stand-alone Elasticsearch cluster cannot satisfy the diverse algorithmic requirements, so we attach algorithm plug-ins to each part of the search stack to build the algorithm system of a commercial e-commerce search engine.

## 1.1 The indexing process

The indexing process builds the index from the original data. During this process we analyze each product (doc), compute its static score, and compute pairwise product similarity. The static score plays an important role in search quality; it is the equivalent of PageRank in web search, and it is easy to imagine how poor web search would be without PageRank. In e-commerce search the most common problem is that there are too many similar products, so pairwise product similarity must be pre-computed during indexing in order to deduplicate effectively at retrieval time.
The process for creating an index is as follows:

Step 1. Compute the static score of each doc.
Step 2. Compute the pairwise similarity of docs.
Step 3. Shard the data according to similarity and other information.
Step 4. Build the ES index.

## 1.2 The retrieval process

The retrieval process is the process by which the search engine receives a user's query, performs a series of processing steps, and returns the relevant results. A commercial search engine needs to consider two factors during retrieval: 1) relevance and 2) importance.
Relevance refers to whether the returned results are related to the input query; it is one of the fundamental problems of search engines. The current mainstream algorithms are BM25 and the vector space model. Elasticsearch supports both, and commercial search engines generally use BM25. The BM25 algorithm computes the relevance of each doc to the query, which we denote Dscore.
Importance refers to the degree to which a product can be trusted. We should return the most trustworthy products to consumers rather than leaving them to judge for themselves. Especially in a fully competitive e-commerce marketplace, we must give each product a reasonable importance score in order to guarantee the quality of search results. The importance score, also called the static score, is denoted Tscore.
Search engines are ultimately sorted by:
Score = Dscore * Tscore
That is, the combination of the dynamic score and the static score returns products that are both relevant to the user and important.
The retrieval process can be roughly abstracted into the following steps:

Step 1. Perform query analysis on the original query.
Step 2. In AS, rewrite the query according to the query-analysis results.
Step 3. Use the rewritten query to retrieve from ES.
Step 4. During the ES query, rank by a combination of static and dynamic scores.
Step 5. In AS, re-rank the results returned by ES.
Step 6. Return the results.


The following chapters describe several key technologies.
## 2. Static scoring of products

In an e-commerce search engine, the static score of a product plays the same role as PageRank in web search: both measure the importance of a doc independently of any query. PageRank is computed from the voting relationships between docs, while a product's static score relies on more dimensions. Like PageRank, the computation of a product's static score needs to solve the following two problems: 1. Stability. PageRank guarantees that a website cannot increase its rank linearly by simple link stuffing; similarly, the static score must not allow a merchant to increase a product's score linearly by inflating a single indicator (think of the effect fake orders would have on search quality). 2. Discrimination. On top of stability, static scores must be spread out enough to guarantee that, under the same search conditions, higher-quality products rank ahead of lower-quality ones.
We assume that the static score of a product has three deciding factors: 1. order count, 2. positive-review rate, 3. delivery speed.
We denote the static score Tscore, which can be written as follows:
Tscore = a * f(order count) + b * g(positive-review rate) + c * h(delivery speed)
a, b, c are weight parameters that balance the degree of influence of each indicator. f, g, h are representation functions used to convert the raw indicators into reasonable measures.
First, we need to find reasonable representation functions.
1. First take the log of each indicator. The derivative of log is a decreasing function, which means it costs more and more to earn a higher score.

2. Standardization. The purpose of standardization is to make the individual measures comparable within the same interval. For example, order counts may range over 0~10000 while the positive-review rate ranges over 0~1. This difference in scale affects the results and convenience of data analysis; to eliminate the effect of differing scales between indicators, the data needs to be standardized so that the metrics become comparable. The most common method is z-score standardization.
The z-score standardization method: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
Probability theory tells us that for normally distributed data, the range within 3 standard deviations of the mean covers about 99.7% of the data. Empirically, we clip any score greater than 5 or less than -5 to 5 or -5 respectively.
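A minimal single-machine sketch of the normalization just described (log transform, then z-score, then clipping at ±5). The function and parameter names are illustrative, not the author's actual code:

```python
import math

def log_zscore(values, clip=5.0):
    """Log-transform raw indicator values, then z-score normalize.

    Scores beyond +/- clip standard deviations are clipped, as the text
    suggests. Names and details here are illustrative assumptions.
    """
    logged = [math.log(v + 1) for v in values]      # +1 guards against log(0)
    mean = sum(logged) / len(logged)
    var = sum((x - mean) ** 2 for x in logged) / len(logged)
    std = math.sqrt(var) or 1.0                     # avoid division by zero
    return [max(-clip, min(clip, (x - mean) / std)) for x in logged]

# Example: order counts spanning several orders of magnitude
print(log_zscore([3, 50, 200, 10000]))
```

Because log slows the growth rate, the huge outlier no longer dominates; the z-score then puts the indicator on a comparable scale with, say, the positive-review rate.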
In particular, we do not recommend the min-max standardization method. This method, also called deviation normalization, applies a linear transformation to the original data so that the results are mapped into [0, 1]. The conversion function is as follows:

x' = (x - min) / (max - min)

This method is very unstable. Suppose one outlier is 1000 times the second-largest value; this squeezes most of the values into 0~0.01 and defeats the purpose of normalization.
Figure 1 shows the data distribution after min-max normalization: most of the data is obviously "squashed" into a very small range. Figure 2 shows the distribution after log normalization: because log slows the growth rate, the result is much better. Figure 3 applies z-score on top of log, and as you can see, the z-score makes the data very smooth.

(Figure 1: min-max normalization) (Figure 2: log normalization)
(Figure 3: log-zscore normalization)
Finally, after log-zscore normalization, the representation functions f, g, h are basically determined: Tscore = a*f(order count) + b*g(positive-review rate) + c*h(delivery speed). The next step is to determine the parameters a, b, c. There are generally two methods:
a) The expert method: dynamically adjust the weight parameters according to our daily experience. b) The experimental method: first assign initial values with the help of experts, then adjust the parameters dynamically, changing a single variable at a time, according to A/B-test results.

## 3. Product title deduplication
Product-title deduplication plays an important role in e-commerce search. According to our data, 80% of purchases made through search come from the first 4 pages of results. Duplicated product titles leave these important pages without real content and greatly reduce the purchase rate of search.
As an example:
Title1: delicious / banana / mail / Guangdong / Gaozhou / banana / no / ripening agent

Title2: delicious / banana / Guangdong / Gaozhou / banana / non / pink banana / mail
First, featurize the titles.
Here we use the "bag of words" technique: the vocabulary serves as the dimensions of the vector space, and the term frequency of each word in the title is the feature value. In this example, the vocabulary dimensions are: delicious (0), banana (1), mail (2), Guangdong (3), Gaozhou (4), banana (5), no (6), ripening agent (7), non (8), pink banana (9).

Positions: 0,1,2,3,4,5,6,7,8,9
title1:   1,2,1,1,1,1,1,1,0,0
title2:   1,2,1,1,1,0,0,0,1,1
Each title is represented by a fixed-length vector.
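As a minimal sketch of the bag-of-words vectorization and the 1 - cosine distance described here (the token lists are simplified stand-ins for the real segmented titles):

```python
import math
from collections import Counter

def bow_vector(tokens, vocab):
    """Term-frequency vector over a fixed vocabulary (bag of words)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine_distance(a, b):
    """1 - cosine(a, b), the vector distance used for title similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Illustrative tokenized titles (the exact segmentation is assumed)
t1 = ["delicious", "banana", "mail", "Guangdong", "Gaozhou", "banana", "no", "ripening-agent"]
t2 = ["delicious", "banana", "Guangdong", "Gaozhou", "banana", "non", "pink-banana", "mail"]
vocab = sorted(set(t1) | set(t2))
v1, v2 = bow_vector(t1, vocab), bow_vector(t2, vocab)
print(round(cosine_distance(v1, v2), 3))
```

Identical titles get distance 0, completely disjoint titles get distance 1, and near-duplicates like these two land close to 0.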
Next, compute pairwise similarity.
Similarity is generally measured by the distance between two vectors; here we use 1 - cosine(x, y) as the distance. This is an "all-pairs similarity" problem, which requires pairwise comparison at O(n^2) complexity. When the number of products is huge, a single machine can hardly handle it. We present two ways to implement all-pairs similarity.
Method 1: Spark matrix operations.
```
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rdd_rows = sc.parallelize(["1 0 2 0 0 1", "0 0 4 2 0 0"]) \
             .map(lambda line: Vectors.dense([float(x) for x in line.split(" ")]))
mat = RowMatrix(rdd_rows)
sims = mat.columnSimilarities()
```
Method 2: a linear MapReduce method, briefly described as follows.
First, compute the term-to-doc mapping, in the same way as building an inverted index.
For example, given 3 docs:
```
DOC1 = I love Beijing
DOC2 = I Beijing Tiananmen
DOC3 = I Tiananmen
```
Convert to the inverted format; this requires one MapReduce pass:
```
I         -> DOC1, DOC2, DOC3
Love      -> DOC1
Beijing   -> DOC1, DOC2
Tiananmen -> DOC2, DOC3
```
Then, filter out entries whose value contains only one doc; for entries whose value contains 2 or more docs, emit all pairwise combinations:
```
DOC1,DOC2 <- from: I         -> DOC1, DOC2, DOC3
DOC1,DOC3 <- from: I         -> DOC1, DOC2, DOC3
DOC2,DOC3 <- from: I         -> DOC1, DOC2, DOC3
DOC1,DOC2 <- from: Beijing   -> DOC1, DOC2
DOC2,DOC3 <- from: Tiananmen -> DOC2, DOC3
```

Finally, aggregate the output: the value for each pair is the repetition count divided by the square root of the product of the two doc lengths.
```
DOC1,DOC2: 2/(len(DOC1)*len(DOC2))^(1/2) = 2/sqrt(9) ≈ 0.67
DOC1,DOC3: 1/(len(DOC1)*len(DOC3))^(1/2) = 1/sqrt(6) ≈ 0.41
DOC2,DOC3: 2/(len(DOC2)*len(DOC3))^(1/2) = 2/sqrt(6) ≈ 0.82
```
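The linear method can be sketched on a single machine as follows (names are illustrative; the real implementation runs as MapReduce jobs over the full corpus):

```python
import math
from collections import defaultdict
from itertools import combinations

def all_pair_similarity(docs):
    """Single-machine sketch of the inverted-index all-pairs method.

    Builds an inverted index, emits doc pairs per term, then scores each
    pair as shared-term count / sqrt(len(d1) * len(d2)).
    """
    inverted = defaultdict(list)                 # term -> list of doc ids
    for doc_id, terms in docs.items():
        for term in set(terms):
            inverted[term].append(doc_id)
    shared = defaultdict(int)                    # (d1, d2) -> shared-term count
    for doc_ids in inverted.values():
        if len(doc_ids) < 2:                     # single-doc terms add no pairs
            continue
        for d1, d2 in combinations(sorted(doc_ids), 2):
            shared[(d1, d2)] += 1
    return {(d1, d2): n / math.sqrt(len(docs[d1]) * len(docs[d2]))
            for (d1, d2), n in shared.items()}

docs = {"DOC1": ["I", "love", "Beijing"],
        "DOC2": ["I", "Beijing", "Tiananmen"],
        "DOC3": ["I", "Tiananmen"]}
print(all_pair_similarity(docs))
```

Because only docs that actually share a term are ever paired, the work is proportional to the posting-list sizes rather than n^2 over all docs.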
For two titles Title1 and Title2, if similarity(Title1, Title2) > 0.7, we consider them similar. For two similar docs, the one with the larger static score is defined as the main doc, and the one with the smaller static score as the auxiliary doc. Main docs and auxiliary docs are indexed separately.


Unlike web search (which simply removes the auxiliary docs), we build separate indexes for main docs and auxiliary docs. Each search queries the main and auxiliary libraries in proportion and merges the results. This guarantees diversity in the results.

## 4. Shop deduplication

Shop deduplication is a little different from product-title deduplication. Given the needs of the specific e-commerce scenario, we do not want a single large shop to dominate the search results, which would trigger a strong Matthew effect. Shop deduplication cannot use the method above, because that method rests on textual similarity: it trades results off on the premise that they are near-duplicates. Shop deduplication has no such property.
Imagine splitting the same shop's products into a main library and an auxiliary library according to whether the shop is the same, as shown in the figure.

A and B represent different stores.
When searching for "banana" this does control the number of results from one shop, but when searching for "pear", shop A's pear always ranks ahead of shop B's pear (assuming A:pear has a higher static score than B:pear).
In fact, the shop-deduplication effect is easy to achieve with bucketed search. Suppose we return 20 results per page and divide the index into 4 buckets, assigning each product to a bucket by its shop ID modulo the number of buckets. This guarantees that the same shop's products land in only one bucket.

The retrieval process spreads the search evenly, 25% of the task per bucket, and merges each bucket's results into one page by static score. This preserves the relative order of results within the same bucket and achieves the goal of shop deduplication.
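The bucketing-and-merging scheme can be sketched like this (the bucket count comes from the text; the item format, hash choice, and scoring are illustrative assumptions):

```python
import zlib

N_BUCKETS = 4          # the text splits the index into 4 buckets
PAGE_SIZE = 20

def bucket_of(shop_id):
    """All items of one shop map to the same bucket: hash(shop_id) mod N."""
    return zlib.crc32(shop_id.encode()) % N_BUCKETS

def bucketed_search(items, page_size=PAGE_SIZE):
    """Take page_size / N_BUCKETS top results per bucket, merge by static score.

    `items` is a list of (shop_id, static_score) tuples; this is a sketch,
    not the production retrieval path.
    """
    buckets = {b: [] for b in range(N_BUCKETS)}
    for shop_id, score in items:
        buckets[bucket_of(shop_id)].append((shop_id, score))
    per_bucket = page_size // N_BUCKETS
    merged = []
    for b in range(N_BUCKETS):
        merged.extend(sorted(buckets[b], key=lambda it: -it[1])[:per_bucket])
    # Merge the buckets' top results into one page ordered by static score
    return sorted(merged, key=lambda it: -it[1])
```

Since one shop's items all fall into a single bucket, at most page_size / N_BUCKETS results from any one shop can appear on a page.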
As shown in the figure, when searching for "banana", although shop A has 10 products that satisfy the query, at most 5 of its results can be displayed per page.
## 5. Query analysis and query-rewriting technology
The sections above describe several indexing techniques; the retrieval process also involves many key technologies, among the most famous of which is query analysis. The query-analysis techniques we use include core-term recognition, synonym expansion, brand-term recognition, and so on. Most query-analysis techniques fall within the scope of NLP research, and this article does not elaborate on the theory. We focus on synonym-expansion technology, which generally must be trained on your own product and user logs; unlike word segmentation or brand-term recognition, no standard off-the-shelf library is directly applicable.

Synonym expansion is usually obtained by analyzing user session logs. If a user enters "apple phone" without getting the desired results and then enters "iphone", we create a transfer relationship between "apple phone" and "iphone". Based on such statistics, we can create a weighted graph of relationships between user queries.
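A sketch of mining such transfer relationships from session logs. The session format, function name, and the count-ratio weighting are assumptions for illustration, not the author's actual pipeline:

```python
from collections import defaultdict

def build_rewrite_graph(sessions, min_count=2):
    """Mine query-rewrite candidates from session logs.

    Each session is an ordered list of the queries a user issued; consecutive
    distinct queries are treated as candidate rewrites. The weight of an edge
    q1 -> q2 is count(q1 -> q2) / count(q1 -> anything).
    """
    trans = defaultdict(int)   # (q1, q2) -> transition count
    out = defaultdict(int)     # q1 -> total outgoing transitions
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            if q1 != q2:
                trans[(q1, q2)] += 1
                out[q1] += 1
    return {pair: c / out[pair[0]] for pair, c in trans.items() if c >= min_count}

sessions = [["apple phone", "iphone"]] * 8 + [["apple phone", "iphone 6"]] * 2
print(build_rewrite_graph(sessions))
```

The resulting edge weights are the kind of synonym degrees (e.g. 0.8) used to boost the rewritten queries below.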


When a user enters the query "apple phone", query analysis finds two synonyms for it: "iphone" with weight 0.8 and "iphone 6" with weight 0.5, where 0.8 and 0.5 indicate the degree of synonymy. We want to issue "apple phone", "iphone", and "iphone 6" as three queries simultaneously, giving each query a different weight according to its degree of synonymy. The boosting of query clauses provided by Elasticsearch supports this requirement. Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
Original query:
```
{
  "query": {
    "match": {
      "content": "apple phone"
    }
  }
}
```
Rewritten query:
```
{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": { "query": "apple phone", "boost": 10 } } },
        { "match": { "content": { "query": "iphone", "boost": 8 } } },
        { "match": { "content": { "query": "iphone 6", "boost": 5 } } }
      ]
    }
  }
}
```
Other techniques such as core-term recognition and ambiguity correction follow similar ideas, and this article does not elaborate on them.
## Other
Two other important techniques in commercial e-commerce search algorithms are the construction and application of a category system and personalization. We are still in the exploratory phase on both. The category system is mainly trained with machine-learning methods, and personalization is mainly achieved through query rewriting based on user profiles. Once these prove effective online, we will share the results.
## Summary
Search algorithms are a technology well worth the continuous investment of an e-commerce product. On the one hand, our engineers should have a solid technical background and be able to borrow from many mature technologies, avoiding reinventing the wheel. On the other hand, every product's search has its own characteristics, which must be studied in depth to arrive at a reasonable solution. The examples given in this article are representative of the flexible use of search technology in various aspects. In addition, the input-output ratio matters a great deal for commercial search, and we need to find shortcuts in many scenarios. For example, when building the category system, instead of investing heavy manpower in labeling data, we crawled other e-commerce sites' data for reference, saving 80% of the manpower. Due to the limited ability of the author, the schemes in this article are not guaranteed to be optimal solutions; if you find errors, please contact the author ([email protected]).
















