Search engine: a cue-word recommendation algorithm

Source: Internet
Author: User

Search engines arguably have some of the highest technical content of any Internet application. The form of the application is simple: the user enters query words and the search engine returns results. The goals a search engine has to meet, however, are demanding: more complete, faster, more accurate. How to make search results more accurate has always been a hard problem for search engines.

My company recently started developing a vertical search engine for a particular industry. As a core member of the project team, I am mainly responsible for research on the core algorithms. I have only just come into contact with this field and am still feeling my way, so there is a long way to go.

Let me first describe the background of the project. It is a vertical search engine for a single industry. Its users fall into two main categories: ordinary users and professional users. The project is divided into a crawler technology group, an engine group, a big data analysis group, and an algorithm group. The crawler, the construction of the thesauri, and the choice of engine are not the focus of this article and are covered only briefly; the focus is the design of the recommendation algorithm.

First, the web crawler

The system's data has to be crawled from several professional websites. We tried a few crawlers and finally chose Heritrix as our main crawler framework. The main reason is that, although it has many configuration items, it is flexible and suits our requirements particularly well. The crawler technology group also implemented a crawler of its own, used mainly for data whose addresses are relatively fixed.

Second, building the thesauri

The thesauri comprise a thesaurus of professional subject headings, a thesaurus of common industry terms, a general thesaurus, a stop-word list, and a thesaurus for sentiment analysis.

The professional thesaurus is built manually. We produced a number of auxiliary tools with which professionals select, merge, and delete keywords.

The other thesauri start from the word libraries of Sogou and other input methods. On the basis of these thesauri, the documents fetched by the crawler are then quantified (vectorized).
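As a rough illustration of quantifying a document against a thesaurus, the sketch below counts how often each thesaurus word occurs in an already segmented document, ignoring stop words. The word lists, the sample sentence, and the function name `vectorize` are all hypothetical; the article does not say which weighting the project actually uses.

```python
from collections import Counter

# Hypothetical, tiny stand-ins for the real thesauri (assumption).
GENERAL_THESAURUS = ["contract", "court", "appeal", "tenant"]
STOP_WORDS = {"the", "a", "of", "to"}


def vectorize(tokens, vocabulary, stop_words):
    """Return the term counts of `tokens`, ordered like `vocabulary`."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [counts[word] for word in vocabulary]


# A segmented document; real input would come from the word segmenter.
doc = "the tenant took the contract to court".split()
vector = vectorize(doc, GENERAL_THESAURUS, STOP_WORDS)
```

Raw counts are only the simplest choice; a TF or TF-IDF weight could be substituted in `vectorize` without changing the rest.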

Third, building the engine

The data is denoised, segmented into words, and run through feature extraction, and the resulting data is then imported into Solr.

Fourth, the recommendation algorithm

When a user enters a keyword query, how can we make the query more accurate? Our idea is this: for the user's input, we offer several words that are very similar to the entered keyword and use them as additional query conditions. If the algorithm is good enough, this should greatly increase the accuracy of the search results. The idea behind the algorithm is described in detail below:

From a vectorization point of view, each document corresponds to a vector whose i-th component represents feature item i.

Each feature item is itself a vector determined by the word, the position of the word, the TF (term frequency), and so on. For version 1, we use only the word and its position. We first use classification rules to divide the documents into several classes, and then compute the following for each class:
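A minimal sketch of the version 1 feature items, assuming a document has already been segmented into a list of words. Each word is mapped to the positions at which it occurs. The function name `feature_items` and the sample tokens are hypothetical.

```python
from collections import defaultdict


def feature_items(tokens):
    """Map each word to the list of positions at which it occurs."""
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index)
    return dict(positions)


# Sample segmented document (assumption, for illustration only).
tokens = ["court", "rejects", "appeal", "court", "ruling"]
items = feature_items(tokens)
```

Later versions could extend each entry with TF or other attributes without changing the overall shape of the mapping.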

For each class we define the similarity of two feature items, together with a distance formula.

Using this distance formula, we compute the similarity of the feature items of each document pairwise.

It follows that, for each class, taking the feature items as vertices and the similarity distances as edge weights yields the following undirected graph:

(Figure: LawNet)

By analogy with WordNet and HowNet, we call this graph LawNet.
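The construction of such a graph can be sketched as follows. The article's own distance formula is not reproduced in the text, so the distance used here (the smallest positional gap between two words, minimized over the documents of a class) is purely a stand-in assumption, as are the names `min_gap` and `build_graph`.

```python
from itertools import combinations


def min_gap(positions_a, positions_b):
    """Smallest absolute difference between any two positions."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def build_graph(documents):
    """Build an undirected weighted graph as {frozenset({u, v}): distance}.

    `documents` is a list of {word: [positions]} dicts from one class.
    The distance is a placeholder, not the article's formula.
    """
    edges = {}
    for items in documents:
        for u, v in combinations(sorted(items), 2):
            gap = min_gap(items[u], items[v])
            key = frozenset((u, v))
            # Keep the smallest distance seen across all documents.
            edges[key] = min(gap, edges.get(key, gap))
    return edges


docs = [
    {"court": [0, 3], "appeal": [2], "ruling": [4]},
    {"court": [5], "appeal": [0]},
]
graph = build_graph(docs)
```

Using a `frozenset` as the edge key makes the graph undirected by construction: the pair (u, v) and the pair (v, u) are the same key.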

Our problem thus becomes: given any vertex, find several (for example, ten) other vertices such that the spanning tree over these vertices is minimal, or such that the subgraph they form has the smallest total edge weight. This is a locally optimal, probabilistic problem. That is, we only need to reach a level of quality that users accept: if the target probability is 90%, then out of 10,000 user queries we need to give suitable cue words for 9,000 of them.
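One possible way to write the spanning-tree reading of this problem down is given below. All symbols are assumptions introduced here and do not appear in the original: G is the graph of one class with edge weights w, v is the vertex for the user's keyword, k is the number of cue words wanted, G[S] is the subgraph induced by the vertex set S (assumed connected), and MST denotes its minimum spanning tree.

```latex
\text{Given } G = (V, E, w),\quad v \in V,\quad k \in \mathbb{N}: \qquad
S^{*} \;=\; \operatorname*{arg\,min}_{\substack{S \subseteq V,\; v \in S \\ |S| = k + 1}}
\;\sum_{e \,\in\, \mathrm{MST}(G[S])} w(e)
```

The cue words are then the vertices in S* other than v. Under this reading an exact answer is expensive to compute, which fits the article's choice to settle for a result that is good enough most of the time.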

So far I have tried two solutions:

The first is Prim's algorithm.
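How Prim's algorithm might be applied here is sketched below, under the assumption that the tree is grown from the user's keyword and stopped as soon as k vertices have been added. The function name `prim_neighbours`, the adjacency format, and the sample weights are hypothetical. Stopping early makes this a greedy heuristic: it is not guaranteed to return the cheapest possible tree on k vertices.

```python
import heapq


def prim_neighbours(adjacency, start, k):
    """Grow a tree from `start` by Prim's rule; return the first k vertices added.

    `adjacency` maps vertex -> {neighbour: distance} and must be symmetric.
    """
    visited = {start}
    picked = []
    frontier = [(w, v) for v, w in adjacency[start].items()]
    heapq.heapify(frontier)
    while frontier and len(picked) < k:
        _, vertex = heapq.heappop(frontier)  # cheapest edge leaving the tree
        if vertex in visited:
            continue
        visited.add(vertex)
        picked.append(vertex)
        for nxt, w in adjacency[vertex].items():
            if nxt not in visited:
                heapq.heappush(frontier, (w, nxt))
    return picked


# Sample graph (assumption, for illustration only).
adjacency = {
    "court": {"appeal": 1, "ruling": 4, "judge": 2},
    "appeal": {"court": 1, "ruling": 1},
    "ruling": {"court": 4, "appeal": 1},
    "judge": {"court": 2},
}
suggestions = prim_neighbours(adjacency, "court", 2)
```

In this sample "ruling" is suggested before "judge" even though its direct edge to "court" is heavier, because it is reached cheaply through "appeal".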

The second first runs the Floyd algorithm to compute the shortest distance between every pair of vertices, treats each such distance as an edge, and collects these edges into a set. Then, for any given vertex, it selects from this edge set the N smallest edges that contain the vertex.
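The second approach can be sketched as below, using the Floyd-Warshall all-pairs shortest-path algorithm followed by a sort. The function names, the input format, and the sample weights are hypothetical, and a real graph of this kind may be too large for the cubic running time without some pruning.

```python
import math


def floyd_warshall(vertices, weights):
    """All-pairs shortest distances; `weights` maps (u, v) -> distance (undirected)."""
    dist = {u: {v: (0 if u == v else math.inf) for v in vertices} for u in vertices}
    for (u, v), w in weights.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for mid in vertices:
        for u in vertices:
            for v in vertices:
                if dist[u][mid] + dist[mid][v] < dist[u][v]:
                    dist[u][v] = dist[u][mid] + dist[mid][v]
    return dist


def top_n(dist, vertex, n):
    """The n vertices closest to `vertex` by shortest-path distance."""
    others = [(d, v) for v, d in dist[vertex].items()
              if v != vertex and d < math.inf]
    return [v for _, v in sorted(others)[:n]]


# Sample graph (assumption, for illustration only).
vertices = ["court", "appeal", "ruling", "judge"]
weights = {("court", "appeal"): 1, ("appeal", "ruling"): 1,
           ("court", "ruling"): 4, ("court", "judge"): 3}
dist = floyd_warshall(vertices, weights)
closest = top_n(dist, "court", 2)
```

Because the distance table depends only on the graph, it can be computed once per class offline, so that answering a query is reduced to a lookup and a sort.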
