Learning Solr (1): Search Basics


I have been studying Solr for some time, and these articles summarize, in my own way, what I have learned so far. This is a series; some statements may be inaccurate and there may be mistakes.

Corrections are welcome.

1. The Purpose of Search Engines

Search engines are everywhere in our lives. Besides the web searches we use every day, such as Baidu and Google, there are e-commerce searches, such as searching for books on Amazon. Beyond web search, a company may also need to search its internal knowledge base, which is commonly called enterprise search. The main purpose of search today is to quickly find, among a massive amount of unstructured data, the information that matches what we mean. Note a few key phrases here.

"Huge amount of information": Search engine general processing of data is very large, the general database in the search for a very large amount of data, such as hundreds of millions of data, even if indexed, query speed is not fast, can not meet the needs of reality.

"In line with our meaning": I think this can be called Semantic search ( not yet seen in this name:), the first may be a database can barely reach, but the database search is difficult to achieve the purpose of meaning, which in the search process involves the conversion of synonyms. For example, you search for SOLR in the Amazon, you can find Lucene, search engine-related books, which involves the conversion of synonyms, I think this is the most important feature of the search engine.

In general, a search engine matches the content we search for against the existing documents by relevance, scores the similarity, and sorts the results in order of similarity, with the most similar documents first.

Unstructured data: data with no fixed format and no fixed length, such as an article. It is also known as full-text data.

2. How Search Engines Work

2.1 Common ways of retrieving unstructured data

As described in the previous section, what a search engine mainly processes is unstructured data. As the name suggests, unstructured data has no fixed structure, which is exactly what makes it hard to process; structured data can be handled with a database. Unstructured data is generally handled in one of two ways:

The first is sequential scanning, for example using grep on Linux to search documents for a specific string. This works reasonably well when the number of documents is small.

The second is full-text search: turn the unstructured data into something structured by extracting information from it (extracting words from the documents), reorganize that information, and then search against it.

The information that is extracted and reorganized is called an index.

The second method is the main one used by search engines.

2.2 Three problems of full-text search

This involves three questions: 1. What information is stored in the index? 2. How is the index built? 3. How do we search using the index?

What does the index hold?

Let's think about it: since full-text search is done through an index, the contents of the index must be whatever makes it convenient to find the information we want.

Take searching articles as an example. If we have a huge number of articles, we assign each one a number to make the articles easier to manage.

Searching then means finding the correspondence between the search keywords and the article numbers, and then locating the documents by their numbers. So the index must naturally store both the part that keywords are matched against and the part that holds the article numbers.

The real situation is the same: the index used for full-text search is simply the correspondence between words and article numbers. Because a word can appear in multiple articles, each word in the index is usually followed by a series of article numbers (a linked list of document numbers).

Getting words from an article is the natural direction, so mapping words back to documents is the inverse of mapping documents to words; an index that stores this word-to-document information is called an inverted index.

(A schematic diagram of Lucene's inverted index, borrowed from the web, is omitted here.) Solr is built on Lucene, so Solr's index is also an inverted index.

In this structure, the set of keywords is generally called the dictionary, and the series of document numbers (article numbers) that follows each keyword is called the posting list (inverted list).
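
To make the structure concrete, here is a minimal sketch in plain Java (this is not Lucene or Solr code; the words and document numbers are made up for illustration) of a dictionary whose entries point to posting lists:

```java
import java.util.*;

public class InvertedIndexLookup {
    public static void main(String[] args) {
        // Dictionary: word -> posting list of document numbers.
        // The words and document numbers below are made up for illustration.
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("solr",   List.of(1, 3, 7));
        index.put("lucene", List.of(1, 2, 7, 9));
        index.put("search", List.of(2, 3, 5, 7));

        // A lookup is simply: find the word in the dictionary,
        // then read off its posting list of document numbers.
        System.out.println("solr -> " + index.getOrDefault("solr", List.of()));
    }
}
```

The point of the structure is that finding all documents containing a word costs one dictionary lookup, instead of scanning every document.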

How indexes are built

The process of creating an index (diagram again borrowed from the web and omitted here) is roughly as follows:

Recall that the index we want to build is an inverted index, that is, a dictionary plus posting lists.

First we need to get words out of the documents, so the first job is word segmentation. The component that splits the text is the tokenizer (Tokenizer), and its output is tokens (Token).

Next, the tokens (Token) are processed by the language processing component (Linguistic Processor), whose output is terms (Term). This part mainly performs transformations such as normalizing word forms.

After this step the terms are cleaner and more uniform. The final step is to pass the terms to the indexing component, which builds the index.

1. The main work of tokenization

1) Split the text into words. English is relatively easy because words are separated by spaces; Chinese requires either dictionary-based word segmentation or splitting on individual characters.

2) Remove punctuation.

3) Remove stop words. Stop words are words that carry no special meaning in the language, such as "this", "is", "a" in English and the corresponding function words in Chinese.

Solr has dedicated configuration files for stop words, whose names begin with stopwords_.

This makes the resulting terms more meaningful and keeps the index shorter. Stop words occur in many documents, so if they were added to the index, the list of document numbers that follows them would be very long; the jargon for this is that the posting list (the "zipper") becomes too long. That not only takes up too much space but also makes searches slower.
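
A minimal sketch of this tokenization step in plain Java (this is not Solr's actual analyzer chain; the stop-word list is a tiny made-up sample, whereas Solr reads its lists from the stopwords_* files mentioned above):

```java
import java.util.*;

public class SimpleTokenizer {
    // Tiny, made-up stop-word list; Solr reads its lists from stopwords_* files.
    private static final Set<String> STOP_WORDS =
            Set.of("this", "is", "a", "the", "in", "and");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on whitespace, strip punctuation, drop stop words.
        // Lowercasing is deliberately left to the language processing step.
        for (String raw : text.split("\\s+")) {
            String token = raw.replaceAll("[\\p{Punct}]", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase())) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a book about Solr, Lucene and search."));
        // -> [book, about, Solr, Lucene, search]
    }
}
```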

Chinese word segmentation is much more complicated because there are no obvious separators between words. Solr's default StandardTokenizerFactory splits Chinese text character by character; the benefit is that character-level matching works, and the drawback is that the index becomes larger.

You can also use open-source tokenizer components, such as the Paoding tokenizer, the IK tokenizer, and so on.

As for efficiency, the indexing efficiency of Solr's default tokenization is roughly double that of the IK tokenizer, but its query efficiency is about four times slower; the reason is again that character-by-character segmentation makes the posting lists too long.

2. The main work of language processing

1) For English, convert the case (for example, lowercase everything).

2) Reduce words to their root or simplified form (for example, "cars" becomes "car" and "running" becomes "run").

The output of the language processing component is terms (Term).
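
A minimal sketch of this step in plain Java (the crude suffix stripping below only covers the two examples above; a real system would use a proper stemmer, and this is not what Solr actually does internally):

```java
import java.util.*;

public class SimpleLinguisticProcessor {
    // Turn a token into a term: lowercase it, then apply a very crude
    // "stemming" that only handles the two examples from the text above.
    public static String toTerm(String token) {
        String term = token.toLowerCase();
        if (term.endsWith("ning")) {                            // running -> run
            term = term.substring(0, term.length() - 4);
        } else if (term.endsWith("s") && term.length() > 3) {   // cars -> car
            term = term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        for (String token : List.of("Cars", "Running", "Solr")) {
            System.out.println(token + " -> " + toTerm(token));
        }
        // Cars -> car, Running -> run, Solr -> solr
    }
}
```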

3. Passing the terms to the indexing component

1) Use the terms to build the correspondence table between the dictionary and the document IDs.

2) Sort the dictionary entries in lexicographical order.

3) Merge identical dictionary entries; their document IDs become a linked list of document IDs.

A real inverted index also stores information such as the positions of each term within a document, its frequency of occurrence, and so on.
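
A minimal sketch of these three steps in plain Java (the documents are made up and already tokenized into terms; the posting lists here record only document IDs and term frequencies, not positions):

```java
import java.util.*;

public class IndexBuilder {
    public static void main(String[] args) {
        // Made-up documents, already tokenized and normalized into terms.
        Map<Integer, List<String>> docs = Map.of(
                1, List.of("solr", "search", "engine"),
                2, List.of("lucene", "search"),
                3, List.of("solr", "lucene"));

        // Dictionary -> posting list of (docId -> term frequency).
        // TreeMap keeps the dictionary in lexicographical order, and identical
        // terms merge automatically because all (term, docId) pairs land in
        // the same dictionary entry.
        TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
        // engine -> {1=1}
        // lucene -> {2=1, 3=1}
        // ...
    }
}
```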

Search process

1) We enter a query statement using the search engine's syntax. Taking the popular search engine Baidu as an example, some common syntax:

1. If you want Baidu to search for a phrase as a whole, without splitting it into words, enclose it in double quotes.

2. If you want to exclude certain results, you can use a minus sign, for example "mobile phone -promotion", which hides Baidu's promoted ads.

3. If the search keywords are in an OR relationship, you can search in the form xxx|yyy.

2) The query statement goes through lexical analysis, syntax analysis, and semantic processing.

This is similar to the indexing process: after tokenization and conversion, there is one extra task, distinguishing keywords from search terms. Keywords express the logical relationship between search terms; in a Solr search they are written as operators, for example AND marks a logical-and relationship and OR marks a logical-or relationship. After this step, a syntax tree is formed.

3) Search

In Solr, this is done in roughly three steps:

1. Look up the document IDs that satisfy each term in the inverted index. This first lookup returns the document IDs along with a match score.

2. Combine the results according to the syntax tree (logical AND, OR, and so on) to get the final list of document IDs.

3. Using this list of document IDs, together with what the query asks for, fetch the specific content and return it.
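
A minimal sketch of step 2 in plain Java (reusing the made-up posting lists from the earlier sketches): a logical AND becomes an intersection of posting lists, and a logical OR becomes a union.

```java
import java.util.*;

public class BooleanSearch {
    public static void main(String[] args) {
        // Posting lists, matching the made-up index built earlier.
        Map<String, Set<Integer>> index = Map.of(
                "solr",   Set.of(1, 3),
                "lucene", Set.of(2, 3),
                "search", Set.of(1, 2));

        // "solr AND lucene" -> intersection of the two posting lists.
        Set<Integer> and = new TreeSet<>(index.get("solr"));
        and.retainAll(index.get("lucene"));

        // "solr OR lucene" -> union of the two posting lists.
        Set<Integer> or = new TreeSet<>(index.get("solr"));
        or.addAll(index.get("lucene"));

        System.out.println("solr AND lucene -> " + and);  // [3]
        System.out.println("solr OR lucene  -> " + or);   // [1, 2, 3]
    }
}
```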

4) Sorting and returning results

Once the list of matching documents has been found, the query statement itself is treated as a document, the relevance between it and each matched document is computed and scored, and the documents with higher relevance are placed first.

Computing the relevance score is fairly involved and mainly depends on:

Term frequency (TF): the number of times a term appears in a document.

Document frequency (DF): the number of documents in which the term appears.

Term weight: how important the term is to the document.
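
A minimal sketch of this kind of scoring in plain Java, using the classic TF-IDF weighting (tf × log(N/df)); this only illustrates the idea and is not the exact formula Lucene or Solr uses, and the numbers are made up:

```java
public class TfIdfScore {
    // Classic TF-IDF term weight: tf * log(N / df).
    // This shows the general idea only; Lucene/Solr use their own, more refined formulas.
    static double weight(int tf, int df, int totalDocs) {
        return tf * Math.log((double) totalDocs / df);
    }

    public static void main(String[] args) {
        int totalDocs = 10;  // made-up collection size

        // A term that appears 3 times in a document but in only 2 documents overall
        // gets a higher weight than one that appears 3 times but occurs in 8 documents:
        // the rarer a term is across the collection, the better it distinguishes documents.
        System.out.println("rare term:   " + weight(3, 2, totalDocs));
        System.out.println("common term: " + weight(3, 8, totalDocs));
    }
}
```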
