The Implementation Principles of Solr/Lucene Full-Text Retrieval


Solr is a standalone enterprise search application server that provides a Web service-like API. Users can submit XML files in a specific format to the search engine server via HTTP requests to generate an index, or issue lookup requests via HTTP GET and receive results in XML or JSON format. It is developed in Java 5 and built on Lucene.
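The interaction is plain HTTP. Below is a minimal sketch of both operations, assuming a hypothetical local Solr instance with a core named "demo" (the host, port, core name, and field names are illustrative, not part of any particular setup): it POSTs one document in Solr's XML update format to the /update handler, then queries the /select handler via GET and asks for a JSON response.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SolrHttpSketch {
    // Hypothetical local core; adjust the host, port, and core name to your setup.
    static final String BASE = "http://localhost:8983/solr/demo";

    public static void main(String[] args) throws Exception {
        // 1. Submit a document in Solr's XML update format via HTTP POST.
        String xml = "<add><doc>"
                   + "<field name=\"id\">1</field>"
                   + "<field name=\"title\">Lucene in Action</field>"
                   + "</doc></add>";
        HttpURLConnection post = (HttpURLConnection)
                new URL(BASE + "/update?commit=true").openConnection();
        post.setRequestMethod("POST");
        post.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        post.setDoOutput(true);
        try (OutputStream out = post.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update status: " + post.getResponseCode());

        // 2. Query the index via HTTP GET and ask for a JSON response.
        String q = URLEncoder.encode("title:lucene", "UTF-8");
        URL select = new URL(BASE + "/select?q=" + q + "&wt=json");
        try (Scanner in = new Scanner(select.openStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}
```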

Lucene is a subproject of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit: not a complete full-text search engine, but a full-text search engine architecture that provides a complete query engine and index engine, plus part of a text analysis engine (for the two Western languages English and German).

The basic principles of Lucene full-text search rely on the same techniques taught in Guo Jun's web search course: word segmentation, semantic and grammatical analysis, the vector space model, and so on. Below is a reprint of a more detailed blog post, kept here as a memo: http://www.cnblogs.com/Guochunguang/articles/3641008.html

I. Overview

According to the definition at http://lucene.apache.org/java/docs/index.html:

Lucene is an efficient, Java-based full-text retrieval library.

So before you can understand Lucene, you need to spend some time understanding full-text retrieval.

So what is full-text search? That starts with the data in our lives.

The data in our lives generally falls into two categories: structured data and unstructured data. Structured data refers to data with a fixed format or limited length, such as database records, metadata, and so on. Unstructured data refers to data of indefinite length or with no fixed format, such as email, Word documents, and so on.

Of course, some places also mention a third category, semi-structured data, such as XML and HTML, which can be processed as structured data when necessary, or can have its plain text extracted and be treated as unstructured data.

Unstructured data is also called full-text data.

According to this classification of data, search is also divided into two kinds. Search over structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows Search to find files by name, type, or modification time. Search over unstructured data: for example, Windows Search can also search file content, the grep command under Linux does the same, and Google and Baidu search huge amounts of content data.

There are two main ways to search unstructured data, that is, full-text data:

One is the sequential scan method (serial scanning). To find the files that contain a certain string, you look at each document in turn, reading it from beginning to end; if the document contains the string, it is one of the files we are looking for, and you then move on to the next file until all files have been scanned. Windows Search can search file content this way, only rather slowly. If you have an 80 GB hard drive and want to find every file on it containing a certain string, it can easily take hours. The grep command under Linux works the same way. This method may seem primitive, but for small amounts of data it is the most direct and convenient approach. For large numbers of files, however, it is slow.
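To make the idea concrete, here is a small sketch of sequential scanning (the directory name "docs" and the target string are hypothetical): it reads every file in a directory from start to finish and reports the ones that contain the string.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SerialScan {
    public static void main(String[] args) throws IOException {
        String target = "lucene";        // the string we are looking for
        Path dir = Paths.get("docs");    // hypothetical directory of text files
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // Read the whole file and check it from beginning to end.
                    String content = new String(Files.readAllBytes(p));
                    if (content.contains(target)) {
                        System.out.println("match: " + p);
                    }
                } catch (IOException e) {
                    // skip unreadable files
                }
            });
        }
    }
}
```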

One might say: sequential scanning of unstructured data is slow, while searching structured data is relatively fast (because structured data has a certain structure, a suitable search algorithm can be applied to speed things up), so why not find a way to give unstructured data a certain structure as well?

This idea is very natural, and it forms the basic idea of full-text retrieval: extract part of the information from the unstructured data, reorganize it so that it has a certain structure, and then search this data that now has structure, thereby achieving relatively fast search.

This part of the information, extracted from unstructured data and then reorganized, is what we call the index.

This is rather abstract, so a few examples make it easier to understand. Take a dictionary: the pinyin table and the radical lookup table are the dictionary's index, while the explanation of each character is unstructured. If a dictionary had no pinyin table and no radical lookup table, finding a character in the vast Hanyu Da Zidian could only be done by sequential scanning. However, some information about each character can be extracted and structured, such as its pronunciation: the initials and finals are few and can be enumerated, so the pronunciations are listed in a fixed order, and each pronunciation points to the page number of that character's detailed explanation. We search the structured pinyin to find a pronunciation, then follow the page number it points to, and so locate our unstructured data, that is, the explanation of the character.

The process of first creating an index and then searching the index is called full-text search (Full-text Search).

The following picture is from Lucene in Action; it does not merely describe Lucene's retrieval process, but rather the general process of full-text retrieval.

Full-text retrieval is divided into two processes: index creation (indexing) and index search (search). Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index. Index search: the process of receiving the user's query request, searching the created index, and then returning the results.

Therefore, there are three important problems in full-text search:

1. What exactly is stored in the index? (Index)

2. How is the index created? (Indexing)

3. How is the index searched? (Search)

Let's take a look at each of these issues in order.

II. What is stored in the index?

What exactly does the index need to save?

First, let's see why sequential scans are slow:

The essential reason is that the information we want to search for is inconsistent with the way information is stored in unstructured data.

The information stored in unstructured data is the set of strings each file contains; that is, given a file, finding its strings is relatively easy: a mapping from file to string. The information we want to search for, however, is which files contain a given string; that is, given a string, we want the files: a mapping from string to file. The two are opposites. Therefore, if the index can store the mapping from string to file, it will greatly improve the search speed.

Since the mapping from string to file is the reverse of the file-to-string mapping, the index that stores this information is called a reverse index (inverted index).

The information stored in the reverse index is generally as follows:

Suppose there are 100 documents in my document collection. For convenience, we number the documents from 1 to 100 and obtain the following structure:

The left-hand side holds a series of strings, called the dictionary.

Each string points to a linked list of the documents that contain it; this linked list is called a posting list (inverted list).

With such an index, the stored information is consistent with the information to be searched for, which greatly speeds up searching.
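A minimal way to represent this structure in code (my own sketch, not Lucene's internal format) is a map from each dictionary term to a sorted list of document numbers, its posting list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {
    // dictionary: term -> posting list of document IDs containing that term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Record that document docId contains the given term.
    public void add(String term, int docId) {
        List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
        // Documents are added in increasing order, so the list stays sorted;
        // skip the ID if it is already the last entry (avoid duplicates).
        if (list.isEmpty() || list.get(list.size() - 1) != docId) {
            list.add(docId);
        }
    }

    // Return the posting list for a term (empty if the term is not in the dictionary).
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```

Using a TreeMap keeps the dictionary itself in sorted order, which matches the sorted dictionary built in section III below.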

For example, to find the documents that contain both the string "Lucene" and the string "SOLR", we only need the following steps:

1. Take out the linked list of documents containing the string "Lucene".

2. Take out the linked list of documents containing the string "SOLR".

3. Merge the two linked lists to find the files that contain both "Lucene" and "SOLR".
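Step 3 is the classic merge (intersection) of two sorted posting lists. A small sketch of it follows; the document IDs are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingMerge {
    // Intersect two sorted posting lists: advance the pointer with the smaller head.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }
            else if (da < db) i++;
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> lucene = Arrays.asList(3, 7, 12, 40);  // docs containing "Lucene"
        List<Integer> solr   = Arrays.asList(7, 9, 40, 55);  // docs containing "SOLR"
        System.out.println(intersect(lucene, solr));          // [7, 40]
    }
}
```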

Seeing this, one might say that full-text search does speed up searching, but once the indexing process is added, the total may not be much faster than sequential scanning. Indeed, including the indexing process, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small, and creating an index over a very large amount of data is itself a slow process.

However, there is a difference between the two: sequential scanning has to scan everything every time, whereas the index only needs to be created once and can then be used again and again. Each search does not repeat the index creation process; it only searches the already-created index.

This is also one of the advantages of full-text search over sequential scanning: index once, use many times.

III. How to create an index

The index creation process for full-text indexing generally has the following steps:

Step one: prepare some original documents to be indexed (Documents).

For the convenience of explaining the index creation process, two files are used as examples here:

File one: Students should be allowed to go out with their friends, but not allowed to drink beer.

File two: My friend Jerry went to school to see his students but found them drunk which is not allowed.

Step two: pass the original documents to the tokenizer component (Tokenizer).

The tokenizer component (Tokenizer) does the following things (this process is called tokenization):

1. Divide the document into individual words.

2. Remove punctuation.

3. Remove stop words (Stop words).

So-called stop words (Stop words) are the most common words in a language. Because they have no special meaning, in most cases they cannot be used as search keywords, so they are removed when the index is created in order to reduce the index size.

English stop words (Stop words) include "the", "a", "this", and so on.

For each language, the tokenizer component (Tokenizer) has its own stop-word (Stop word) collection.

The result obtained after the tokenizer (Tokenizer) is called a token (Token).

In our example, we get the following tokens (Token):

"Students", "Allowed", "go", "their", "Friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", " See "," he "," students "," found "," them "," drunk "," allowed ".

Step three: pass the resulting tokens (Token) to the language processing component (Linguistic Processor).

The language processing component (Linguistic Processor) mainly performs language-specific processing on the tokens (Token) it receives.

For English, the language processing component (Linguistic Processor) generally does the following:

1. Change to lowercase (lowercase).

2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.

3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.

Similarities and differences between stemming and lemmatization: What is the same: both stemming and lemmatization reduce words to a root form. What is different: stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive"; lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive". Their algorithms also differ: stemming mainly applies fixed rules, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", and turning "tional" into "tion"; lemmatization mainly relies on a stored dictionary to do the transformation, for example a dictionary with mappings from "driving" to "drive", "drove" to "drive", and "am, was, is" to "be", so that the change is just a dictionary lookup. Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both methods achieve the same transformation.
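To make the contrast concrete, here is a toy sketch; it is my own simplification (not the Porter stemmer or any real lemmatizer), with the rules and dictionary entries taken from the examples above:

```java
import java.util.HashMap;
import java.util.Map;

public class RootForms {
    // Stemming: apply a few fixed suffix rules (greatly simplified).
    static String stem(String word) {
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3) + "e"; // "driving" -> "drive"
        if (word.endsWith("s"))   return word.substring(0, word.length() - 1);       // "cars" -> "car"
        return word;
    }

    // Lemmatization: look the word up in a stored dictionary of known forms.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("driving", "drive");
        LEMMAS.put("drove", "drive");
        LEMMAS.put("am", "be");
        LEMMAS.put("was", "be");
        LEMMAS.put("is", "be");
    }
    static String lemmatize(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("cars"));        // car
        System.out.println(stem("driving"));     // drive
        System.out.println(lemmatize("drove"));  // drive
    }
}
```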

The result of the language processing component (Linguistic Processor) is called a term (Term).

In our example, the resulting terms (Term) are as follows:

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".

It is precisely because of this language processing step that a search for "drove" can also find "drive".

Step four: pass the resulting terms (Term) to the index component (Indexer).

The index component (Indexer) mainly does the following things:

1. Use the resulting word (term) to create a dictionary.

In our example, the dictionary is as follows:

Term        Document ID
student     1
allow       1
go          1
their       1
friend      1
allow       1
drink       1
beer        1
my          2
friend      2
jerry       2
go          2
school      2
see         2
his         2
student     2
find        2
them        2
drink       2
allow       2

2. Sort the dictionary in alphabetical order.

Term        Document ID
allow       1
allow       1
allow       2
beer        1
drink       1
drink       2
find        2
friend      1
friend      2
go          1
go          2
his         2
jerry       2
my          2
school      2
see         2
student     1
student     2
their       1
them        2
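A compact sketch of these two steps over our example files follows (my own illustration, not Lucene's Indexer API): each term is paired with its document ID, and the pairs are then sorted alphabetically by term, reproducing the table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DictionaryBuilder {
    public static void main(String[] args) {
        // Terms produced by the linguistic processor for file one and file two.
        List<String> doc1 = Arrays.asList("student", "allow", "go", "their",
                "friend", "allow", "drink", "beer");
        List<String> doc2 = Arrays.asList("my", "friend", "jerry", "go", "school",
                "see", "his", "student", "find", "them", "drink", "allow");

        // 1. Create the dictionary: one (term, document ID) entry per term occurrence.
        List<String[]> dictionary = new ArrayList<>();
        for (String term : doc1) dictionary.add(new String[]{term, "1"});
        for (String term : doc2) dictionary.add(new String[]{term, "2"});

        // 2. Sort the dictionary in alphabetical order of the term.
        dictionary.sort(Comparator.comparing((String[] entry) -> entry[0]));

        for (String[] entry : dictionary) {
            System.out.println(entry[0] + "\t" + entry[1]);
        }
    }
}
```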
