Lucene: The basic principle of full-text search

Last Update:2018-07-19 Source: Internet

Author: User

Reposted from http://www.cnblogs.com/guochunguang/articles/3641008.html

I. Overview

According to the definition at http://lucene.apache.org/java/docs/index.html:

Lucene is an efficient, Java-based full-text retrieval library.

So before getting to know Lucene, it is worth spending some time on understanding full-text search.

So what is full-text search? The answer starts with the data in our lives.

Data in our lives generally falls into two kinds: structured data and unstructured data.

Structured data: data with a fixed format or finite length, such as databases and metadata.

Unstructured data: data with variable length or no fixed format, such as email and Word documents.

Some sources also mention a third kind, semi-structured data, such as XML and HTML. When needed, it can be processed as structured data, or its plain text can be extracted and processed as unstructured data.

Unstructured data is also called full-text data.

According to this classification of data, search is also divided into two types:

Search for structured data: for example, searching a database with SQL statements, or searching metadata, as when Windows Search looks up files by name, type, or modification time.

Search for unstructured data: for example, Windows Search can also search file contents, the grep command on Linux does the same, and Google and Baidu search massive amounts of content.

There are two main ways to search unstructured data, that is, full-text data:

One is sequential scanning (serial scanning). To find the files whose content contains a certain string, for example, a sequential scan looks at the documents one by one and reads each from beginning to end. If a document contains the string, it is one of the files we are looking for; the scan then moves on to the next file until all files have been scanned. Windows Search can search file contents this way, but it is rather slow: on an 80 GB hard drive, finding a file that contains a given string can easily take several hours. The grep command on Linux works the same way. The method may seem primitive, but for a small amount of data it is the most direct and convenient. For a large number of files, however, it is very slow.
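As a rough illustration of sequential scanning, the sketch below checks every document in full on every search. The function name `sequential_scan` and the three sample documents are made up for this example; real tools such as grep read files from disk rather than from a dictionary.

```python
def sequential_scan(documents, needle):
    """Return the ids of all documents whose text contains `needle`."""
    matches = []
    for doc_id, text in documents.items():
        # Every document is examined on every search; nothing is reused.
        if needle in text:
            matches.append(doc_id)
    return matches


# Hypothetical sample data, keyed by document id.
documents = {
    1: "Lucene is a full-text search library.",
    2: "grep scans files line by line.",
    3: "Solr is built on top of Lucene.",
}

print(sequential_scan(documents, "Lucene"))
```

The cost of each search grows with the total size of all documents, which is why this approach only suits small collections.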

One might say: sequential scanning of unstructured data is slow, while searching structured data is relatively fast (because structured data has a structure that search algorithms can exploit). So wouldn't it be enough to find a way to give our unstructured data some structure?

This idea is natural, and it is exactly the basic idea of full-text search: extract part of the information from the unstructured data and reorganize it so that it has a certain structure, then search this structured data, thereby making the search relatively fast.

The information that is extracted from unstructured data and then reorganized is what we call the index.

This description is rather abstract, but an example makes it easy to understand. Take a Chinese dictionary: its pinyin table and radical lookup table are the dictionary's index. The explanation of each character is unstructured; without the pinyin table and the radical table, finding a character in the vast dictionary would require a sequential scan. However, some information about a character can be extracted and structured. Pronunciation, for example, is fairly structured: it divides into initials and finals, of which there are only a few that can be enumerated one by one. So the pronunciations are taken out and arranged in a certain order, and each pronunciation points to the page number of the detailed explanation. We search by the structured pinyin, and then follow the page number it points to in order to find our unstructured data, the explanation of the character.

This process of first building an index and then searching the index is called full-text search.

The following diagram comes from Lucene in Action, but it describes not only Lucene's retrieval process but the general process of full-text retrieval.

Full-text search is generally divided into two processes: index creation (Indexing) and index search (Search).

Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index.

Index search: the process of receiving the user's query request, searching the created index, and returning the results.

So there are three important questions in full-text search:

1. What is stored in the index? (Index)

2. How is the index created? (Indexing)

3. How is the index searched? (Search)

The following sections examine each of these questions in turn.

II. What exactly is stored in the index?

What exactly does the index need to store?

First, let's see why a sequential scan is slow:

It is because the information we want to search for does not match the way information is stored in unstructured data.

What unstructured data stores is the strings that each file contains: given a file, it is relatively easy to obtain its strings, that is, a mapping from file to string. What we want to search for is which files contain a given string: given a string, we want the files, that is, a mapping from string to file. The two are exactly opposite. So if the index always stores the mapping from string to file, search speed improves greatly.

Because the mapping from string to file is the reverse of the mapping from file to string, the index that stores this information is called an inverted index.
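The reversal of the mapping can be sketched in a few lines of Python. The `forward` data and the `invert` helper are hypothetical and only illustrate the idea; they do not reflect how Lucene stores its index.

```python
from collections import defaultdict


def invert(forward):
    """Turn a file -> strings mapping into a string -> files mapping."""
    inverse = defaultdict(list)
    for doc_id in sorted(forward):
        # set() ensures each document is listed once per string.
        for word in sorted(set(forward[doc_id])):
            inverse[word].append(doc_id)
    return dict(inverse)


# Hypothetical forward mapping: what each file contains.
forward = {
    1: ["lucene", "search"],
    2: ["solr", "search"],
    3: ["lucene", "solr"],
}

inverse = invert(forward)
print(inverse["lucene"])
```

Because documents are visited in ascending id order, every list in the result is already sorted by document id.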

The information stored in an inverted index generally looks like this:

Suppose my document collection contains 100 documents. For convenience, we number the documents from 1 to 100 and get the following structure:

On the left is a series of strings, called the dictionary.

Each string points to a list of the documents that contain it; this list is called the posting list (Posting List).

With such an index, the stored information matches the information being searched for, which greatly speeds up the search.

For example, to find documents that contain both the string "Lucene" and the string "Solr", we only need the following steps:

1. Fetch the list of documents containing the string "Lucene".

2. Fetch the list of documents containing the string "Solr".

3. Merge the two linked lists to find the files that contain both "Lucene" and "Solr".
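The merge in step 3 can be sketched as a two-pointer walk over two posting lists, assuming each list is sorted by document id. The document ids below are invented for illustration.

```python
def intersect(a, b):
    """Merge two sorted posting lists, keeping ids present in both."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b, skip it
        else:
            j += 1  # b[j] cannot be in a, skip it
    return result


# Hypothetical posting lists for the two strings.
lucene_docs = [1, 3, 5, 8, 13]
solr_docs = [2, 3, 8, 21]

print(intersect(lucene_docs, solr_docs))
```

Each list is traversed at most once, so the cost depends on the length of the two posting lists rather than on the size of the whole collection.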

At this point, one might say that full-text search does speed up searching, but once the indexing process is added, the two together are not necessarily much faster than a sequential scan. Indeed, with the indexing process included, full-text search is not necessarily faster than a sequential scan, especially when the data volume is small. Creating an index over a large amount of data is also a slow process.

However, there is a difference between the two: a sequential scan must scan every time, whereas the index only needs to be created once. After that, each search skips index creation and simply searches the already created index.

This is one of the advantages of full-text search over sequential scanning: index once, use many times.

III. How to create an index

The index creation process in full-text search generally has the following steps:

Step One: Some original documents to be indexed (Document).

To make the index creation process easier to explain, here are two files as examples:

File One: Students should be allowed to go out with their friends, but not allowed to drink beer.

File Two: My friend Jerry went to school to see his students but found them drunk which is not allowed.

Step Two: Pass the original documents to the tokenizer component (Tokenizer).

The tokenizer component (Tokenizer) does the following (this process is called tokenizing):

1. Split the document into individual words.

2. Remove punctuation marks.

3. Remove stop words (Stop Word).

Stop words are some of the most common words in a language. Because they carry no special meaning, in most cases they cannot serve as search keywords, so they are removed when the index is created, which reduces the size of the index.

English stop words include "the", "a", "this", and so on.

The tokenizer component (Tokenizer) for each language has its own set of stop words.

The result produced by the tokenizer component (Tokenizer) is called a token (Token).
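A minimal sketch of these three tokenizer tasks follows. It assumes the two example files read "… not allowed to drink beer." and "… went to school to see his students …", and the `STOP_WORDS` set is an assumption chosen for this example only; it is not Lucene's stop word list.

```python
import re

# Assumed stop word set for this example only.
STOP_WORDS = {"should", "be", "to", "out", "with", "but", "not", "which", "is"}


def tokenize(text):
    # Tasks 1 and 2: keep runs of letters, which splits the text into
    # words and drops punctuation at the same time.
    words = re.findall(r"[A-Za-z]+", text)
    # Task 3: drop stop words (compared case-insensitively).
    return [word for word in words if word.lower() not in STOP_WORDS]


doc1 = ("Students should be allowed to go out with their friends, "
        "but not allowed to drink beer.")
doc2 = ("My friend Jerry went to school to see his students "
        "but found them drunk which is not allowed.")

print(tokenize(doc1))
print(tokenize(doc2))
```

Note that the tokens keep their original capitalization at this stage; lowercasing belongs to the next component.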

In our example, we get the following tokens (Token):

"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".

Step Three: Pass the resulting tokens (Token) to the language processing component (Linguistic Processor).

The language processing component (Linguistic Processor) mainly applies language-specific processing to the tokens (Token).

For English, the language processing component (Linguistic Processor) generally does the following:

1. Convert to lowercase (Lowercase).

2. Reduce the word to its root form, such as "cars" to "car". This operation is called stemming.

3. Transform the word into its root form, such as "drove" to "drive". This operation is called lemmatization.

Similarities and differences between stemming and lemmatization:

What they share: both stemming and lemmatization turn a word into its root form.

They differ in approach: stemming works by "reduction" ("cars" to "car", "driving" to "drive"), while lemmatization works by "transformation" ("drove" to "drive", "driving" to "drive").

They differ in algorithm: stemming mainly applies fixed rules to do the reduction, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", and turning "tional" into "tion". Lemmatization mainly relies on a stored dictionary to do the transformation: the dictionary holds mappings such as "driving" to "drive", "drove" to "drive", and "am, is, are" to "be", and a transformation is just a dictionary lookup.

Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both approaches yield the same result.
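The contrast between rule-based reduction and dictionary lookup can be sketched as follows. The four suffix rules and the small dictionary are taken from the examples in the text; the function names and the minimum-length guard are assumptions of this sketch, and real stemmers use far more rules.

```python
# Stemming: fixed suffix rules, tried in order (a toy subset of rules).
SUFFIX_RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("ing", "e"),
    ("s", ""),
]


def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # The length guard (an assumption) avoids reducing very short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


# Lemmatization: look the word up in a stored dictionary.
LEMMA_DICTIONARY = {
    "driving": "drive",
    "drove": "drive",
    "am": "be",
    "is": "be",
    "are": "be",
}


def lemmatize(word):
    return LEMMA_DICTIONARY.get(word, word)


for word in ["cars", "driving", "drove"]:
    print(word, "stem:", stem(word), "lemma:", lemmatize(word))
```

In this sketch "driving" reaches "drive" both ways, "cars" is only handled by the rules, and "drove" is only handled by the dictionary, which mirrors the overlap described above.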

The result produced by the language processing component (Linguistic Processor) is called a term (Term).

In our example, the terms (Term) obtained after language processing are as follows:

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".

It is precisely because of this language processing step that a search for "drove" can also find documents containing "drive".

Step Four: Pass the resulting terms (Term) to the index component (Indexer).

The index component (Indexer) mainly does the following:

1. Use the resulting terms (Term) to create a dictionary.

In our example, the dictionary is as follows:

Term	Document ID
student	1
allow	1
go	1
their	1
friend	1
allow	1
drink	1
beer	1
my	2
friend	2
jerry	2
go	2
school	2
see	2
his	2
student	2
find	2
them	2
drink	2
allow	2

2. Sort the dictionary in alphabetical order.

Term	Document ID
allow	1
allow	1
allow	2
beer	1
drink	1
drink	2
find	2
friend	1
friend	2
go	1
go	2
his	2
jerry	2
my	2
school	2
see	2
student	1
student	2
their	1
them	2

3. Merge identical terms (Term) into posting lists (Posting List).

In this table, there are a few definitions:

Document Frequency is the document frequency, indicating how many files in total contain this term (Term).

Frequency is the term frequency, indicating how many times this term (Term) appears in a given file.

So for the term (Term) "allow", two documents in total contain it, and the document list following the term has two entries. The first entry represents the first document containing "allow", document 1, in which "allow" appears 2 times. The second entry represents the second document containing "allow", document 2, in which "allow" appears 1 time.
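The three indexer tasks (pair each term with its document, sort, merge) can be sketched as below. The term lists assume the processed terms of the two example files, with "see" as the term between "school" and "his"; the field names `document_frequency` and `postings` are invented for this sketch.

```python
from itertools import groupby

# Assumed terms of the two example files after language processing.
doc_terms = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}


def build_index(doc_terms):
    # Tasks 1 and 2: list (term, document id) pairs, sorted alphabetically.
    pairs = sorted(
        (term, doc_id) for doc_id, terms in doc_terms.items() for term in terms
    )
    index = {}
    # Task 3: merge identical terms into one posting list each.
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        frequencies = {}
        for _, doc_id in group:
            frequencies[doc_id] = frequencies.get(doc_id, 0) + 1
        index[term] = {
            "document_frequency": len(frequencies),
            # Each posting is (document id, frequency in that document).
            "postings": sorted(frequencies.items()),
        }
    return index


index = build_index(doc_terms)
print(index["allow"])
```

For "allow" this gives a document frequency of 2 with postings for document 1 (frequency 2) and document 2 (frequency 1), matching the description above.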

At this point the index has been created, and we can use it to quickly find the documents we want.

Along the way we are pleasantly surprised to find that searching for "drive" also finds documents containing "driving", "drove", and "driven". In our index, "driving", "drove", and "driven" are all turned into "drive" by language processing. At search time, if you enter "driving", the query statement also goes through steps one to three here and becomes a query for "drive", so the documents you want can be found.

IV. How to search the index

At this point it seems we can announce: "We have found the documents we want."

But it is not over yet; finding the documents is only one aspect of full-text search. If only one or ten documents contain our query string, then we have indeed found them. But what if there are a thousand, or even tens of thousands? Which of them are the files you want most?

Open Google and suppose you want to find a job at Microsoft, so you enter "Microsoft job". You find that 22.6 million results are returned. What a big number! Suddenly, not finding anything is a problem, but finding too much is a problem too. Among so many results, how do we put the most relevant ones first?

Of course Google does this well, and you find jobs at Microsoft right away. Imagine how terrible it would be if the first few results were all "Microsoft does a good job at software industry ...".

How can we, like Google, find the results most relevant to the query among thousands of search results?

How do we determine the relevance between the documents found and the query statement?

This brings us back to our third question: how to search the index.

Searching is mainly divided into the following steps:

Step One: The user enters a query statement.

A query statement, like our ordinary language, has a certain syntax.

Different query statements have different syntaxes; SQL statements, for example, have their own syntax.

The syntax of the query statement varies with the implementation of the full-text retrieval system. The most basic operators are AND, OR, and NOT.

For example, the user enters the statement: lucene AND learned NOT hadoop.

This indicates that the user wants to find documents that contain lucene and learned but do not contain hadoop.

Step Two: Perform lexical analysis, syntax analysis, and language processing on the query statement.

Because the query statement has a syntax, it also needs lexical analysis, syntax analysis, and language processing.

1. Lexical analysis is mainly used to identify words and keywords.

In the example above, lexical analysis yields the words lucene, learned, and hadoop, and the keywords AND and NOT.

If an illegal keyword is found during lexical analysis, an error occurs. For example, in lucene AMD learned, AND is misspelled, so AMD takes part in the query as an ordinary word.
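Lexical analysis of such a query can be sketched as a simple classification of whitespace-separated pieces. The `lex` function and the `WORD`/`KEYWORD` labels are assumptions of this sketch, not the behaviour of any particular query parser.

```python
KEYWORDS = {"AND", "OR", "NOT"}


def lex(query):
    """Split a query into (kind, text) pairs: keywords versus ordinary words."""
    tokens = []
    for piece in query.split():
        kind = "KEYWORD" if piece in KEYWORDS else "WORD"
        tokens.append((kind, piece))
    return tokens


print(lex("lucene AND learned NOT hadoop"))
print(lex("lucene AMD learned"))
```

In the second call the misspelled AMD is not in the keyword set, so it is classified as an ordinary word, as described above.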

2. Syntax analysis mainly builds a syntax tree according to the syntax rules of the query statement.

If the query statement is found not to satisfy the syntax rules, an error is reported. For example, lucene NOT AND learned results in an error.

For the example above, the syntax tree formed by lucene AND learned NOT hadoop is as follows:
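One way to sketch syntax analysis in Python is shown below. It assumes a deliberately simple grammar in which words and keywords must alternate and operators are applied from left to right; the nested-tuple tree shape is an assumption of this sketch and not necessarily the tree drawn in the original figure.

```python
KEYWORDS = {"AND", "OR", "NOT"}


def parse(query):
    """Build a nested-tuple syntax tree: (operator, left subtree, right word)."""
    tokens = query.split()
    if not tokens or tokens[0] in KEYWORDS:
        raise SyntaxError("query must start with a word")
    tree = tokens[0]
    pos = 1
    while pos < len(tokens):
        operator = tokens[pos]
        if operator not in KEYWORDS:
            raise SyntaxError(f"expected AND, OR or NOT, got {operator!r}")
        if pos + 1 >= len(tokens) or tokens[pos + 1] in KEYWORDS:
            raise SyntaxError(f"{operator} must be followed by a word")
        # Left-to-right: the tree built so far becomes the left operand.
        tree = (operator, tree, tokens[pos + 1])
        pos += 2
    return tree


print(parse("lucene AND learned NOT hadoop"))

try:
    parse("lucene NOT AND learned")
except SyntaxError as error:
    print("syntax error:", error)
```

Here the tuple `("NOT", left, word)` is read as "documents matching `left` that do not contain `word`".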

3. Language processing is almost identical to the language processing in the indexing process.

For example, learned becomes learn.

After the second step, we have a syntax tree that has gone through language processing.

Step Three: Search the index to get the documents that match the syntax tree.

This step consists of several smaller steps:

1. First, in the inverted index, find the document lists for lucene, learn, and hadoop.

2. Next, merge the lists for lucene and learn to get the list of documents that contain both lucene and learn.

3. Then take the difference between this list and the document list for hadoop, removing the documents that contain hadoop, to get the list of documents that contain both lucene and learn but not hadoop.

4. This document list is the set of documents we are looking for.
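These smaller steps can be sketched with two list operations over sorted posting lists. The posting lists in `index` are invented for illustration, and both helper functions are assumptions of this sketch rather than Lucene code.

```python
def intersect(a, b):
    """Ids present in both sorted lists."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def difference(a, b):
    """Ids in sorted list a that do not appear in sorted list b."""
    result, j = [], 0
    for doc_id in a:
        while j < len(b) and b[j] < doc_id:
            j += 1
        if j == len(b) or b[j] != doc_id:
            result.append(doc_id)
    return result


# Hypothetical inverted index: term -> sorted document ids.
index = {
    "lucene": [1, 2, 4, 7, 9],
    "learn": [2, 4, 5, 9],
    "hadoop": [4, 6],
}

both = intersect(index["lucene"], index["learn"])
result = difference(both, index["hadoop"])
print(both, result)
```

Document 4 contains all three terms, so it survives the intersection but is removed by the difference.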

Step Four: Sort the results according to the relevance between the resulting documents and the query statement.

Although the previous step gave us the documents we wanted, the query results should be sorted by their relevance to the query statement, with the more relevant ones ranked higher.

How do we calculate the relevance between a document and a query statement?

We can treat the query statement as a short document too, and score (scoring) the relevance (relevance) between documents: the higher the score, the better the relevance, and the higher the document should rank.

So how do we score the relationship between documents?

This is not easy. First, let's take a look at how we judge the relationship between people.

First, a person has many elements, such as personality, beliefs, hobbies, clothing, height, and build.

Second, for the relationship between people, different elements differ in importance. Personality, beliefs, and hobbies may matter more; clothing, height, and build may matter less. So people with the same or similar personality, beliefs, and hobbies become good friends more easily, while people who differ in clothing, height, and build can still be good friends.

Therefore, to judge the relationship between two people, first find out which elements matter most to the relationship, such as personality, beliefs, and hobbies. Second, judge how the two people relate on these elements: for example, one person is cheerful and the other is outgoing, one believes in Buddhism and the other believes in God, one likes playing basketball and the other likes playing football. We find that both are positive in personality, kind in their beliefs, and fond of sports, so the two should get along well.

Now let's look at the relationship between companies.

First, a company is made up of many people, such as the general manager, managers, the chief technology officer, ordinary employees, security guards, and doormen.

Second, for the relationship between companies, different people differ in importance. The general manager, managers, and chief technology officer may matter more; ordinary employees, security guards, and doormen may matter less. So if the general managers, managers, and chief technology officers of two companies get along well, the two companies are likely to have a good relationship. However, even if an ordinary employee has a bitter feud with an ordinary employee of the other company, it is unlikely to affect the relationship between the two companies.

Therefore, to judge the relationship between two companies, first find out which people matter most to it, such as the general manager, managers, and chief technology officer. Second, judge the relationships between these people: say the general managers of the two companies were once classmates, the managers come from the same hometown, and the chief technology officers were once startup partners. We find that the general managers, managers, and chief technology officers of the two companies all get along well, so the two companies should get along well too.

After analyzing these two kinds of relationships, let's look at how to judge the relationship between documents.

First, a document consists of many terms (Term), such as search, lucene, full-text, this, a, what, and so on.

Second, for the relationship between documents, different terms differ in importance. For this document, for example, search, lucene, and full-text are relatively important, while this, a, and what may be relatively unimportant.
