A summary of Lucene learning: The Fundamentals of full-text retrieval


First, overview

According to the definition at http://lucene.apache.org/java/docs/index.html:

Lucene is an efficient, Java-based full-text retrieval library.

So before understanding Lucene, you need to spend some time understanding full-text retrieval.

So what exactly is full-text retrieval? Let's start with the data in our lives.

The data in our lives is generally divided into two categories: structured data and unstructured data.

    • Structured data: data with a fixed format or limited length, such as databases, metadata, and so on.
    • Unstructured data: data with variable length or no fixed format, such as email, Word documents, and so on.

Of course, some places also mention a third category, semi-structured data, such as XML and HTML, which can be processed as structured data when necessary, or can have its plain text extracted and be treated as unstructured data.

Unstructured data is also called full-text data.

According to this classification of data, search is also divided into two kinds:

    • Search of structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows Search to search by file name, type, or modification time.
    • Search of unstructured data: for example, Windows Search can also search file contents, the grep command under Linux does the same, and Google and Baidu search huge amounts of content data.

There are two main ways to search unstructured data, that is, full-text data:

One is the sequential scan method (serial scanning): to find files containing a certain string, you look at the documents one by one. For each document, you scan from beginning to end; if the document contains the string, it is one of the files we are looking for. Then you move on to the next file, until all files have been scanned. Windows Search can search file contents this way, just rather slowly. If you have an 80GB hard drive and want to find a file on it containing a certain string, it could easily take several hours. The grep command under Linux works this way too. You may think this method is rather primitive, but for a small number of files it is the most direct and convenient. For a large number of files, however, this approach is slow.

One might say: sequential scanning of unstructured data is slow, while searching structured data is relatively fast (because structured data has a certain structure, a search algorithm can exploit that structure to speed things up). Then why not find a way to give unstructured data a certain structure?

This idea is very natural, and it constitutes the basic idea of full-text retrieval: extract part of the information from unstructured data, reorganize it so that it has a certain structure, and then search this data, which now has a certain structure, thereby achieving relatively fast search.

This part of the information, extracted from unstructured data and then reorganized, is what we call the index.

This is rather abstract; a few examples make it easy to understand. Take a dictionary: its pinyin table and radical table are the dictionary's index, while the explanation of each character is unstructured. If a dictionary had no pinyin table and no radical table, to find a character in the vast Hanyu Da Zidian you could only scan sequentially. However, some information about a character can be extracted for structured processing, such as its pronunciation: pronunciations are comparatively structured, since the initials and finals that make them up are few enough to enumerate. So the pronunciations are listed in a certain order, and each pronunciation points to the page number of the character's detailed explanation. We search the structured pinyin to find a pronunciation, and then, following the page number it points to, we locate our unstructured data, that is, the explanation of the character.

This process of first building an index and then searching the index is called full-text search (full-text search).

The following picture is from Lucene in Action, but it does not merely describe Lucene's retrieval process; rather, it describes the general process of full-text retrieval.

Full-text search consists of two processes: index creation (indexing) and index search (search).

    • Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index.
    • Index search: the process of receiving the user's query request, searching the created index, and returning the results.

Therefore, there are three important problems in full-text search:

1. What does the index store? (Index)

2. How do I create an index? (indexing)

3. How do I search for an index? (Search)

Let's take a look at each of these issues in order.

Second, what does the index store?

What exactly does the index need to save?

First, let's look at why sequential scanning is slow:

The fundamental reason is that the information we want to search for is inconsistent with the way the information is stored in unstructured data.

The information stored in unstructured data is the strings each file contains; that is, given a file, it is relatively easy to obtain its strings: a mapping from files to strings. The information we want to search for, however, is which files contain a given string; that is, given a string, we want the files: a mapping from strings to files. The two are opposite. Therefore, if the index can store the mapping from strings to files, it will greatly improve the search speed.

Since the mapping from strings to files is the reverse of the file-to-string mapping, an index that stores this information is called a reverse index, or inverted index.

The information stored in an inverted index is generally as follows:

Suppose there are 100 documents in my document collection. For convenience, we number the documents from 1 to 100 and obtain the following structure:
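
Roughly, the structure looks like this (an illustrative sketch; these terms and document numbers are made up, since the original figure is not reproduced):

    Dictionary        Posting List
    lucene      ->  3 -> 7 -> 15 -> 30 -> ...
    solr        ->  7 -> 15 -> 21 -> 50 -> ...
    java        ->  1 -> 3 -> 7 -> 9 -> ...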

The left-hand side is a series of strings, called the dictionary.

Each string points to a linked list of documents (document) containing that string; this linked list is called a posting list.

With the index, the stored information is consistent with the information being searched for, which can greatly speed up searching.

For example, if we are looking for documents that contain both the string "Lucene" and the string "SOLR", we only need the following steps:

1. Retrieve the linked list of documents containing the string "Lucene".

2. Retrieve the linked list of documents containing the string "SOLR".

3. Merge the two linked lists to find the files that contain both "Lucene" and "SOLR".
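
Because both posting lists are kept sorted by document number, this merge can be done in a single linear pass. A minimal Java sketch (illustrative only; this is not Lucene's actual implementation):

    import java.util.ArrayList;
    import java.util.List;

    public class PostingOps {
        // Intersect two posting lists that are sorted by document id:
        // the result is the documents that appear in both lists.
        static List<Integer> intersect(List<Integer> a, List<Integer> b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.size() && j < b.size()) {
                int cmp = Integer.compare(a.get(i), b.get(j));
                if (cmp == 0) {            // document contains both terms
                    result.add(a.get(i));
                    i++;
                    j++;
                } else if (cmp < 0) {
                    i++;                   // advance the list with the smaller id
                } else {
                    j++;
                }
            }
            return result;
        }

        public static void main(String[] args) {
            List<Integer> lucene = List.of(3, 7, 15, 30);
            List<Integer> solr = List.of(7, 15, 21, 50);
            System.out.println(intersect(lucene, solr)); // prints [7, 15]
        }
    }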

Seeing this, one might say that full-text search does speed up searching, but once the indexing process is added in, the total may not be much faster than sequential scanning. Indeed, including the indexing process, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small. And creating an index on a very large amount of data is itself a slow process.

However, there is a difference between the two: sequential scanning has to scan everything every time, whereas the index needs to be created only once and then serves once and for all. Each subsequent search does not go through the index creation process again; it only searches the already-created index.

This is one of the advantages of full-text search over sequential scanning: index once, use many times.

Third, how to create an index

The index creation process of full-text retrieval generally has the following steps:

The first step: prepare some original documents (Document) to be indexed.

To conveniently illustrate the index creation process, here we use two files as examples:

File one: Students should be allowed to go out with their friends, but not allowed to drink beer.

File two: My friend Jerry went to school to see his students but found them drunk which is not allowed.

The second step: pass the original documents to the tokenizer component (Tokenizer).

The tokenizer (Tokenizer) does the following things (this process is called Tokenize):

1. Divide the document into individual words.

2. Remove punctuation.

3. Remove stop words (Stop word).

Stop words (Stop word) are the most common words in a language. Because they have no special meaning, in most cases they cannot serve as search keywords, so when creating an index these words are removed, which also reduces the size of the index.

English stop words (Stop word) include, for example: "the", "a", "this", and so on.

For each language, the tokenizer (Tokenizer) has its own collection of stop words (Stop word).

The result obtained after tokenization (Tokenizer) is called a token (Token).

In our example, we get the following tokens (Token):

"Students", "Allowed", "go", "their", "Friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", " See "," he "," students "," found "," them "," drunk "," allowed ".

The third step: pass the resulting tokens (Token) to the language processing component (Linguistic Processor).

The language processing component (Linguistic Processor) mainly performs language-specific processing on the resulting tokens (Token).

For English, the language processing component (Linguistic Processor) generally does the following:

1. Change to lowercase (Lowercase).

2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.

3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.

Similarities and differences between stemming and lemmatization:

    • Similarity: both stemming and lemmatization reduce words to a root form.
    • The two differ in manner:
      • Stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive".
      • Lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive".
    • The two differ in algorithm (a sketch of both follows this list):
      • Stemming mainly adopts fixed rules for the reduction, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", and turning "tional" into "tion".
      • Lemmatization mainly relies on a stored dictionary for the transformation. For example, the dictionary contains mappings from "driving" to "drive", from "drove" to "drive", and from "am, was, is" to "be"; to transform a word, you simply look it up in the dictionary.
    • Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both methods achieve the same transformation.
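
Here is a minimal Java sketch of the two approaches (the rules and the dictionary are deliberately tiny and illustrative; a real stemmer such as the Porter stemmer has many more rules and conditions):

    import java.util.Map;

    public class RootForms {
        // Stemming: a few fixed suffix-stripping rules.
        static String stem(String word) {
            if (word.endsWith("ational")) return word.substring(0, word.length() - 7) + "ate";
            if (word.endsWith("tional"))  return word.substring(0, word.length() - 6) + "tion";
            if (word.endsWith("ing"))     return word.substring(0, word.length() - 3) + "e"; // driving -> drive
            if (word.endsWith("s"))       return word.substring(0, word.length() - 1);       // cars -> car
            return word;
        }

        // Lemmatization: look the word up in a hand-maintained dictionary of forms.
        static final Map<String, String> LEMMAS = Map.of(
                "driving", "drive", "drove", "drive",
                "am", "be", "was", "be", "is", "be");

        static String lemmatize(String word) {
            return LEMMAS.getOrDefault(word, word);
        }

        public static void main(String[] args) {
            System.out.println(stem("cars"));       // car
            System.out.println(stem("driving"));    // drive
            System.out.println(lemmatize("drove")); // drive
        }
    }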

The result of the language processing component (Linguistic Processor) is called a term (Term).

In our example, the resulting terms (Term) are as follows:

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".

It is precisely because of the language processing step that a search for "drove" can also turn up documents containing "drive".

The fourth step: pass the resulting terms (Term) to the index component (Indexer).

The index component (Indexer) mainly does the following things:

1. Use the resulting terms (Term) to create a dictionary.

In our example, the dictionary is as follows:

Term      Document ID
student   1
allow     1
go        1
their     1
friend    1
allow     1
drink     1
beer      1
my        2
friend    2
jerry     2
go        2
school    2
see       2
his       2
student   2
find      2
them      2
drink     2
allow     2

2. Sort the dictionary in alphabetical order.

Term      Document ID
allow     1
allow     1
allow     2
beer      1
drink     1
drink     2
find      2
friend    1
friend    2
go        1
go        2
his       2
jerry     2
my        2
school    2
see       2
student   1
student   2
their     1
them      2

3. Merge identical terms (Term) into document posting lists (Posting list).
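
For our two example files, the merged structure looks roughly like this (a reconstruction, since the original figure is not reproduced; DF = Document Frequency, and each posting is written as document id : frequency):

    Term      DF    Postings
    allow     2     (1:2) -> (2:1)
    beer      1     (1:1)
    drink     2     (1:1) -> (2:1)
    friend    2     (1:1) -> (2:1)
    student   2     (1:1) -> (2:1)
    ...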

In this table, there are several definitions:

    • Document Frequency (DF): the document frequency, indicating how many files in total contain the term (Term).
    • Frequency (TF): the term frequency, indicating how many occurrences of the term a given file contains.

So for word "allow", a total of two documents contain this word (term), so that there are two items in the list of documents that follow the words (term), the first of which represents the first document containing "Allow", the 1th document, where "allow" appears 2 times, The second item represents the second document that contains "allow", which is document 2nd, where "Allow" appears 1 times.

At this point the index has been created, and with it we can quickly find the documents we want.

And in the process we are pleasantly surprised to find that searches for "drive", "driving", "drove", and "driven" can all be matched. Because in our index "driving", "drove", and "driven" all become "drive" after language processing, when you search for "driving", the input query string also goes through steps one to three here and becomes the query "drive", so the desired documents can be found.

Fourth, how to search the index?

At this point it seems we can declare, "We have found the documents we want."

But the matter is not over, and this is only one aspect of full-text search. Isn't it? If only one or ten documents contain our query string, we have indeed found them. But what if there are a thousand, or even thousands, of results? Which one is the file you want most?

Open Google. Say you want to find a job at Microsoft, so you enter "Microsoft job", and you find 22,600,000 results returned in total. A huge number. Suddenly you realize that finding too few results is a problem, and finding too many is also a problem. Among so many results, how do you put the most relevant ones first?

Of course Google does a great job; you find jobs at Microsoft right away. Imagine how awful it would be if the first few results were all "Microsoft does a good job at software industry...".

How do you, like Google, find the most relevant documents among thousands of search results?

How do you determine the relevance between a document and a query statement?

This brings us back to our third question: how to search the index?

Search is mainly divided into the following steps:

The first step: The user enters a query statement.

A query statement, like our ordinary language, also has a certain syntax.

Different query statements have different syntax, just as SQL statements have a certain syntax.

The syntax of query statements varies with the implementation of the full-text retrieval system. The most basic constructs are, for example: AND, OR, and NOT.

For example, the user enters the statement: lucene AND learned NOT hadoop.

This means the user is looking for documents that contain both "lucene" and "learned" but not "hadoop".

The second step: perform lexical analysis, syntactic analysis, and language processing on the query statement.

Because the query statement has syntax, it too must undergo lexical analysis, syntactic analysis, and language processing.

1. Lexical analysis is mainly used to identify words and keywords.

In the above example, after lexical analysis, the words are lucene, learned, and hadoop, and the keywords are AND and NOT.

If an illegal keyword appears during lexical analysis, an error occurs. For example, in "lucene AMD learned", because AND is misspelled, AMD is treated as an ordinary word participating in the query.

2. Syntactic analysis mainly forms a syntax tree according to the syntax rules of the query statement.

If the query statement does not satisfy the syntax rules, an error is reported. For example, "lucene NOT AND learned" produces an error.

For the above example, lucene AND learned NOT hadoop forms the following syntax tree:
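
In plain text, the tree looks roughly like this (one plausible rendering, since the original figure is not reproduced; the query reads as (lucene AND learned) NOT hadoop):

                  NOT
                 /   \
               AND    hadoop
              /   \
         lucene   learned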

3. Language processing is almost the same as the language processing in the indexing process.

For example, "learned" becomes "learn", and so on.

After the second step, we get a language-processed syntax tree.

The third step: search the index and obtain documents conforming to the syntax tree.

This step has several small steps:

    1. First, in the inverted index table, find the posting lists of documents containing lucene, learn, and hadoop, respectively.
    2. Next, merge the lists for lucene and learn, obtaining a linked list of documents containing both lucene and learn.
    3. Then take the difference of this list against the posting list for hadoop, removing documents that contain hadoop, and obtain a linked list of documents that contain both lucene and learn but do not contain hadoop (see the sketch after this list).
    4. This linked list of documents is the documents we are looking for.
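
The difference step in point 3 can be done with the same kind of linear merge over sorted document ids as the earlier intersection (again an illustrative Java sketch, not Lucene's implementation):

    import java.util.ArrayList;
    import java.util.List;

    public class PostingDiff {
        // Documents in `a` whose ids do not appear in `b` (the NOT step);
        // both posting lists are sorted by document id.
        static List<Integer> difference(List<Integer> a, List<Integer> b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.size()) {
                if (j >= b.size() || a.get(i) < b.get(j)) {
                    result.add(a.get(i));                          // keep: not in b
                    i++;
                } else if (a.get(i).intValue() == b.get(j).intValue()) {
                    i++;                                           // drop: appears in b
                    j++;
                } else {
                    j++;                                           // advance b
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // documents with lucene AND learn, minus documents with hadoop
            System.out.println(difference(List.of(2, 5, 9, 12), List.of(5, 12))); // [2, 9]
        }
    }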

The fourth step: sort the results according to the relevance between the obtained documents and the query statement.

Although in the previous step we obtained the desired documents, the query results should be sorted by relevance to the query statement: the more relevant, the higher they rank.

How do we calculate the relevance between a document and a query statement?

We can regard the query statement as a short document and score (scoring) the relevance between documents; documents with high relevance scores should rank first.

So how do we score the relationship between documents?

This is not an easy thing. First, let's look at the relationship between people.

First, looking at one person: a person has many elements, such as personality, beliefs, hobbies, clothing, height, weight, and so on.

Second, for the relationship between people, different elements have different importance: personality, beliefs, and hobbies may be important, while clothing, height, and weight may not be. So people with the same or similar personality, beliefs, and hobbies are more likely to become good friends, while people who differ in clothing, height, or weight can still become good friends.

Thus, to judge the relationship between people, first find out which elements matter most to that relationship, such as personality, beliefs, and hobbies. Then judge the relationship between the two people's corresponding elements: say, one has a cheerful personality and the other an outgoing one; one believes in Buddhism and the other in God; one likes basketball and the other football. We find that the two are both very positive in personality, devout in belief, and fond of sports, so the two people's relationship should be very good.

Next, let's look at the relationship between companies.

First, looking at one company: a company has many people, such as the general manager, managers, the chief technology officer, ordinary staff, security guards, doormen, and so on.

Second, for the relationship between companies, different people have different importance: the general manager, managers, and chief technology officer may be more important, while ordinary staff, security guards, and doormen may be less so. So if the relationships between the two companies' general managers, managers, and chief technology officers are good, the two companies tend to have a good relationship. However, even if one ordinary employee has a deep feud with an ordinary employee of the other company, it is unlikely to affect the relationship between the two companies.

Thus, to judge the relationship between companies, first find out which people matter most to that relationship, such as the general manager, managers, and chief technology officer. Then judge the relationships between these people: say, the two general managers were classmates, the managers are from the same hometown, and the chief technology officers were once startup partners. We find that the two companies' general managers, managers, and chief technology officers all have good relationships, so the two companies should get along well.

Having analyzed these two relationships, let's look at how to determine the relationship between documents.

First, a document is composed of many terms (Term), such as search, lucene, full-text, this, a, what, and so on.

Second, for the relationship between documents, different terms have different importance. For instance, for this document, search, lucene, and full-text are relatively important, while this, a, and what may be relatively unimportant. So if two documents both contain search, lucene, and full-text, the two documents are highly relevant to each other; whereas if one document contains this, a, and what and the other does not, it hardly affects their relevance.

Thus, to judge the relationship between documents, first find out which terms (Term) matter most to that relationship, such as search, lucene, and full-text. Then judge the relationship between these terms (Term).

The process of finding out how important a term (Term) is to a document is called the process of computing the term's weight (Term Weight).

Computing a term weight (Term Weight) involves two parameters: the first is the term (Term), and the second is the document (Document).

The term weight (Term Weight) indicates how important the term (Term) is in the document: the more important the term, the greater its weight, and therefore the larger its role in computing the relevance between documents.

The process of judging the relationship between terms (Term) to obtain document relevance applies an algorithm called the Vector Space Model (VSM).

Here's a closer look at these two processes:

1. The process of computing term weights (Term Weight).

There are two main factors that affect the importance of a term (Term) in a document:

    • Term Frequency (TF): how many times the term (Term) appears in this document. The larger the TF, the more important the term.
    • Document Frequency (DF): how many documents contain the term (Term). The larger the DF, the less important the term.

Is this easy to understand? The more often a term (Term) appears in a document, the more important it is to that document; for example, if the word "search" appears many times in this document, it suggests that this document is mainly about search. But in an English document, "this" also appears frequently; does that mean it is more important? No, and this is adjusted by the second factor: the more documents contain a term (Term), the more common the term is, and the less able it is to distinguish one document from another, so the less important it is.

It's like the technologies we programmers learn. For a programmer personally, the deeper the mastery of a technology, the better (the deeper the mastery, the more time spent on it, the larger the TF), and the more competitive in job hunting. For a technology itself, however, the fewer programmers who master it, the better (the smaller the DF), and the more competitive those who do master it are. A person's value lies in irreplaceability; that's the idea here.

The idea is clear; now let's look at the formula:
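
A typical form of this formula (one common variant, since the original figure is not reproduced; real systems differ in details) is:

    w(t, d) = tf(t, d) × log( n / df(t) )

where w(t, d) is the weight of term t in document d, tf(t, d) is how many times t appears in d, n is the total number of documents, and df(t) is how many documents contain t.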

This is only a simple, typical implementation of the term weight (Term Weight) formula. Implementers of full-text retrieval systems each have their own implementation, and Lucene's differs slightly from this.

2. The process of judging the relationship between terms (Term) to obtain document relevance, namely the Vector Space Model algorithm (VSM).

We regard a document as a series of terms (Term), each of which has a weight (Term Weight); different terms (Term), according to their weights in the document, affect the document's relevance score.

So we regard all the term weights (Term Weight) of a document as forming a vector:

Document = {term1, term2, ..., term N}

Document Vector = {weight1, weight2, ..., weight N}

Similarly, we regard the query statement as a simple document and also represent it as a vector:

Query = {term1, term2, ..., term N}

Query Vector = {weight1, weight2, ..., weight N}

We place all the retrieved document vectors and the query vector in an N-dimensional space, where each term (Term) is one dimension.

We hold that the smaller the angle between two vectors, the greater the relevance.

So we compute the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.

One might ask: query statements are generally very short and contain very few terms (Term), so the query vector's dimension should be very small, while a document is long and contains many terms (Term), so the document vector's dimension should be large. How can both be of dimension N as in the figure?

Here, since we want to put them into the same vector space, their dimensions must naturally be the same; when they differ, we take the union of the two sets of terms, and if a vector does not contain a given term (Term), its weight (Term Weight) there is 0.
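
Under this zero-padding convention, the scoring can be sketched in Java as follows (illustrative only; Lucene's actual scorer is more elaborate):

    import java.util.Map;

    public class VsmScore {
        // term -> weight; a term missing from a map implicitly has weight 0,
        // which implements the zero-padding described above.
        static double score(Map<String, Double> query, Map<String, Double> doc) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (Map.Entry<String, Double> e : query.entrySet()) {
                qNorm += e.getValue() * e.getValue();
                dot   += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
            }
            for (double w : doc.values()) dNorm += w * w;
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }
    }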

The relevance scoring formula is as follows:
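
The original figure is not reproduced here; the standard cosine formula it depicts is:

    score(q, d) = cos(θ) = (Vq · Vd) / (|Vq| × |Vd|)
                         = Σ (wq,i × wd,i) / ( sqrt(Σ wq,i²) × sqrt(Σ wd,i²) )

where the sums run over all N terms, and wq,i and wd,i are the weights of term i in the query vector and document vector, respectively.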

For example, suppose the query statement has 11 terms (Term), and three documents are retrieved in total. Their respective weights (Term Weight) are shown in the following table.

      T1     T2     T3     T4     T5     T6     T7     T8     T9     T10    T11
D1    0      0      .477   0      .477   .176   0      0      0      .176   0
D2    0      .176   0      .477   0      0      0      0      .954   0      .176
D3    0      .176   0      0      0      .176   0      0      0      .176   .176
Q     0      0      0      0      0      .176   0      0      .477   0      .176

From these, the relevance scores of the three documents to the query statement are calculated as follows:
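
As a worked example using the weights in the table above (the arithmetic only; the original figure with the full calculation is not reproduced), the score for document two is:

    score(Q, D2) = (.477 × .954 + .176 × .176)
                 / ( sqrt(.176² + .477² + .176²) × sqrt(.176² + .477² + .954² + .176²) )
                 ≈ 0.486 / (0.538 × 1.095)
                 ≈ 0.82

The scores for documents one and three are computed the same way.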

So document two has the highest relevance and is returned first, followed by document one, and finally document three.

So far, we can find the documents we want most.

Having said so much, we have not actually touched Lucene yet; all of this is just the basic theory of information retrieval (Information Retrieval). But when we look at Lucene, we will find that Lucene is precisely a practical realization of this basic theory. Therefore, in later articles analyzing Lucene, you will often see the above theory applied in Lucene.

Before entering Lucene, here is a summary of the index creation and search processes described above.

The following figure is adapted from the article "Open source full-text search engine Lucene" at http://www.lucene.com.cn/about.htm.

1. Indexing process:

1) There are a series of files to be indexed.

2) The files to be indexed go through tokenization and language processing, forming a series of terms (Term).

3) Through index creation, a dictionary and an inverted index table are formed.

4) Through index storage, the index is written to the hard disk.

2. Search process:

a) The user enters a query statement.

b) The query statement goes through lexical analysis, syntactic analysis, and language processing, producing a series of terms (Term).

c) Through syntactic analysis, a query tree is obtained.

d) Through index storage, the index is read into memory.

e) The query tree is used to search the index: the posting list of each term (Term) is fetched, and the lists are intersected and differenced to obtain the result documents.

f) The result documents are sorted by relevance to the query.

g) The results are returned to the user.
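
Putting the two processes together, here is a minimal end-to-end sketch using Lucene's own API (assuming a recent Lucene version, roughly 5.x or later, with lucene-core and lucene-queryparser on the classpath; names and signatures vary slightly across versions):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class HelloLucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("index"));

            // 1. Indexing: analyze the document and write the inverted index to disk.
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
            Document doc = new Document();
            doc.add(new TextField("content",
                    "Students should be allowed to go out with their friends, "
                    + "but not allowed to drink beer.", Field.Store.YES));
            writer.addDocument(doc);
            writer.close();

            // 2. Searching: parse the query, search the index; results come back ranked.
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("friends AND beer");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(hit.doc + " score=" + hit.score);
            }
            reader.close();
        }
    }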

