Getting Started with Lucene

1. Overview

According to the definition at http://lucene.apache.org/java/docs/index.html:

"Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform."

Lucene is an efficient, Java-based full-text retrieval library.

So before diving into Lucene, it is worth taking a moment to understand full-text search.

What, then, is full-text search? Let's start with the data in our lives.

The data in our lives falls into two broad categories: structured data and unstructured data.

Structured data: data with a fixed format or limited length, such as database records or metadata.

Unstructured data: data with no fixed format or of indefinite length, such as email or Word documents.

Some places also mention a third category, semi-structured data, such as XML and HTML, which can be processed according to its structure when necessary, or have its plain text extracted and treated as unstructured data.

Unstructured data is also called full-text data.

Following this classification of data, search also divides into two kinds:

Search over structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows Search to find files by name, type, or modification time.

Search over unstructured data: for example, Windows Search can also search file contents, as can the grep command under Linux; at a larger scale, Google and Baidu search huge volumes of content data.

There are two main methods for searching unstructured data, that is, full-text data:

One is sequential scanning (serial scanning): to find files containing a certain string, you look at each document in turn, from beginning to end. If a document contains the string, it is one of the files we want; then you move on to the next file, until every file has been scanned. Windows Search can search file contents this way, just rather slowly: if you wanted to find a file containing a given string on an 80GB hard drive, it could easily take hours. The grep command under Linux works the same way. This method may seem primitive, but for a small number of files it is the most direct and convenient approach. For a large number of files, however, it is slow. A minimal sketch of such a scan follows.
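
As a point of reference, a sequential scan is trivially easy to write. Here is a minimal Java sketch (the directory path and search string are hypothetical); it reads every file and checks for the substring, which is exactly why it degrades linearly with the amount of data:

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class SerialScan {
    public static void main(String[] args) throws IOException {
        String needle = "lucene";                      // hypothetical search string
        Path root = Paths.get("/data/docs");           // hypothetical directory
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> {
                     try {
                         // Read the whole file and scan it end to end.
                         return new String(Files.readAllBytes(p)).contains(needle);
                     } catch (IOException e) {
                         return false;                 // skip unreadable files
                     }
                 })
                 .forEach(System.out::println);
        }
    }
}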

One might say: if sequential scanning over unstructured data is slow, while searching structured data is relatively fast (because structured data has a certain structure, a suitable search algorithm can speed things up), then why not find a way to give unstructured data some structure too?

This idea is quite natural, and it constitutes the basic idea of full-text retrieval: extract part of the information from the unstructured data, reorganize it so that it has a certain structure, and then search this now-structured data, thereby achieving relatively fast search.

This part of the information, extracted from unstructured data and then reorganized, is what we call the index.

This is rather abstract, so a few examples make it easier to understand. Take a dictionary: its pinyin table and radical lookup table are the dictionary's index, while the explanation of each character is the unstructured data. If a dictionary had no pinyin table and no radical lookup table, finding a character in the vast Hanyu Da Zidian could only be done by sequential scanning. But some information about each character can be extracted and structured, for example its pronunciation: pronunciations are relatively structured, since there are only a limited number of initials and finals to enumerate. So the pronunciations are listed in a certain order, and each pronunciation points to the page number of that character's detailed explanation. We search the structured pinyin to find the pronunciation, then follow the page number it points to, and thus locate our unstructured data, namely the explanation of the character.

The process of first creating an index and then searching that index is called full-text search.

The following figure, taken from Lucene in Action, describes not merely Lucene's retrieval process, but the general process of full-text retrieval.

Full-text search involves two processes: index creation (indexing) and index search (searching).

Index creation: the process of extracting information from all the structured and unstructured data in the real world and creating an index from it.

Index search: the process of receiving the user's query request, searching the created index, and returning the results.

Therefore, there are three important problems in full-text search:

1. What exactly is stored in the index? (Index)

2. How is an index created? (Indexing)

3. How is an index searched? (Search)

Let's take a look at each of these issues in order.

2. What is stored in the index?

What exactly does the index need to save?

First, let's see why sequential scans are slow:

The root cause is that the information we want to search for differs from the information stored in the unstructured data.

The information stored in unstructured data is, for each file, the strings that the file contains; that is, given a file, it is easy to find its strings: a mapping from file to strings. The information we want to search for is which files contain a given string; that is, given a string, we want the files: a mapping from string to files. The two are exactly opposite. Therefore, if the index could store the mapping from string to files, it would greatly improve the search speed.

Since the mapping from string to files is the reverse of the file-to-string mapping, the index that stores this information is called an inverted index.

The information stored in an inverted index generally looks like this:

Suppose there are 100 documents in my document collection. For convenience, we number the documents from 1 to 100 and obtain the following structure:

The left-hand side holds a series of strings, called the dictionary.

Each string points to a linked list of documents (Document) containing that string; this linked list is called the posting list (inverted list).

With this index, the stored information matches the information being searched for, which can greatly speed up searching.
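
As a concrete illustration, here is a minimal sketch of this dictionary-plus-posting-list structure in plain Java (the terms and document numbers are made up for illustration; a real Lucene index is far more compact and lives on disk):

import java.util.*;

public class InvertedIndexShape {
    public static void main(String[] args) {
        // Dictionary: sorted terms, each pointing to a posting list of document IDs.
        Map<String, List<Integer>> index = new TreeMap<>();
        index.put("lucene", Arrays.asList(3, 7, 15, 30, 35, 67, 99));   // hypothetical
        index.put("solr",   Arrays.asList(7, 15, 22, 35, 68, 99));      // hypothetical

        // Given a term, the documents containing it are one lookup away.
        System.out.println("lucene -> " + index.get("lucene"));
    }
}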

For example, if we are looking for a document that contains both the string "Lucene" and the string "SOLR", we only need the following steps:

1. Take out the linked list of documents containing the string "Lucene".

2. Take out the linked list of documents containing the string "SOLR".

3. Merge the two linked lists to find the files that contain both "Lucene" and "SOLR".
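
Because posting lists are kept sorted by document ID, the merge in step 3 is a linear two-pointer walk. A minimal sketch (the method name and sample data are my own):

import java.util.*;

public class PostingIntersect {
    // Intersect two posting lists sorted by document ID.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }
            else if (da < db) i++;     // advance the list with the smaller ID
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> lucene = Arrays.asList(3, 7, 15, 30, 35, 67, 99); // hypothetical
        List<Integer> solr   = Arrays.asList(7, 15, 22, 35, 68, 99);    // hypothetical
        System.out.println(intersect(lucene, solr));  // [7, 15, 35, 99]
    }
}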

Seeing this, one might say that full-text search does speed up searching, but once you add in the indexing process, the total is not necessarily much faster than a sequential scan. Indeed, counting the indexing process, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small; and creating an index over a very large amount of data is itself a slow process.

However, there is a difference between the two: sequential scanning must scan everything every time, whereas the index only needs to be created once and then serves every subsequent search. Each search skips the index creation process and simply searches the already-created index.

This is also one of the advantages of full-text search over sequential scanning: index once, use many times.

3. How to create an index

The index creation process of full-text retrieval generally consists of the following steps:

Step One: some original documents to be indexed (Document).

To make the index creation process easier to explain, two files are used as examples here:

File one: Students should be allowed to go out with their friends, but not allowed to drink beer.

File two: My friend Jerry went to school to see his students but found them drunk which is not allowed.

Step Two: Pass the original documents to the tokenizer (Tokenizer).

The tokenizer (Tokenizer) does several things (this process is called tokenization):

1. Split the document into individual words.

2. Remove punctuation.

3. Remove the Stop word (stop word).

So-called stop words (Stop word) are the most common words in a language. Because they carry no special meaning, in most cases they cannot serve as search keywords, so when creating an index these words are removed, which reduces the size of the index.

English stop words (Stop word) include, for example: "the", "a", "this", and so on.

For each language, the tokenizer (Tokenizer) has its own collection of stop words (Stop word).

The results produced by the tokenizer (Tokenizer) are called tokens (Token).

In our example, we get the following tokens (Token):

"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".

Step Three: Pass the resulting tokens (Token) to the language processing component (Linguistic Processor).

The language processing component (Linguistic Processor) mainly performs language-specific processing on the resulting tokens (Token).

For English, the language processing component (Linguistic Processor) generally does the following:

1. Change to lowercase (lowercase).

2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.

3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.

Similarities and differences between stemming and lemmatization:

What is the same: both stemming and lemmatization turn words into a root form.

How they differ: stemming uses a "reduction" approach: "cars" to "car", "driving" to "drive"; lemmatization uses a "transformation" approach: "drove" to "drive", "driving" to "drive".

Their algorithms differ: stemming mainly applies fixed rules to perform the reduction, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", turning "tional" into "tion". Lemmatization mainly relies on a stored dictionary to perform the transformation. For example, the dictionary contains mappings such as "driving" to "drive", "drove" to "drive", and "am, was, is" to "be"; to transform a word, you simply look it up in the dictionary.

Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both methods achieve the same transformation.

The results produced by the language processing component (Linguistic Processor) are called terms (Term).

In our case, the resulting terms are as follows:

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".

It is precisely because of this language processing step that a search for drove can also turn up drive.
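
In Lucene, this step is implemented as token filters chained after the tokenizer. Below is a minimal sketch of an analyzer that lowercases and applies the Porter stemmer (package locations vary slightly across Lucene versions; note that Porter stemming handles "driving" to "drive" but not an irregular form like "drove", which needs dictionary-based lemmatization):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemmingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();       // step two: tokenize
        TokenStream result = new LowerCaseFilter(source);  // step three: lowercase
        result = new PorterStemFilter(result);             // step three: stem
        return new TokenStreamComponents(source, result);
    }
}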

Step Four: Pass the resulting terms (Term) to the index component (Indexer).

The index component (Indexer) mainly does the following things:

1. Use the resulting terms (Term) to create a dictionary.

In our example, the dictionary is as follows:

Term     Document ID
student  1
allow    1
go       1
their    1
friend   1
allow    1
drink    1
beer     1
my       2
friend   2
jerry    2
go       2
school   2
see      2
his      2
student  2
find     2
them     2
drink    2
allow    2

2. Sort the dictionary in alphabetical order.

Term     Document ID
allow    1
allow    1
allow    2
beer     1
drink    1
drink    2
find     2
friend   1
friend   2
go       1
go       2
his      2
jerry    2
my       2
school   2
see      2
student  1
student  2
their    1
them     2

3. Merge identical terms (Term) to form the document posting lists (Posting List).

In the resulting table, there are several definitions:

Document Frequency: the document frequency, indicating how many documents in total contain this term (Term).

Frequency: the term frequency, indicating how many times this term (Term) appears in a given document.

So for word "allow", a total of two documents contain this word (term), so that there are two items in the list of documents that follow the words (term), the first of which represents the first document containing "Allow", the 1th document, where "allow" appears 2 times, The second item represents the second document that contains "allow", which is document 2nd, where "Allow" appears 1 times.

So far, the index has been created so that we can quickly find the document we want.
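
To tie the indexer's three steps together, here is a small self-contained sketch that builds the dictionary and posting lists (with term frequencies) for the two example files, using a sorted map so the dictionary comes out in alphabetical order (plain Java rather than Lucene's internals, which are far more sophisticated):

import java.util.*;

public class MiniIndexer {
    public static void main(String[] args) {
        // Terms after tokenization and linguistic processing (from the article).
        String[][] docs = {
            {"student", "allow", "go", "their", "friend", "allow", "drink", "beer"},   // doc 1
            {"my", "friend", "jerry", "go", "school", "see", "his", "student",
             "find", "them", "drink", "allow"}                                          // doc 2
        };

        // Dictionary sorted alphabetically; each term maps to
        // docId -> term frequency (the posting list).
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            for (String term : docs[docId - 1]) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }

        // Print: term -> {docId=frequency, ...}
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
        // e.g. allow -> {1=2, 2=1}
    }
}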

And in this process, we are pleasantly surprised to find that searches for "drive", "driving", "drove", or "driven" can all be matched. This is because in our index, "driving", "drove", and "driven" all become "drive" through language processing; when searching, if you enter "driving", the query string also passes through steps one to three here and thus becomes a query for "drive", so the desired documents can be found.

4. How to search the index

At this point it may seem we can declare that we have found the documents we want.

But that is not the end; this is only one aspect of full-text search. Isn't it? If only one or ten documents contain our query string, we have indeed found them. But what if there are a thousand, or even thousands, of results? Which file is the one you want most?

Open Google and say you want to find a job at Microsoft, so you enter "Microsoft job" and find a total of 22.6 million results returned. What a huge number! Suddenly, finding too much turns out to be a problem too: among so many results, how do you put the most relevant ones in front?

Of course Google does this very well; you do find jobs at Microsoft. Imagine how awful it would be if the first few results were all "Microsoft does a good job at software industry ...".

How, like Google, do we find the most relevant results among thousands of search results?

How do we determine the relevance between a document and a query statement?

This brings us back to our third question: how to search the index.

Searching is mainly divided into the following steps:

Step One: The user enters a query statement.

A query statement, like our ordinary language, has its own grammar.

Different query statements have different syntaxes; SQL statements, for example, have their own syntax.

The syntax of a query statement varies with the implementation of the full-text retrieval system, but the most basic operators are: AND, OR, NOT.

For example, a user enters the statement: Lucene AND learned NOT Hadoop.

This indicates that the user wants to find documents that contain Lucene and learned but not Hadoop.

Step Two: Perform lexical analysis, syntactic analysis, and language processing on the query statement.

Because the query statement has a syntax, it must likewise undergo lexical analysis, syntactic analysis, and language processing.

1. Lexical analysis is mainly used to identify words and keywords.

In the example above, lexical analysis yields the words lucene, learned, and hadoop, and the keywords AND and NOT.

If an illegal keyword appears during lexical analysis, an error occurs. For example, in Lucene AMD learned, the misspelling of AND causes AMD to participate in the query as an ordinary word.

2. Syntactic analysis mainly forms a syntax tree according to the syntax rules of the query statement.

If a query statement does not conform to the syntax rules, an error is reported. For example, Lucene NOT AND learned produces an error.

As in the example above, Lucene AND learned NOT Hadoop forms the following syntax tree:

3. Language processing is almost identical to the language processing in the indexing process.

For example, learned becomes learn.

After the second step, we get a language-processed syntax tree.
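
In Lucene, steps one and two correspond to the query parser. A minimal sketch (the field name "content" is an assumption; the printed form reflects the parsed, analyzed boolean structure):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("Lucene AND learned NOT Hadoop");
        // Prints something like: +content:lucene +content:learned -content:hadoop
        System.out.println(query);
    }
}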

Step Three: Search the index to obtain the documents that conform to the syntax tree.

This step itself has several small steps:

First, in the inverted index table, find the document linked lists containing lucene, learn, and hadoop, respectively.

Second, merge the linked lists for lucene and learn, obtaining a list of documents containing both lucene and learn.

Then, take the difference of this list against the hadoop document list, removing the documents that contain hadoop, and obtain a list of documents that contain both lucene and learn but do not contain hadoop.

This list of documents is exactly what we are looking for. The sketch below shows the whole pipeline end to end.
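
The following sketch runs these steps end to end with Lucene's high-level API, which performs the list merging internally: it indexes the two example files in memory, then runs a boolean query (the query itself is hypothetical, chosen to match the example files; API details such as ByteBuffersDirectory and searcher.doc vary across Lucene versions, so treat this as an illustration rather than version-specific reference code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();   // in-memory index for the demo

        // Index creation: the two example files from the article.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            String[] files = {
                "Students should be allowed to go out with their friends, "
                    + "but not allowed to drink beer.",
                "My friend Jerry went to school to see his students "
                    + "but found them drunk which is not allowed."
            };
            for (String text : files) {
                Document doc = new Document();
                doc.add(new TextField("content", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Index search: parse the query, search, and print the hits with their scores.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer)
                    .parse("students AND allowed NOT beer");   // hypothetical query
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(hit.score + " : "
                        + searcher.doc(hit.doc).get("content"));
            }
        }
    }
}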

Step Four: Sort the results according to the relevance between the obtained documents and the query statement.

Although in the previous step we obtained the desired documents, the query results should still be sorted by their relevance to the query statement: the more relevant, the closer to the front.

How do we calculate the relevance between a document and a query statement?

We can treat the query statement as a short document too, score (scoring) the relevance between documents, and put the documents with high relevance scores in front.

So how do we score the relationship between documents? That is not an easy thing to do.

