Lucene Index Search

Source: Internet
Author: User
Tags: solr, word segmentation

Forwarded from: https://my.oschina.net/u/3777556/blog/1647031


What is Lucene?


Lucene is an open-source full-text search engine toolkit published by the Apache Software Foundation and originally written by Doug Cutting. It provides a complete indexing engine and query engine, plus a partial text analysis engine.

Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene is the classic ancestor of full-text search; many search engines today are built on it, and the underlying ideas are the same.

Lucene is a keyword-based text search tool: it searches text content only within a single site and cannot search across sites.

Since Lucene only searches within a site, what are the familiar search engines such as Baidu and Google based on?

As the diagram makes clear, search engines such as Baidu and Google actually use web crawler programs to search across the whole web ...

Why are we using Lucene?


In the introduction we already said: Lucene is not a search engine, it only searches text within a site. So why do we need to learn it?

Back when we wrote the tax service system, we had already used SQL to implement site search.

Since SQL can already do the job, why do we still need to learn Lucene?

Let's look at what searching with SQL can and cannot do:

    • (1) SQL can only search database tables; it cannot directly search text files on the hard disk

    • (2) SQL has no relevance ranking

    • (3) SQL search results have no keyword highlighting

    • (4) SQL requires a database, and the database itself has a large memory overhead, for example: Oracle

    • (5) SQL search is sometimes slow, especially when the database is not local, for example: Oracle
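
To make that concrete, here is a minimal sketch, in Java/JDBC, of the kind of LIKE-based SQL search being described (the article table and its content and title columns are hypothetical). Notice there is no relevance score and no highlighting, and the LIKE scan gets slow on large text columns:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SqlSearchDemo {
    public static List<String> search(Connection connection, String keyword) throws SQLException {
        // full scan of the text column; no score, no highlight
        PreparedStatement ps = connection.prepareStatement(
                "SELECT title FROM article WHERE content LIKE ?");
        ps.setString(1, "%" + keyword + "%");
        ResultSet rs = ps.executeQuery();
        List<String> titles = new ArrayList<String>();
        while (rs.next()) {
            titles.add(rs.getString("title"));
        }
        return titles;
    }
}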

Now let's look at what searching for the keyword Lucene in Baidu looks like:

As shown above, we cannot achieve this effect with SQL. So we learn Lucene to help us search data within a site by text keyword!

If we need to search by keyword, we can use either SQL or Lucene ... In that respect Lucene and SQL are alike: both are code we write in the persistence layer.

First, Quick Start


Next we'll explain how to use Lucene ..... Before covering the Lucene API, let's first talk about what Lucene actually stores ... SQL keeps its data in the database's memory and in data files on the hard disk ... So what is inside Lucene?

Lucene stores a series of binary compressed files plus some control files on the computer's hard disk. Together these are called the index library, and the index library has two parts:

    • (1) Original records

      • The original text deposited into the index library, for example: I'm Jongforchen.

    • (2) Glossary

      • Every word produced from the original records by some split strategy (that is, a word breaker) is deposited into a table for later searching

In other words: the place where Lucene stores data is commonly called the index library, and the index library is divided into two parts: the original records and the glossary ....

1.1 Original Records and glossary

When we want to store data in the index library, the data is first deposited into the original records ....

But users query our records by keyword, so the data we originally stored must be split! The split data is stored in the glossary.

The glossary is similar to the index tables we studied in Oracle; when the data is split, each entry is given a corresponding index value.

When a user searches by keyword, the program first checks whether the glossary contains that keyword. If it does, the keyword's entry locates the matching original records, which are returned for the user to view.

Look at the following diagram to help understand:

At this point someone may wonder: are the original records split into single characters? Wouldn't the glossary then contain an enormous number of keywords?

In fact, when writing to the original records table we can specify which algorithm is used to split the data into the glossary ..... The diagram shows Lucene's standard word segmentation algorithm, which splits Chinese text one character at a time. We can use other word segmentation algorithms instead, for example splitting two characters at a time, or other schemes.

1.2 Writing the first Lucene program

First, import the development packages Lucene needs:

    • lucene-core-3.0.2.jar (Lucene core)

    • lucene-analyzers-3.0.2.jar (word breakers/analyzers)

    • lucene-highlighter-3.0.2.jar (highlights the matched words in search results to prompt the user)

    • lucene-memory-3.0.2.jar (index library optimization strategies)

Create a User object; the User object encapsulates our data ....

/**
 * Created by Ozc on 2017/7/12.
 */
public class User {

    private String id;
    private String userName;
    private String sal;

    public User() {
    }

    public User(String id, String userName, String sal) {
        this.id = id;
        this.userName = userName;
        this.sal = sal;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public String getSal() {
        return sal;
    }

    public void setSal(String sal) {
        this.sal = sal;
    }
}

To query data with Lucene, we first need an index library! So let's create an index library and store our data in it.

Steps to create an index library (a sketch follows the list):

    • 1) Create a JavaBean object

    • 2) Create a Document object

    • 3) Put all the attribute values of the JavaBean into the Document object; the field names may be the same as or different from the JavaBean's property names

    • 4) Create an IndexWriter object

    • 5) Write the Document object into the index library through the IndexWriter object

    • 6) Close the IndexWriter object
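
Here is a minimal sketch of those six steps against the Lucene 3.0.2 API. The index path E:/createIndexDB and the field values are examples, and StandardAnalyzer stands in for whichever word breaker you choose:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class CreateIndexDemo {
    public static void main(String[] args) throws Exception {
        // 1) create the JavaBean object (example values)
        User user = new User("1", "Jongforchen", "10000");

        // 2) create the Document object and 3) copy the JavaBean's properties into it
        Document document = new Document();
        document.add(new Field("id", user.getId(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("userName", user.getUserName(), Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sal", user.getSal(), Field.Store.YES, Field.Index.ANALYZED));

        // 4) create the IndexWriter over a directory on disk (example path)
        Directory directory = FSDirectory.open(new File("E:/createIndexDB"));
        IndexWriter indexWriter = new IndexWriter(directory,
                new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.LIMITED);

        // 5) write the Document into the index library and 6) close the writer
        indexWriter.addDocument(document);
        indexWriter.close();
    }
}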

Once the program has run, we can see the index library on the hard drive.

At this point we can't tell whether the records were really stored, because we can't read them directly: the data in the index library is kept in CFS files, and a CFS file can't simply be opened.

So let's read the data back from the index library using a keyword, and see whether reading succeeds.

Steps to query the contents of the index library by keyword (a sketch follows the list):

    • 1) Create an IndexSearcher object

    • 2) Create a QueryParser object

    • 3) Create a Query object that encapsulates the keyword

    • 4) Use the IndexSearcher object to search the index library for the first 100 records that meet the criteria (or fewer, if fewer match)

    • 5) Get the number of qualifying hits

    • 6) Use the IndexSearcher object to fetch the Document object for each hit number in the index library

    • 7) Take all the attributes out of each Document object, encapsulate them back into a JavaBean object, and save the beans in a collection for later use
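
And a matching sketch of the seven query steps (same example index path; the field name and keyword are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SearchIndexDemo {
    public static void main(String[] args) throws Exception {
        // 1) open a searcher over the same directory we indexed into
        IndexSearcher indexSearcher = new IndexSearcher(FSDirectory.open(new File("E:/createIndexDB")));

        // 2) a parser bound to the "userName" field, 3) a Query wrapping the keyword
        QueryParser queryParser = new QueryParser(Version.LUCENE_30, "userName",
                new StandardAnalyzer(Version.LUCENE_30));
        Query query = queryParser.parse("Jongforchen"); // example keyword

        // 4) fetch at most the first 100 matching records, 5) read the hit count
        TopDocs topDocs = indexSearcher.search(query, 100);
        System.out.println("hits: " + topDocs.totalHits);

        // 6) resolve each hit number to its Document, 7) rebuild the JavaBeans
        List<User> users = new ArrayList<User>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document document = indexSearcher.doc(scoreDoc.doc);
            users.add(new User(document.get("id"), document.get("userName"), document.get("sal")));
        }
        indexSearcher.close();
    }
}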


1.3 Further description of Lucene code

The idea of our Lucene program: encapsulate a JavaBean into a Document object, then write the Document into the index library through the IndexWriter. When the user queries, an IndexSearcher reads data from the index library, finds the matching Document objects, parses their contents, and encapsulates them back into JavaBean objects for us to use.

Second, Lucene Code Optimization


Looking back at the code we wrote in the Quick Start, let me pull out some representative fragments:

The following kind of code appears both when populating the index library and when querying it. It is duplicate code!

The following code encapsulates the JavaBean data into the Document object; we can do that with reflection .... Otherwise, when there are many JavaBeans to add to Document objects, we end up writing a lot of near-identical code.

The following code takes the data out of the Document object and encapsulates it back into a JavaBean. If the JavaBean has many properties, we again have to write a lot of similar code ....

2.1 Writing Lucene Tool classes

When writing the tool class, the following points are worth noting (a sketch of the resulting class follows the list):

    • To read an object's properties generically, we can build the name of each property's get method.

    • Once we have the get method, we can invoke it to obtain the corresponding value.

    • When manipulating an object's private properties, we need brute-force access (setAccessible(true)).

    • Whenever we hold the three variables property name, value, and object, remember to use the BeanUtils component.
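
Putting those notes together, here is one possible sketch of such a LuceneUtil class (the index path is an example, commons-beanutils supplies BeanUtils, and the static getters match the LuceneUtil calls used later in this article):

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class LuceneUtil {

    private static Directory directory;
    private static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

    static {
        try {
            directory = FSDirectory.open(new File("E:/createIndexDB")); // example path
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static Directory getDirectory() { return directory; }
    public static Analyzer getAnalyzer() { return analyzer; }
    public static Version getVersion() { return Version.LUCENE_30; }
    public static IndexWriter.MaxFieldLength getMaxFieldLength() { return IndexWriter.MaxFieldLength.LIMITED; }

    // turn any JavaBean into a Document by reading its fields via reflection
    public static Document javaBeanToDocument(Object bean) throws Exception {
        Document document = new Document();
        for (java.lang.reflect.Field f : bean.getClass().getDeclaredFields()) {
            f.setAccessible(true); // brute-force access to private fields
            document.add(new Field(f.getName(), String.valueOf(f.get(bean)),
                    Field.Store.YES, Field.Index.ANALYZED));
        }
        return document;
    }

    // turn a Document back into a JavaBean with the BeanUtils component
    public static <T> T documentToJavaBean(Document document, Class<T> clazz) throws Exception {
        T bean = clazz.newInstance();
        for (java.lang.reflect.Field f : clazz.getDeclaredFields()) {
            BeanUtils.setProperty(bean, f.getName(), document.get(f.getName()));
        }
        return bean;
    }
}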

2.2 Refactoring the program with LuceneUtil
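
With the utility class in place, the Quick Start code shrinks to something like this sketch (indexSearcher and scoreDoc as in the query code above):

// indexing: one call per bean instead of one line per field
IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(), LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());
indexWriter.addDocument(LuceneUtil.javaBeanToDocument(new User("1", "Jongforchen", "10000")));
indexWriter.close();

// querying: rebuild the bean straight from the hit's Document
User user = LuceneUtil.documentToJavaBean(indexSearcher.doc(scoreDoc.doc), User.class);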

Third, Index Library Optimization


We can now create an index library and read object data back out of it. In fact, the index library still has room for optimization ....

3.1 Merging files

Every time we add data to the index library, a new CFS file is created automatically ...

This is not good: with a large amount of data, the hard disk ends up with very, very many CFS files ..... Fortunately, the index library merges files for us automatically; the default merge factor is 10.

If we want to change that default, we can do it with code like the following:
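
For example, a sketch reusing the LuceneUtil getters from section 2.1 (setMergeFactor is the Lucene 3.x setter for this value, and document is assumed from the indexing code):

IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(), LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());

// merge segment files once 3 of them accumulate, instead of the default 10
indexWriter.setMergeFactor(3);

indexWriter.addDocument(document);
indexWriter.close();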

3.2 Setting the Memory index Library

Our current program works directly against files on disk, so the IO overhead is relatively large and the speed relatively slow .... We can use a memory index library to improve read/write efficiency ...

A memory index library is very fast, because we operate directly on memory ... However, we must save the memory index library back to the hard-disk index library, and when reading data we must first synchronize the hard-disk index library into the memory index library.
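
A sketch of that pattern with RAMDirectory (org.apache.lucene.store.RAMDirectory), reusing the example path and LuceneUtil getters; addIndexesNoOptimize is the Lucene 3.x way to copy one index into another:

// hard-disk index library, and a memory index library initialized from it
Directory fsDirectory = FSDirectory.open(new File("E:/createIndexDB"));
Directory ramDirectory = new RAMDirectory(fsDirectory);

// reads and writes operate on the fast in-memory index
IndexWriter ramWriter = new IndexWriter(ramDirectory, LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());
ramWriter.addDocument(document);
ramWriter.close();

// before shutting down, flush the memory index back to the hard-disk index
// (true = recreate the disk index so the two don't end up with duplicates)
IndexWriter fsWriter = new IndexWriter(fsDirectory, LuceneUtil.getAnalyzer(), true, LuceneUtil.getMaxFieldLength());
fsWriter.addIndexesNoOptimize(new Directory[]{ramDirectory});
fsWriter.close();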

Fourth, Word Breakers


As we said in an earlier section, when saving data into the index library we use some algorithm to move the data from the original records table into the glossary ... The collective name for these algorithms is word breakers (analyzers).

Word breaker: an algorithm that splits the characters of Chinese and English text into words, forming a vocabulary list to be searched against the keywords the user later types.

As for why we need a word breaker, we already said it explicitly: users cannot type one of our original records in full, so they query the original records table by keyword .... A word breaker lets us match the relevant data as broadly as possible.

4.1 Word breaker Flow

    • Step one: split the text into words with the word breaker

    • Step two: remove stop words

    • Step three: for English, lowercase the letters, so the search is not case-sensitive

4.2 Word Breaker API

When choosing a word segmentation algorithm, we will find there are very, very many word breaker implementations. We can use code like the following to see how a given word breaker splits the data:
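
For example, a sketch that prints every token an analyzer produces (Lucene 3.0 attribute API; the sample text is arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

import java.io.StringReader;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        String text = "My name is Jongforchen, and I am learning Lucene"; // sample text
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30); // swap in any analyzer to compare
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(termAttribute.term()); // one line per token the analyzer produced
        }
    }
}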

After experimenting, we can choose the word segmentation algorithm that suits us ....

4.3 IKAnalyzer Word Breaker

This is a third-party word breaker. To use it we need to import the corresponding jar package:

    • Step one: import Ikanalyzer3.2.0stable.jar

    • Step two: copy IKAnalyzer.cfg.xml, stopword.dic, and xxx.dic into the src directory of MyEclipse, then do the configuration; note that the first line of the dictionary file must be left blank

What's so good about this third-party word breaker? It is the preferred Chinese word breaker ... that is to say: it splits text according to actual Chinese words!
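
Switching to it is then a one-line change wherever the analyzer is created; a sketch, assuming the class name shipped in Ikanalyzer3.2.0stable.jar:

// IKAnalyzer in place of StandardAnalyzer
Analyzer analyzer = new org.wltea.analyzer.lucene.IKAnalyzer();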

Fifth, Processing of Search Results


5.1 Search Results highlighting

With SQL, the data we search for is not highlighted ... With Lucene, we can highlight the keywords in the retrieved content ... which makes for a much better user experience!

5.2 Search Results Summary

If the matched articles are too large and we only want to display part of the content, we can generate a summary ...

It's worth noting that the search results summary must be used together with highlighting (a sketch follows).
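
A sketch of highlighting plus summary with the contrib highlighter (lucene-highlighter-3.0.2.jar); query and the hit's Document doc are assumed to come from the query sketch earlier, and the "content" field name is an example:

import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

// wrap matched keywords in red <font> tags
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));

// summary: keep only a fragment of about 50 characters around the match
highlighter.setTextFragmenter(new SimpleFragmenter(50));

// produce the highlighted summary for the "content" field of one hit
String bestFragment = highlighter.getBestFragment(LuceneUtil.getAnalyzer(), "content", doc.get("content"));
System.out.println(bestFragment); // null if the field contains no match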

5.3 Sorting search results

We have all used several search engines, and searching for the same content with different engines orders the front page differently ... That is each engine's internal sorting of search results ....

Many factors affect a web page's ranking:

    • head/meta "keywords" keywords

    • Tidy page markup

    • Page loading speed

    • Use of div+css

    • And so on

In Lucene, we can set a relevance boost so that results sort differently:

IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(), LuceneUtil.getAnalyzer(), LuceneUtil.getMaxFieldLength());

// raise this document's score so it ranks higher in the results
document.setBoost(20F);
indexWriter.addDocument(document);
indexWriter.close();

Of course, we can also sort by a single field (sketches of single- and multi-field sorting follow):

We can also sort by multiple fields: in a multi-field sort, the second field only comes into play when the values of the first field are equal.
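
A sketch of both (the field names "id" and "count" are examples; indexSearcher and query are reused from the query sketch):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// single-field sort: by "id", descending (true = reverse)
Sort sort = new Sort(new SortField("id", SortField.INT, true));
TopDocs topDocs = indexSearcher.search(query, null, 100, sort);

// multi-field sort: "count" decides first; "id" only breaks ties on equal counts
Sort multiSort = new Sort(new SortField[]{
        new SortField("count", SortField.INT, true),
        new SortField("id", SortField.INT)});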

5.4 Multi-condition Search

In our example so far, we search the contents of a single field by keyword. The syntax is similar to the following:
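
For example (the "content" field name is an example, and keyword is the user's input):

// single-field search: the parser is bound to one field
QueryParser queryParser = new QueryParser(LuceneUtil.getVersion(), "content", LuceneUtil.getAnalyzer());
Query query = queryParser.parse(keyword);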

In fact, we can also search several fields with one keyword, that is, multi-condition search. We use multi-condition search very often, because it matches the relevant data as broadly as possible!

QueryParser queryParser = new MultiFieldQueryParser(LuceneUtil.getVersion(), new String[]{"content", "title"}, LuceneUtil.getAnalyzer());

Sixth, Summary


    • Lucene is the ancestor of full-text indexing engines; later tools such as SOLR and Elasticsearch are based on Lucene (an Elasticsearch write-up will come later, so please wait ~)

    • Lucene stores a series of binary compressed files and some control files on disk, collectively called the index library; the index library has two parts:

      • Original records

      • Vocabulary list (glossary)

    • Understand how the index library is optimized: 1. merging files; 2. setting up a memory index library

    • There are very many Lucene word breakers; choose the word segmentation algorithm that suits you

    • Query results can be highlighted, summarized, and sorted

This is just the tip of the Lucene iceberg. These days people generally use SOLR or Elasticsearch, but if you want to understand Lucene more deeply, do read further material ~
