Lucene Introductory Case One

1. Configuring the development environment

Official website: http://lucene.apache.org/

JDK requirement: 1.7 or later

Jar packages required for the index library: lucene-core-4.10.3.jar, lucene-analyzers-common-4.10.3.jar

Other jar packages: commons-io-2.4.jar, junit-4.9.jar

2. Create an index library

Step one: Create a Java project and import the jar packages.

Step two: Create an IndexWriter object.

1) Specify the location of the index library with a Directory object.

2) Specify an analyzer to analyze the contents of the documents.

Step three: Create a Document object.

Step four: Create Field objects and add the fields to the Document object.

Step five: Use the IndexWriter object to write the Document object to the index library. This builds the index, and both the index and the Document objects are written to the index library.

Step six: Close the IndexWriter object.

// Imports used by the example test class in this article:
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer; // third-party IK Analyzer

/*
 * Create an index library
 */
@Test
public void createIndex() throws Exception {
    // 1. Specify where the index is stored: it can be kept in memory or on disk
    // Directory directory = new RAMDirectory(); // keep the index in memory
    Directory directory = FSDirectory.open(new File("E:\\temp\\index"));
    Analyzer analyzer = new StandardAnalyzer();

    // IndexWriterConfig: the first parameter is the Lucene version, the second is the analyzer
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LATEST, analyzer);
    // 2. Create an IndexWriter object; the analyzer (word breaker) must be created first
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    // 3. Read the source files using IO streams
    File sourceDir = new File("E:\\temp\\source_folder");
    for (File f : sourceDir.listFiles()) {
        // get the file name
        String fileName = f.getName();
        // get the file size
        long fileSize = FileUtils.sizeOf(f);
        // get the file contents
        String fileContent = FileUtils.readFileToString(f);
        // get the file path
        String filePath = f.getPath();
        // 4. Create a Document object
        Document document = new Document();
        // 5. Create field objects and add them to the document
        // 5.1 file name: indexed as a single token (StringField is not analyzed), stored
        StringField nameField = new StringField("name", fileName, Store.YES);
        // 5.2 file content: analyzed, indexed, not stored
        TextField contentField = new TextField("content", fileContent, Store.NO);
        // 5.3 file path: not analyzed, not indexed, stored
        StoredField pathField = new StoredField("path", filePath);
        // 5.4 file size: indexed as a numeric value, stored
        LongField sizeField = new LongField("size", fileSize, Store.YES);
        document.add(nameField);
        document.add(contentField);
        document.add(pathField);
        document.add(sizeField);
        // 6. Write the document object to the index library
        indexWriter.addDocument(document);
    }
    // 7. Release resources
    indexWriter.close();
}

3. Querying the Index

Steps:

Step one: Create a Directory object, which specifies where the index library is stored.

Step two: Create an IndexReader object; it needs the Directory object.

Step three: Create an IndexSearcher object; it needs the IndexReader object.

Step four: Create a TermQuery object, specifying the field to query and the query keyword.

Step five: Execute the query.

Step six: Return the query results; traverse the results and print them.

Step seven: Close the IndexReader object.

IndexSearcher search methods:

  • indexSearcher.search(query, n): searches by the query and returns the top n records by score.
  • indexSearcher.search(query, filter, n): searches by the query, applies a filter policy, and returns the top n records by score.
  • indexSearcher.search(query, n, sort): searches by the query, applies a sort policy, and returns the top n records.
  • indexSearcher.search(booleanQuery, filter, n, sort): searches by the query, applies both a filter policy and a sort policy, and returns the top n records.
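A minimal sketch of the sort variant (hedged: it reuses the index location and the numeric "size" field from the examples above, and assumes Lucene 4.x, where sorting on an indexed numeric field such as a LongField works through the field cache; it also needs org.apache.lucene.search.Sort and org.apache.lucene.search.SortField):

// search as before, but order the hits by the "size" field, largest first
Query query = new TermQuery(new Term("content", "lucene"));
Directory directory = FSDirectory.open(new File("E:\\temp\\index"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// sort descending by the numeric "size" field instead of by score
Sort sort = new Sort(new SortField("size", SortField.Type.LONG, true));
TopDocs topDocs = indexSearcher.search(query, 10, sort);
System.out.println("hits: " + topDocs.totalHits);
indexReader.close();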

TopDocs

Lucene search results can be traversed through TopDocs. The TopDocs class provides a small number of properties:

  • totalHits: the total number of records matching the search criteria.
  • scoreDocs: the array of the top matching records.

Attention:

The search method needs the number of records to return, n: indexSearcher.search(query, n).

topDocs.totalHits is the number of all matching records in the index library.

topDocs.scoreDocs is the array of the most relevant leading records; the length of scoreDocs is less than or equal to the n specified in the search method.

/*
 * Query the index library
 */
@Test
public void searchIndex() throws Exception {
    // 1. Specify where the index is stored
    Directory directory = FSDirectory.open(new File("E:\\temp\\index"));
    // 2. Create an IndexReader object
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create an IndexSearcher object; it is constructed from the IndexReader
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // 4. Create a query object; it needs the field name and the query keyword.
    //    A TermQuery is not analyzed, so the term must match the indexed
    //    (lowercased) token: "lucene", not "Lucene"
    Query query = new TermQuery(new Term("content", "lucene"));
    // 5. Get the query results
    TopDocs topDocs = indexSearcher.search(query, 10);
    // print the total number of matching records
    System.out.println("Total number of records queried: " + topDocs.totalHits);
    // 6. Traverse and print the query results
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // get the id of the document
        int docId = scoreDoc.doc;
        System.out.println("Document id: " + docId);
        // fetch the Document object by id
        Document document = indexSearcher.doc(docId);
        // read the stored fields from the Document object
        System.out.println("Document name: " + document.get("name"));
        System.out.println("Document path: " + document.get("path"));
        System.out.println("Document size: " + document.get("size"));
        // the content field was stored with Store.NO above, so this prints null
        System.out.println("Document content: " + document.get("content"));
    }
    // 7. Close the IndexReader
    indexReader.close();
}

4. Analyzer Test

4.1 Execution of the analyzer (Analyzer)

The generation process for token units (original figure omitted: Reader → Tokenizer → TokenFilter chain):

Starting from a Reader character stream, a Tokenizer (word breaker) is created on the Reader and splits the stream into tokens; the tokens then pass through three TokenFilters.

To see the result of the analysis, you only need to look at the contents of the TokenStream. Every analyzer has a tokenStream method that returns a TokenStream object.

  4.2 Word breaker test

/*
 * Standard analyzer test
 */
@SuppressWarnings("resource")
@Test
public void testAnalyzer() throws Exception {
    // 1. Create an analyzer object
    Analyzer analyzer = new StandardAnalyzer();
    // Analyzer analyzer = new IKAnalyzer(); // IK Analyzer word breaker
    // 2. Get a TokenStream from the analyzer object.
    //    The first parameter is the field name; it can be null or "".
    //    The second parameter is the string to be analyzed.
    TokenStream tokenStream = analyzer.tokenStream("", "Xiao Xin likes mating");
    // 3. Set references; there can be more than one type, such as a term
    //    reference, an offset reference, and so on
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    // 4. Call the reset method of the TokenStream to reset the pointer
    tokenStream.reset();
    // 5. Use a while loop to traverse the token list
    while (tokenStream.incrementToken()) {
        // 6. Print the tokens
        System.out.println("keyword start offset: " + offsetAttribute.startOffset());
        System.out.println("keyword: " + charTermAttribute);
        System.out.println("keyword end offset: " + offsetAttribute.endOffset());
    }
    // 7. Close the TokenStream
    tokenStream.close();
}

  4.3 Chinese analyzers

1. Lucene's built-in Chinese analyzers

StandardAnalyzer:

Single-character segmentation: splits Chinese text character by character. For example, "我爱中国" ("I love China") becomes "我", "爱", "中", "国".

CJKAnalyzer:

Bigram segmentation: splits by every two adjacent characters. For example, "我是中国人" ("I am Chinese") becomes "我是", "是中", "中国", "国人".

Neither of the two analyzers above meets the requirements.

SmartChineseAnalyzer:

Good support for Chinese, but poor extensibility; extension dictionaries, stop-word dictionaries, and the like are difficult to handle.
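A quick way to see these behaviors is to reuse the TokenStream traversal from section 4.2 with the built-in analyzers. A hedged sketch (it assumes org.apache.lucene.analysis.cjk.CJKAnalyzer from the lucene-analyzers-common jar listed earlier):

// Compare built-in analyzers on the same Chinese text
String text = "我是中国人";
for (Analyzer analyzer : new Analyzer[] { new StandardAnalyzer(), new CJKAnalyzer() }) {
    TokenStream tokenStream = analyzer.tokenStream("", text);
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    System.out.print(analyzer.getClass().getSimpleName() + ": ");
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term + "] ");
    }
    System.out.println();
    tokenStream.close();
}
// Expected output, per the descriptions above:
// StandardAnalyzer: [我] [是] [中] [国] [人]
// CJKAnalyzer: [我是] [是中] [中国] [国人]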

2. Third-party Chinese parser
  • Paoding: Discovering latest version in https://code.google.com/p/paoding/support Lucene 3.0, and the latest submitted code in 2008-06-03, the latest in SVN is also submitted in 2010, is outdated, Not considered.
  • MMSEG4J: The latest version has moved from https://code.google.com/p/mmseg4j/to HTTPS://GITHUB.COM/CHENLB/MMSEG4J-SOLR, supporting Lucene 4.10, and the latest commit code in GitHub is June 2014, from 09 ~14, there are: 18 versions, that is, almost 3 size versions of a year, with a greater degree of activity, using the MMSEG algorithm.
  • Ik-analyzer: The latest version on https://code.google.com/p/ik-analyzer/, support Lucene 4.10 Since December 2006 launch of the 1.0 version, Ikanalyzer has launched 4 large versions. Initially, it is an open source project Luence as the application of the main, combined with the dictionary word segmentation and Grammar analysis algorithm in Chinese language sub-phrase pieces. Starting with the 3.0 release, IK evolved into a common Java-oriented word breaker, independent of the Lucene project, while providing the default optimizations for Lucene. In the 2012 version, IK implements a simple word segmentation ambiguity elimination algorithm, which marks the derivation of the IK word breaker from the simple dictionary participle to the simulation semantic participle. But it was not updated after December 2012.
  • ANSJ_SEG: The latest version in HTTPS://GITHUB.COM/NLPCHINA/ANSJ_SEG tags has only 1.1 version, from 2012 to 2014 updated size 6 times, but the author himself on October 10, 2014 explained: " Maybe I will not have the energy to maintain the ansj_seg ", now by" Nlp_china "management. November 2014 is updated. Does not indicate whether Lucene is supported, is a word segmentation algorithm made by the CRF (conditional random field) algorithm.
  • Imdict-chinese-analyzer: The latest version in https://code.google.com/p/imdict-chinese-analyzer/, the latest update also in May 2009, download the source code, does not support Lucene 4.10. Is the use of hmm (hidden Markov chain) algorithm.
  • JCSEG: The latest version in Git.oschina.net/lionsoul/jcseg, supports Lucene 4.10, and the author has a high degree of activity. Using the MMSEG algorithm.

  4.4 How to use the IK Analyzer Chinese analyzer

Usage:

Step one: Add the jar package to the project.

Step two: Add the configuration file, the extension dictionary, and the stop-word dictionary to the classpath.

Note: The mydict.dic and ext_stopword.dic files must be UTF-8 encoded; take care to use UTF-8 without a BOM.
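As a hedged sketch of step two, the configuration file (conventionally named IKAnalyzer.cfg.xml at the classpath root; the entry keys below follow the IK Analyzer 2012 distribution, and the dictionary file names are the ones mentioned above):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary; multiple files are separated by ";" -->
    <entry key="ext_dict">mydict.dic</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">ext_stopword.dic</entry>
</properties>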

  4.5 When to use the analyzer

Using the analyzer when indexing

When a user enters a keyword to search, the keyword must match the words contained in the document's field contents. The field contents therefore need to be analyzed at index time, and an Analyzer is needed to produce the tokens. What the analyzer processes is the value of a Field in the document; the field value is analyzed only when the field's tokenized attribute (whether to segment) is true.

For some fields, no analysis is needed:

1. Fields that are not used as query criteria, such as a file path.

2. Fields that are matched as a whole rather than word by word, such as an order number or an ID card number.

Using the analyzer when searching

Search keywords are analyzed in the same way as indexed content: use an Analyzer to analyze and segment the search keywords, then search with each resulting word. For example, the search keywords "spring web" are segmented into the terms "spring" and "web"; each term is looked up in the index dictionary, the matching entries are followed to their documents, and the document contents are then read. See the sketch after this note.

Queries that match a whole field value can be searched without analysis, for example queries by order number or social security number.

Note: The analyzer used when searching must be consistent with the analyzer used when indexing.
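A minimal sketch of analyzed search with QueryParser, assuming Lucene 4.10 and the lucene-queryparser jar (not in this article's jar list), which provides org.apache.lucene.queryparser.classic.QueryParser; the field name and index location carry over from the earlier examples:

// QueryParser analyzes the query string with the same analyzer used at
// index time, then searches with the resulting terms
Analyzer analyzer = new StandardAnalyzer();
QueryParser queryParser = new QueryParser("content", analyzer);
Query query = queryParser.parse("spring web");

Directory directory = FSDirectory.open(new File("E:\\temp\\index"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
TopDocs topDocs = indexSearcher.search(query, 10);
System.out.println("hits: " + topDocs.totalHits);
indexReader.close();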

5. Adding to the index library

Add a Document object to the index library.

Step one: Create an IndexWriter object.

Step two: Create a Document object.

Step three: Write the Document object to the index library.

Step four: Close the IndexWriter.

/**
 * Create an IndexWriter based on the location of the index library
 */
public static IndexWriter getIndexWriter() throws IOException {
    // 1. Specify the location of the index library
    Directory directory = FSDirectory.open(new File("E:\\temp\\index"));
    // 2. Create an IndexWriter object; an analyzer and an IndexWriterConfig are needed first
    Analyzer analyzer = new IKAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LATEST, analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    return indexWriter;
}

// Add a document
@Test
public void addDocument() throws Exception {
    IndexWriter indexWriter = getIndexWriter();
    // 3. Create a Document object
    Document document = new Document();
    // 4. Create field objects
    TextField nameField = new TextField("name", "Xiao Xin document", Store.YES);
    TextField contentField = new TextField("content", "Xiao Xin likes mating", Store.YES);
    // 5. Add the field objects to the Document object
    document.add(nameField);
    document.add(contentField);
    // 6. Write the Document object to the index library
    indexWriter.addDocument(document);
    // 7. Close the IndexWriter
    indexWriter.close();
}

6. Deleting from the index library

6.1 Delete all

// Delete all indexes
@Test
public void deleteAllIndex() throws Exception {
    // get an IndexWriter object
    IndexWriter indexWriter = getIndexWriter();
    // call the IndexWriter method that deletes the entire index
    indexWriter.deleteAll();
    // release resources
    indexWriter.close();
}

6.2 Delete based on query

// Delete indexes that match a query condition
@Test
public void deleteQueryIndex() throws Exception {
    // get an IndexWriter
    IndexWriter indexWriter = getIndexWriter();
    // set up the query condition
    Query query = new TermQuery(new Term("content", "Xiao Xin"));
    // delete the documents that match the query
    indexWriter.deleteDocuments(query);
    // release resources
    indexWriter.close();
}

7. Updating the index library

// Update the index library
@Test
public void updateIndex() throws Exception {
    // 1. Get the IndexWriter object
    IndexWriter indexWriter = getIndexWriter();
    // 2. updateDocument takes the term to match (field plus keyword): it first
    //    queries by the term, deletes the matching documents, and then adds the
    //    new Document object
    Document document = new Document();
    document.add(new TextField("name", "updated document", Store.YES));
    document.add(new TextField("content", "updated document contents", Store.YES));
    indexWriter.updateDocument(new Term("content", "treat"), document);
    // 3. Release resources
    indexWriter.close();
}
