Introduction to Lucene: a detailed tutorial

Last Update:2018-07-26 Source: Internet

Author: User


Introduction

Lucene is an open source, highly extensible search engine library available from the Apache Software Foundation. You can use Lucene in both commercial and open source applications. Lucene's powerful API focuses on text indexing and searching. It can be used to build search capabilities for a variety of applications, such as e-mail clients, mailing lists, web searches, and database searches. Lucene is used on websites such as Wikipedia, TheServerSide, jGuru, and LinkedIn.

Lucene also provides search capabilities for the Eclipse IDE, Nutch (the well-known open source web search engine), and companies such as IBM®, AOL, and Hewlett-Packard. Lucene has been ported to many other programming languages, including Perl, Python, C++, and .NET. As of July 30, 2009, the latest version of Lucene for the Java™ programming language is V2.4.1.

Lucene has many features. It has powerful, accurate, and efficient search algorithms. It calculates a score for each document that matches a given query and returns the most relevant documents ranked by score. It supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, BooleanQuery, and more. It supports parsing of rich query expressions entered by people. It lets users extend the search behavior using custom sorting, filtering, and query expression parsing. It uses a file-based locking mechanism to prevent concurrent index modifications. It allows searching and indexing at the same time.

As shown in Figure 1, building a full-featured search application with Lucene primarily involves indexing data, searching data, and displaying search results.

Figure 1. Steps to build an application using Lucene

This article uses code snippets from a sample application developed with Lucene V2.4.1 and Java technology. The sample application indexes a set of e-mail documents stored in properties files and shows how to search the index using Lucene's query API. The example also familiarizes you with basic index operations.

Indexing data

Lucene lets you index any data available in textual format. It can be used with almost any data source, as long as textual information can be extracted from it. You can use Lucene to index and search data stored in HTML documents, Microsoft® Word documents, and PDF files. The first step in indexing data is to convert it to a simple text format. You can do this with custom parsers and data converters.

The process of indexing

Indexing is the process of converting text data into a format that facilitates fast searching. A simple analogy is the index at the back of a book: it shows you where each topic appears in the book.
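The book-index analogy can be made concrete as a map from each word to the documents that contain it. The following is a minimal sketch in plain Java, not Lucene code: the class name BookIndexSketch and its methods are hypothetical, and the word splitting is deliberately naive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class BookIndexSketch {
    // Maps each lowercased word to the sorted ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents containing the word, in ascending order.
    List<Integer> lookup(String word) {
        TreeSet<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        BookIndexSketch index = new BookIndexSketch();
        index.add("Lucene is a search library");      // document 0
        index.add("An index makes search fast");      // document 1
        index.add("Lucene stores an index on disk");  // document 2
        System.out.println(index.lookup("lucene"));   // [0, 2]
        System.out.println(index.lookup("search"));   // [0, 1]
        System.out.println(index.lookup("missing"));  // []
    }
}
```

A lookup costs one map access regardless of how many documents were added, which is why this kind of structure suits keyword search.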

Lucene stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files. Most web search engines use an inverted index. It lets users run fast keyword lookups to find the documents that match a given query. Before text data is added to the index, it is processed by an analyzer (using an analysis process).

Analysis

Analysis is the process of converting text data into the fundamental unit of searching, called a term. During analysis, the text data goes through several operations: extracting words, removing common words, ignoring punctuation, reducing words to their root form, converting words to lowercase, and so on. Analysis happens just before indexing and before query parsing. It converts text data into tokens, and these tokens are added to the Lucene index as terms.

Lucene comes with a variety of built-in analyzers, such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, SnowballAnalyzer, and more. They differ in how they tokenize text and which filters they apply. Because analysis removes words before indexing, it reduces the size of the index, but it can also reduce the precision of query processing. You can use the basic building blocks provided by Lucene to create custom analyzers that control the analysis process in your own way. Table 1 shows some of the built-in analyzers and how they process data.

Table 1. Lucene's built-in analyzers

Analyzer	Operations on the text data
WhitespaceAnalyzer	Splits tokens at whitespace
SimpleAnalyzer	Splits text at non-letter characters and converts it to lowercase
StopAnalyzer	Removes stop words (words that are not useful for searching) and converts text to lowercase
StandardAnalyzer	Tokenizes text based on a sophisticated grammar (recognizing e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, alphanumerics, and so on), converts text to lowercase, and removes stop words
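To make the differences in the table easier to see, the sketch below imitates the first three rows in plain Java. These are hypothetical stand-ins, not the Lucene classes: the method names, the splitting rules, and the stop-word list are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for three table rows; not Lucene's analyzers.
class AnalyzerSketch {
    // Assumed stop-word list, used only by this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "of"));

    // Whitespace style: split at whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Simple style: lowercase the text and split at every non-letter character.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stop style: the simple tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox is HERE!";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox, is, HERE!]
        System.out.println(simple(text));     // [the, quick, brown, fox, is, here]
        System.out.println(stop(text));       // [quick, brown, fox, here]
    }
}
```

The stop-style output is the shortest, which illustrates the trade-off: fewer terms reach the index, but a query for a removed word such as "the" can no longer match.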

Core indexing classes

Directory

An abstract class that represents the location where index files are stored. There are two commonly used subclasses:

FSDirectory: an implementation of Directory that stores the index in the actual file system. It is useful for large indexes.

RAMDirectory: an implementation that stores all index data in memory. It suits smaller indexes that can be fully loaded into memory and are destroyed when the application terminates. Because the index is held in memory, it is comparatively fast.

Analyzer

As described above, the analyzer is responsible for processing text data and converting it into tokens stored in the index. IndexWriter accepts an analyzer that is used to tokenize data before it is indexed. To index text properly, you should use an analyzer that is appropriate for the language of the text. The default analyzers work well for English. There are several other analyzers in the Lucene sandbox, including those for Chinese, Japanese, and Korean.

IndexDeletionPolicy

An interface used to implement a policy for customizing the deletion of stale commits from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which keeps only the most recent commit and removes all prior commits as soon as a new commit is completed.

IndexWriter

A class that creates or maintains an index. Its constructor accepts a Boolean that determines whether to create a new index or open an existing one. It provides methods to add, delete, and update documents in the index.

Changes made to the index are initially buffered in memory and periodically flushed to the index directory. IndexWriter exposes several settings that control how indexes are buffered in memory and written to disk. Changes made to the index are not visible to IndexReader until the commit or close method of IndexWriter is called. IndexWriter creates a lock file for the directory to prevent index corruption from simultaneous index updates. IndexWriter lets users specify an optional index deletion policy.

Listing 1. Using Lucene IndexWriter

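import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexDeletionPolicy;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Create instance of Directory where index files will be stored
// (indexDirectory is assumed to hold the path of the index directory)
Directory fsDirectory = FSDirectory.getDirectory(indexDirectory);
/* Create instance of Analyzer, which will be used to tokenize the
input data */
Analyzer standardAnalyzer = new StandardAnalyzer();
// Create a new index
boolean create = true;
// Create the instance of deletion policy
IndexDeletionPolicy deletionPolicy = new KeepOnlyLastCommitDeletionPolicy();
IndexWriter indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create,
	deletionPolicy, IndexWriter.MaxFieldLength.UNLIMITED);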
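The rule that buffered changes stay invisible to readers until a commit can be illustrated with a toy model in plain Java. This sketch only mimics the visibility behavior described before Listing 1. The class CommitVisibilitySketch and its methods are hypothetical, and it does not touch files or locks as Lucene does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model of commit visibility; not Lucene's implementation.
class CommitVisibilitySketch {
    // Documents added since the last commit; readers cannot see these.
    private final List<String> pending = new ArrayList<>();
    // The last committed snapshot; this is all a reader can see.
    private List<String> committed = Collections.emptyList();

    // Buffers a document in memory without publishing it.
    void addDocument(String doc) {
        pending.add(doc);
    }

    // Publishes all buffered documents as a new read-only snapshot.
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        pending.clear();
        committed = Collections.unmodifiableList(next);
    }

    // Returns the snapshot of the most recent commit.
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        CommitVisibilitySketch writer = new CommitVisibilitySketch();
        writer.addDocument("first e-mail");
        System.out.println(writer.openReader().size()); // 0, nothing committed yet
        writer.commit();
        System.out.println(writer.openReader().size()); // 1, visible after commit
    }
}
```

In this sketch a reader obtained before the commit keeps its old snapshot, a design choice that lets searches continue undisturbed while new documents are being added.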

Adding data to an index

Adding text data to an index involves two classes.

Field

Represents a piece of data that is queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field needs to be indexed or analyzed and whether its value needs to be stored. These options can be passed when creating a field instance. Table 2 shows the details of the Field metadata options.

Table 2. Details of Field metadata options

Option	Description
Field.Store.YES	Stores the field value. Suitable for fields that are displayed with search results, such as file paths and URLs.
Field.Store.NO	Does not store the field value, for example the e-mail message body.
Field.Index.NO	Suitable for fields that are not searched and are only stored, such as file paths.
Field.Index.ANALYZED	Indexes and analyzes the field, for example the e-mail message body and subject.
Field.Index.NOT_ANALYZED	Indexes the field without analyzing it, preserving the field's original value as a whole, for example dates and personal names.
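The effect of the options in the table above can be sketched with a small toy index in plain Java. This is an illustration under assumptions, not Lucene's Field class: FieldOptionsSketch, its enums, and its methods are hypothetical, and the analysis step is a naive lowercase split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index showing what "store" and "index" each control.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { NO, ANALYZED, NOT_ANALYZED }

    // Searchable side: "field:term" -> ids of documents containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Retrievable side: one map of field name -> stored value per document.
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            stored.get(docId).put(name, value);   // value can be shown in results
        }
        if (index == Index.NOT_ANALYZED) {
            post(name, value, docId);             // the whole value is one term
        } else if (index == Index.ANALYZED) {
            for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!t.isEmpty()) post(name, t, docId);
            }
        }
    }

    private void post(String field, String term, int docId) {
        List<Integer> ids = postings.computeIfAbsent(field + ":" + term, k -> new ArrayList<>());
        if (!ids.contains(docId)) ids.add(docId);
    }

    List<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new ArrayList<>());
    }

    // Returns the stored value, or null if the field was not stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }
}
```

For example, a sender added with Store.YES and Index.NOT_ANALYZED is found only by its exact full value, a message added with Store.NO and Index.ANALYZED is found by any of its words but cannot be retrieved, and a file path added with Store.YES and Index.NO can be retrieved but never matched by a search.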

Document

A document is a collection of fields. Lucene also supports boosting documents and fields, which is useful when you want to give more importance to some of the indexed data. Indexing a text file involves wrapping the text data in fields, creating a document, populating it with the fields, and adding the document to the index using IndexWriter.

Listing 2 shows an example of adding data to an index.

Listing 2. Adding data to an index

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/* Step 1. Prepare the data for indexing. Extract the data. */
String sender = properties.getProperty("sender");
String date = properties.getProperty("date");
String subject = properties.getProperty("subject");
String message = properties.getProperty("message");
String emaildoc = file.getAbsolutePath();

/* Step 2. Wrap the data in the Fields and add them to a Document */
Field senderField = new Field("sender", sender, Field.Store.YES,
	Field.Index.NOT_ANALYZED);
Field emaildateField = new Field("date", date, Field.Store.NO, Field.Index.NOT_ANALYZED);
Field subjectField = new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED);
Field messageField = new Field("message", message, Field.Store.NO, Field.Index.ANALYZED);
Field emailDocField = new Field("emailDoc", emaildoc, Field.Store.YES, Field.Index.NO);

Document doc = new Document();
// Add these fields to a Lucene Document
doc.add(senderField);
doc.add(emaildateField);
doc.add(subjectField);
doc.add(messageField);
doc.add(emailDocField);

// Step 3: Add this document to the Lucene index.
indexWriter.addDocument(doc);
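Step 1 of Listing 2 reads the e-mail fields from a properties object. As a rough sketch of where such an object might come from, the following uses the standard java.util.Properties class to parse properties-format text. The class EmailPropertiesSketch and the sample values are hypothetical, and only the four keys come from the listing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; the sample application's actual loading code is not shown in the article.
class EmailPropertiesSketch {
    // Parses properties-format text. A real application would read from a file instead.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Made-up sample e-mail, using the keys read in Listing 2.
        String sample = "sender=jane@example.com\n"
                + "date=2009-07-30\n"
                + "subject=Lucene index rebuild\n"
                + "message=The nightly rebuild finished without errors.\n";
        Properties properties = parse(sample);
        System.out.println(properties.getProperty("sender"));  // jane@example.com
        System.out.println(properties.getProperty("subject")); // Lucene index rebuild
    }
}
```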

Searching is the process of looking up words in an index and finding the documents that contain those words. Building search capabilities with Lucene's search API is straightforward. This section discusses the primary classes of the Lucene search API.

Searcher

Searcher is an abstract base class that has various overloaded search methods. IndexSearcher is a commonly used subclass that allows searching indexes stored in a given directory.
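The general idea of a search method, taking a query and returning the best-matching documents first, can be sketched in plain Java. This is not Lucene's Searcher API or its scoring formula: RankedSearchSketch is hypothetical, the score is a plain count of matching words, and the sketch scans every document instead of consulting an index to stay short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical ranked search over a few documents; not Lucene's scoring.
class RankedSearchSketch {
    private final List<String[]> docs = new ArrayList<>();

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("[^a-z0-9]+");
    }

    // Adds a document and returns its id.
    int add(String text) {
        docs.add(tokenize(text));
        return docs.size() - 1;
    }

    // Assumed score: how many tokens of the document equal some query word.
    private int score(int docId, String[] queryTerms) {
        int score = 0;
        for (String token : docs.get(docId)) {
            for (String q : queryTerms) {
                if (!q.isEmpty() && q.equals(token)) score++;
            }
        }
        return score;
    }

    // Returns ids of matching documents, highest score first, at most n of them.
    List<Integer> search(String query, int n) {
        String[] terms = tokenize(query);
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (score(id, terms) > 0) hits.add(id);
        }
        hits.sort((a, b) -> {
            int byScore = Integer.compare(score(b, terms), score(a, terms));
            return byScore != 0 ? byScore : Integer.compare(a, b); // ties: lower id first
        });
        return hits.size() > n ? new ArrayList<>(hits.subList(0, n)) : hits;
    }

    public static void main(String[] args) {
        RankedSearchSketch searcher = new RankedSearchSketch();
        searcher.add("Lucene search basics");                        // document 0
        searcher.add("Search, search, and more search with Lucene"); // document 1
        searcher.add("Cooking pasta at home");                       // document 2
        System.out.println(searcher.search("lucene search", 10));    // [1, 0]
    }
}
```

Document 1 ranks first because four of its words match the query, against two for document 0, and document 2 is left out because nothing in it matches.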

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
