No mercy on good things
From http://www.ibm.com/developerworks/cn/opensource/os-apache-lucenesearch/
Introduction
Lucene is an open source, highly scalable text search engine library available from the Apache Software Foundation. You can use Lucene in both commercial and open source applications. Lucene's powerful APIs focus mainly on text indexing and searching. It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, and database searches. Websites such as Wikipedia, TheServerSide, jGuru, and LinkedIn are powered by Lucene.
Lucene also provides search capabilities for the Eclipse IDE, Nutch (the well-known open source Web search engine), and companies such as IBM®, AOL, and Hewlett-Packard. Lucene has been ported to many other programming languages, including Perl, Python, C++, and .NET. At the time of writing, the latest version of Lucene for the Java™ programming language is v2.4.1.
Lucene features:
- Provides powerful, accurate, and efficient search algorithms.
- Calculates a score for each document that matches a given query and returns the most relevant documents ranked by score.
- Supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, and BooleanQuery.
- Supports parsing of rich query expressions entered by users.
- Lets you extend search behavior with custom sorting, filtering, and query expression parsing.
- Uses a file-based locking mechanism to prevent concurrent index modifications.
- Allows searching and indexing at the same time.
Use Lucene to build applications
Building a full-featured search application with Lucene mainly involves indexing data, searching data, and displaying search results.
Figure 1. Steps to build an application using Lucene
This article uses code snippets from a sample application developed with Lucene v2.4.1 and Java technology. The sample application indexes a set of e-mail documents stored in properties files and shows how to use Lucene's query APIs to search the index. The example also familiarizes you with basic index operations.
Index data
Lucene lets you index any data available in textual format. It can be used with almost any data source as long as textual information can be extracted from it. You can use Lucene to index and search data stored in HTML documents, Microsoft® Word documents, PDF files, and more. The first step in indexing data is to convert it into a simple text format, which you can do with custom parsers and data converters.
Indexing Process
Indexing is the process of converting text data into a format that facilitates rapid searching. A simple analogy is the index at the back of a book, which tells you where each topic appears in the book.
Lucene stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files. Most Web search engines use an inverted index. It lets users perform fast keyword look-ups to find the documents that match a given query. Before text data is added to the index, it is processed by an analyzer (using an analysis process).
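The idea of an inverted index can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not Lucene's implementation: the class name InvertedIndexSketch and its methods are hypothetical, and the tokenization is deliberately naive (lowercase ASCII letters and digits only).

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): maps each term to the ids of the documents that contain it. */
final class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Splits the text into lowercase words, records them, and returns the new document id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Looks up one term directly in the map instead of scanning every document. */
    Set<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(hits);
    }
}
```

A keyword look-up becomes a single map access on the term, which is why this layout suits fast keyword queries.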
Analysis
Analysis is the conversion of text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, and changing words to lowercase. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms to the Lucene index.
Lucene comes with various built-in analyzers, such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, and SnowballAnalyzer. They differ in how they tokenize the text and which filters they apply. Because analysis removes words before indexing, it reduces the index size, but it can also reduce the precision of query processing. You can gain more control over the analysis process by creating custom analyzers from the basic building blocks that Lucene provides. Table 1 shows some of the built-in analyzers and how they process data.
Table 1. Lucene's built-in analyzers
Analyzer | Operations on text data
WhitespaceAnalyzer | Splits tokens at whitespace
SimpleAnalyzer | Splits text at non-letter characters and converts it to lowercase
StopAnalyzer | Removes stop words (words that are not useful for retrieval) and converts text to lowercase
StandardAnalyzer | Tokenizes text based on a sophisticated grammar (recognizing e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, alphanumerics, and so on); converts text to lowercase; removes stop words
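As a rough illustration of what an analysis chain does, the following plain-Java sketch tokenizes text, lowercases it, and drops stop words. It does not use Lucene's analyzer classes; the class name AnalysisSketch and the short stop-word list are assumptions made for this example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical mini analysis chain: tokenize, lowercase, remove stop words. */
final class AnalysisSketch {
    // Illustrative stop-word list only; real analyzers ship their own lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "at", "for", "in", "of", "the", "to"));

    /** Returns the terms that would be added to the index for the given text. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split at every run of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Because the index only ever sees the output of this step, the same processing has to be applied to query text, otherwise query terms will not line up with the indexed terms.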
Core indexing classes
Directory
An abstract class that represents the location where index files are stored. Two subclasses are commonly used:
- FSDirectory: An implementation of Directory that stores indexes in the actual file system. It is useful for large indexes.
- RAMDirectory: An implementation that stores all indexes in memory. It is suitable for smaller indexes that can be fully loaded into memory and destroyed when the application terminates. Because the index is held in memory, it is comparatively faster.
Analyzer
As discussed above, analyzers are responsible for preprocessing text data and converting it into tokens stored in the index. IndexWriter accepts the analyzer used to tokenize data before it is indexed. To index text properly, use an analyzer that is appropriate for the language of the text. The default analyzers work well for English. The Lucene sandbox contains several other analyzers, including those for Chinese, Japanese, and Korean.
IndexDeletionPolicy
An interface used to customize how stale commits are deleted from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which keeps only the most recent commit and removes all prior commits as soon as a new commit is done.
IndexWriter
The class that creates or maintains an index. Its constructor accepts a Boolean value that determines whether a new index is created or an existing index is opened. It provides methods to add, delete, and update documents in the index.
Changes made to the index are initially buffered in memory and periodically flushed to the index directory. IndexWriter exposes several fields that control how indexes are buffered in memory and written to disk. Changes made to the index are not visible to an IndexReader until the commit or close method of IndexWriter is called. IndexWriter creates a lock file for the directory to prevent index corruption by simultaneous index updates. IndexWriter also lets you specify an optional index deletion policy.
Listing 1. Using Lucene IndexWriter
    // Create the Directory instance where the index files will be stored
    Directory fsDirectory = FSDirectory.getDirectory(indexDirectory);
    // Create the analyzer that will be used to tokenize the input data
    Analyzer standardAnalyzer = new StandardAnalyzer();
    // Create a new index rather than opening an existing one
    boolean create = true;
    // Create the deletion policy
    IndexDeletionPolicy deletionPolicy = new KeepOnlyLastCommitDeletionPolicy();
    IndexWriter indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create,
            deletionPolicy, IndexWriter.MaxFieldLength.UNLIMITED);
Add data to index
Two classes are involved in adding text data to the index: Field and Document.
Field
Represents a piece of data that is queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field needs to be indexed or analyzed and whether its value needs to be stored. These options can be passed when creating a Field instance. Table 2 shows the details of the Field metadata options.
Table 2. Details of Field metadata options
Option | Description
Field.Store.YES | Stores the field value. Suitable for fields displayed with search results, such as file path and URL.
Field.Store.NO | The field value is not stored; for example, the body of an e-mail message.
Field.Index.NO | Suitable for fields that are not searched and are only stored, such as file path.
Field.Index.ANALYZED | The field is indexed and analyzed; for example, the body and subject of an e-mail message.
Field.Index.NOT_ANALYZED | The field is indexed but not analyzed. The original value is preserved as a whole; for example, dates and personal names.
Document
A document is a collection of fields. Lucene also supports boosting documents and fields, which is useful if you want to give more importance to some of the indexed data. Indexing a text file involves wrapping the text data in fields, creating a document, populating it with the fields, and adding the document to the index using IndexWriter.
Listing 2 shows an example of adding data to an index.
Listing 2. Adding data to an index
    /* Step 1. Prepare the data for indexing. Extract the data. */
    String sender = properties.getProperty("sender");
    String date = properties.getProperty("date");
    String subject = properties.getProperty("subject");
    String message = properties.getProperty("message");
    String emaildoc = file.getAbsolutePath();

    /* Step 2. Wrap the data in Fields and add them to a Document. */
    Field senderField = new Field("sender", sender, Field.Store.YES, Field.Index.NOT_ANALYZED);
    Field emaildatefield = new Field("date", date, Field.Store.NO, Field.Index.NOT_ANALYZED);
    Field subjectField = new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED);
    Field messagefield = new Field("message", message, Field.Store.NO, Field.Index.ANALYZED);
    Field emailDocField = new Field("emailDoc", emaildoc, Field.Store.YES, Field.Index.NO);

    Document doc = new Document();
    // Add these fields to a Lucene Document
    doc.add(senderField);
    doc.add(emaildatefield);
    doc.add(subjectField);
    doc.add(messagefield);
    doc.add(emailDocField);

    /* Step 3. Add this document to the Lucene index. */
    indexWriter.addDocument(doc);
Search index data
Searching is the process of looking up words in the index and finding the documents that contain those words. Building search capabilities with Lucene's search API is straightforward. This section covers the primary classes of the Lucene search API.
Searcher
Searcher is an abstract base class with various overloaded search methods. IndexSearcher is a commonly used subclass that allows searching indexes stored in a given directory. The search method returns an ordered collection of documents ranked by computed scores. Lucene calculates a score for each document that matches a given query. IndexSearcher is thread-safe; a single instance can be used by multiple threads concurrently.
Term
A term is the most fundamental unit of searching. It is composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but there they are created by Lucene internally.
Query and subclasses
Query is an abstract base class for queries. Searching for a specified word or phrase involves wrapping it in a term, adding the term to a query object, and passing this query object to IndexSearcher's search method.
Lucene comes with various concrete query implementations, such as TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, and SpanQuery. The following sections discuss the primary query classes of Lucene's query API.
TermQuery
The most basic query type for searching an index. A TermQuery can be constructed from a single term. Term values are matched case-sensitively, although in practice the outcome depends on the analyzer: the terms passed for searching must be consistent with the terms produced when the documents were analyzed, because analyzers perform many operations on the original text before the index is built.
For example, consider the e-mail subject "Job openings for Java Professionals at Bangalore". Assume you indexed it using StandardAnalyzer. If you now search for "Java" using TermQuery, it will not return anything, because StandardAnalyzer normalized this text and converted it to lowercase. If you search for the lowercase word "java", it returns all e-mails that contain this word in the subject field.
Listing 3. Searching using TermQuery
    // Search mails having the word "java" in the subject field
    Searcher indexSearcher = new IndexSearcher(indexDirectory);
    Term term = new Term("subject", "java");
    Query termQuery = new TermQuery(term);
    TopDocs topDocs = indexSearcher.search(termQuery, 10);
RangeQuery
You can use RangeQuery to search within a range. All the terms in the index are arranged in lexicographic order. Lucene's RangeQuery lets you search for terms within a range, which is specified by a starting term and an ending term that are either both included or both excluded.
Listing 4. Searching within a range
    /* RangeQuery example: Search mails from 01/06/2009 to 06/06/2009, both inclusive */
    Term begin = new Term("date", "20090601");
    Term end = new Term("date", "20090606");
    Query query = new RangeQuery(begin, end, true);
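Because range matching works on the lexicographic order of terms, the format of the indexed value matters. The sketch below uses a sorted set from the JDK rather than Lucene to show the effect; the class name TermRangeSketch and its method are hypothetical.

```java
import java.util.NavigableSet;

/** Hypothetical helper: selects terms between two bounds in lexicographic order. */
final class TermRangeSketch {
    /**
     * Returns the terms from begin to end; the inclusive flag applies to both ends.
     * Throws IllegalArgumentException if begin sorts after end.
     */
    static NavigableSet<String> termsInRange(NavigableSet<String> sortedTerms,
                                             String begin, String end, boolean inclusive) {
        return sortedTerms.subSet(begin, inclusive, end, inclusive);
    }
}
```

With dates written as yyyyMMdd, as in the listing above, string order and chronological order agree, so the range "20090601" to "20090606" selects exactly the first six days of June 2009. A format such as dd/MM/yyyy would not sort chronologically as text.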
PrefixQuery
You can use PrefixQuery to search by a prefix. It constructs a query that matches documents containing terms that start with the specified prefix.
Listing 5. Searching using PrefixQuery
    // Search mails having the sender field prefixed by the word 'job'
    PrefixQuery query = new PrefixQuery(new Term("sender", "job"));
BooleanQuery
You can use BooleanQuery to combine any number of query objects into a more powerful query. Each entry consists of a query and a clause associated with that query, which indicates whether the query should occur, must occur, or must not occur. The maximum number of clauses in a BooleanQuery is 1,024 by default. You can call the setMaxClauseCount method to change this limit.
Listing 6. Searching using BooleanQuery
    // Search mails that have both 'java' and 'bangalore' in the subject field
    Query query1 = new TermQuery(new Term("subject", "java"));
    Query query2 = new TermQuery(new Term("subject", "bangalore"));
    BooleanQuery query = new BooleanQuery();
    query.add(query1, BooleanClause.Occur.MUST);
    query.add(query2, BooleanClause.Occur.MUST);
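Conceptually, combining clauses comes down to set operations on the documents matched by each sub-query. The following plain-Java sketch shows only that matching logic and ignores scoring entirely; it is not how Lucene evaluates a BooleanQuery internally, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical set-based view of combining query results; scoring is ignored. */
final class BooleanCombinationSketch {
    /** Documents matched by both sub-queries (both clauses are required). */
    static Set<Integer> allOf(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    /** Documents matched by at least one of the sub-queries. */
    static Set<Integer> anyOf(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    /** Documents in matches that are not in the excluded set (a prohibited clause). */
    static Set<Integer> without(Set<Integer> matches, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(matches);
        result.removeAll(excluded);
        return result;
    }
}
```

Two required clauses, as in the listing above, correspond to allOf: only documents that contain both terms survive.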
PhraseQuery
You can use PhraseQuery to search by phrase. A PhraseQuery matches documents that contain a particular sequence of terms. It uses the positional information of the terms stored in the index. The distance allowed between terms that still counts as a match is called slop. By default, the slop value is zero; it can be changed by calling the setSlop method. PhraseQuery also supports phrases made of multiple terms.
Listing 7. Searching using PhraseQuery
    /* PhraseQuery example: Search mails that have the phrase 'job opening j2ee'
       in the subject field. */
    PhraseQuery query = new PhraseQuery();
    query.setSlop(1);
    query.add(new Term("subject", "job"));
    query.add(new Term("subject", "opening"));
    query.add(new Term("subject", "j2ee"));
WildcardQuery
WildcardQuery implements a wildcard search, which lets you search with patterns such as arch* (matching words such as architect and architecture). Two standard wildcards are used:
- * matches zero or more characters
- ? matches exactly one character
Performance may degrade if the pattern starts with a wildcard, because all the terms in the index must be examined to find matching documents.
Listing 8. Searching using WildcardQuery
    // Search for 'arch*' to find e-mail messages that have the word 'architect' in the subject field
    Query query = new WildcardQuery(new Term("subject", "arch*"));
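The meaning of the two wildcards can be illustrated by translating a pattern into a regular expression. This is a plain-Java sketch of the matching rule only, not of how Lucene enumerates index terms; the class name WildcardSketch and its methods are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical matcher for patterns that use * and ? as wildcards. */
final class WildcardSketch {
    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }
}
```

For instance, matches("arch*", "architect") is true, while matches("arch?", "arch") is false because ? requires one more character.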
FuzzyQuery
You can use FuzzyQuery to search for similar terms, that is, terms that resemble a specified word. The similarity measurement is based on the Levenshtein (edit distance) algorithm. In Listing 9, FuzzyQuery is used to find the closest match for the misspelled word "admnistrtor", even though this word is not indexed.
Listing 9. Searching using FuzzyQuery
    /* Search for e-mails that have a word similar to 'admnistrtor' in the
       subject field. Note that 'admnistrtor' is misspelled here. */
    Query query = new FuzzyQuery(new Term("subject", "admnistrtor"));
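For readers unfamiliar with the Levenshtein distance, here is a minimal plain-Java sketch of the classic dynamic-programming computation. It counts the single-character insertions, deletions, and substitutions needed to turn one string into another. How FuzzyQuery converts such a distance into a similarity score and threshold is not shown, and the class name LevenshteinSketch is hypothetical.

```java
/** Hypothetical helper that computes the Levenshtein (edit) distance between two strings. */
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

The misspelling "admnistrtor" is two insertions away from "administrator", so distance("admnistrtor", "administrator") returns 2.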
QueryParser
QueryParser is useful for parsing query strings entered by users. You can use it to parse a user-entered query expression into a Lucene query object, which can then be passed to IndexSearcher's search method. It can parse rich query expressions. Internally, QueryParser converts the query string entered by the user into one of the concrete Query subclasses. Special characters such as * and ? must be escaped with a backslash (\). You can construct textual Boolean queries using the operators AND, OR, and NOT.
Listing 10. Searching for a user-entered query expression
    QueryParser queryParser = new QueryParser("subject", new StandardAnalyzer());
    // Search for e-mails that contain the words 'job openings' and '.net' and 'pune'
    Query query = queryParser.parse("job openings AND .net AND pune");
Show search results
IndexSearcher returns an array of references to the ranked search results, that is, to the documents that match a given query. You can decide how many top search results to retrieve by specifying the number in IndexSearcher's search method. Custom paging can be built on top of this. You can add a custom Web application or desktop application to display the search results. The primary classes involved in retrieving search results are ScoreDoc and TopDocs.
ScoreDoc
A simple pointer to a document contained in the search results. It encapsulates the position of the document in the index and the score computed by Lucene.
TopDocs
Encapsulates the total number of search results and an array of ScoreDoc.
The following code snippet shows how to retrieve the documents contained in the search results.
Listing 11. Displaying search results
    /* The first parameter is the query to be executed and the second
       parameter indicates the number of search results to fetch */
    TopDocs topDocs = indexSearcher.search(query, 20);
    System.out.println("Total hits " + topDocs.totalHits);

    // Get an array of references to matched documents
    ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
    for (ScoreDoc scoredoc : scoreDocsArray) {
        // Retrieve the matched document and show relevant details
        Document doc = indexSearcher.doc(scoredoc.doc);
        System.out.println("\nSender: " + doc.getField("sender").stringValue());
        System.out.println("Subject: " + doc.getField("subject").stringValue());
        System.out.println("Email file location: " + doc.getField("emailDoc").stringValue());
    }
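One possible way to build custom paging on top of a ranked result list is to fetch enough top results to cover the requested page and then slice out that page. The sketch below shows only the slicing step on a generic list; the class name PagingSketch and its method are hypothetical and not part of Lucene.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical paging helper that slices one page out of a ranked result list. */
final class PagingSketch {
    /**
     * Returns the hits for the zero-based pageIndex, or an empty list if the
     * page lies beyond the end of the results.
     */
    static <T> List<T> page(List<T> rankedHits, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        long from = (long) pageIndex * pageSize;  // long avoids int overflow
        if (from >= rankedHits.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, rankedHits.size());
        return rankedHits.subList((int) from, to);
    }
}
```

Applied to the listing above, the ranked list would be built from the scoreDocs array, and the number of results requested from the search method would need to be at least (pageIndex + 1) * pageSize.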
Basic index operations
Basic index operations include removing and boosting documents.
Remove documents from an index
Applications often need to update the index with the latest data and remove older data. For example, in a Web search engine, the index needs to be updated regularly as new Web pages are added and Web pages that no longer exist need to be removed. Lucene provides the IndexReader API, which lets you perform these operations on an index.
IndexReader is an abstract class that provides various methods to access the index. Lucene internally refers to documents by document number, which can change as documents are added to or deleted from the index. The document number is used to access a document in the index. IndexReader cannot be used to update indexes in a directory for which an IndexWriter is already open. IndexReader always searches the snapshot of the index taken when it was opened. Changes to the index are not visible until the IndexReader is re-opened, so applications that use Lucene need to re-open their IndexReaders to see the latest index updates.
Listing 12. Deleting documents from the index
    // Delete all the mails from the index received in May 2009
    IndexReader indexReader = IndexReader.open(indexDirectory);
    indexReader.deleteDocuments(new Term("month", "05"));
    // Close associated index files and save deletions to disk
    indexReader.close();
Boosting documents and fields
Sometimes you want to give more importance to some of the indexed data. You can do so by setting a boost factor for a document or a field. By default, all documents and fields have a boost factor of 1.0.
Listing 13. Boosting fields
    if (subject.toLowerCase().indexOf("pune") != -1) {
        // Display search results that contain 'pune' in their subject first by setting a boost factor
        subjectField.setBoost(2.2F);
    }
    // Display search results that contain 'job' in their sender e-mail address first
    if (sender.toLowerCase().indexOf("job") != -1) {
        luceneDocument.setBoost(2.1F);
    }
Extended search
Lucene provides an advanced feature called sorting. You can sort search results by fields that indicate the relative position of the documents in the index. The field used for sorting must be indexed but not tokenized. Four kinds of term values may be put into sorting fields: integers, longs, floats, and strings.
Search results can also be sorted by index order. By default, Lucene sorts results by decreasing relevance, that is, by computed score. The sort order can also be changed.
Listing 14. Sorting search results
    /* Search mails having the word 'job' in the subject and return results
       sorted by the sender's e-mail address in descending order. */
    SortField sortField = new SortField("sender", true);
    Sort sortBySender = new Sort(sortField);
    WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
    TopFieldDocs topFieldDocs = indexSearcher.search(query, null, 20, sortBySender);
    // Sorting by index order
    topFieldDocs = indexSearcher.search(query, null, 20, Sort.INDEXORDER);
Filtering is a process that constrains the search space and allows only a subset of documents to be considered for search hits. You can use this feature to implement search-within-search results, or to implement security on top of search results. Lucene comes with various built-in filters, such as BooleanFilter, CachingWrapperFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, RemoteCachingWrapperFilter, and SpanFilter. A Filter can be passed to IndexSearcher's search method to keep only the documents that match the filter criteria.
Listing 15. Filtering search results
    /* Filter the results to show only mails that have the sender field
       prefixed with 'jobs' */
    Term prefix = new Term("sender", "jobs");
    Filter prefixFilter = new PrefixFilter(prefix);
    WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
    indexSearcher.search(query, prefixFilter, 20);
Conclusion
Lucene is a very popular open source search library from Apache. It provides powerful indexing and searching capabilities for applications through a simple, easy-to-use API that requires only a basic understanding of indexing and searching principles. In this article, you learned about the Lucene architecture and its core APIs.
Lucene powers the search capabilities of many well-known websites and organizations. It has also been ported to many other programming languages. Lucene has a large and active technical user community. If you are looking for an easy-to-use, scalable, and high-performance open source search library, Apache Lucene is an excellent choice.
Download
Description | Name | Size
Lucene sample code | os-apache-lucenesearch-SampleApplication.zip | 755KB