Search for text using Apache Lucene

From http://www.ibm.com/developerworks/cn/opensource/os-apache-lucenesearch/

Introduction

Lucene is an open source, highly scalable text search-engine library available from the Apache Software Foundation. You can use Lucene in commercial and open source applications. Lucene's powerful APIs focus mainly on text indexing and searching. It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, and database search. Web sites such as Wikipedia, TheServerSide, jGuru, and LinkedIn all use Lucene.

Lucene also provides search capabilities for the Eclipse IDE, Nutch (the well-known open source Web search engine), and companies such as IBM®, AOL, and Hewlett-Packard. Lucene has been ported to many other programming languages, including Perl, Python, C++, and .NET. At the time of writing, the latest version of Lucene for the Java™ programming language is v2.4.1.

Lucene features:

  • Has powerful, accurate, and efficient search algorithms.
  • Calculates a score for each document that matches a given query and returns the most relevant documents ranked by score.
  • Supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, and BooleanQuery.
  • Supports parsing of rich query expressions entered by users.
  • Allows you to extend search behavior using custom sorting, filtering, and query expression parsing.
  • Uses a file-based locking mechanism to prevent concurrent index modifications.
  • Allows searching and indexing simultaneously.

Use Lucene to build applications

As shown in Figure 1, building a full-featured search application with Lucene primarily involves indexing data, searching data, and displaying search results.

Figure 1. Steps to build an application using Lucene

This article uses code snippets from a sample application developed with Lucene v2.4.1 and Java technology. The sample application indexes a set of e-mail documents stored in properties files and shows how to use Lucene's query API to search an index. The example also familiarizes you with basic index operations.

Index data

Lucene lets you index any data available in textual format. It can be used with almost any data source as long as textual information can be extracted from it. You can use Lucene to index and search data stored in HTML documents, Microsoft® Word documents, PDF files, and more. The first step in indexing data is to convert it into a simple text format. You can do this with custom parsers and data converters.

Indexing Process

Indexing converts text data into a format that facilitates rapid searching. A simple analogy is the index at the back of a book, which shows you where each topic appears in the book.

Lucene stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files. Most Web search engines use an inverted index. It lets you perform fast keyword lookups to find the documents that match a given query. Before text data is added to the index, it is processed by an analyzer (using an analysis process).
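To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is not Lucene code and says nothing about Lucene's on-disk format; the class name InvertedIndexSketch and its methods are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (illustration only, not Lucene's implementation):
// each term maps to the sorted ids of the documents that contain it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split it on whitespace, and record docId under every term.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A keyword lookup is one map access instead of a scan over every document.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Job openings for Java professionals");
        index.addDocument(1, "Java user group meeting");
        index.addDocument(2, "Lunch menu");
        System.out.println(index.search("java")); // prints [0, 1]
    }
}
```

Because terms are lowercased when they are added, a lookup for "Java" with a capital J finds nothing in this sketch, which mirrors why query terms must match the form produced at indexing time.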

Analysis

Analysis converts text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, and so on. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms to the Lucene index.
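The operations above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not one of Lucene's analyzers: the class AnalysisSketch and its tiny stop-word list are assumptions made for the example, and stemming (reducing words to their root form) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified analysis chain: tokenize, drop punctuation, lowercase, remove stop words.
class AnalysisSketch {
    // Illustrative stop list (an assumption; real analyzers ship their own lists).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "at", "for", "the"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Splitting on runs of non-letter, non-digit characters discards punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase();
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [job, openings, java, professionals, bangalore]
        System.out.println(analyze("Job openings for Java Professionals at Bangalore!"));
    }
}
```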

Lucene comes with various built-in analyzers, such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, and SnowballAnalyzer. They differ in the way they tokenize the text and apply filters. Because analysis removes words before indexing, it decreases index size, but it can reduce the precision of query processing. You can gain more control over the analysis process by creating custom analyzers from the basic building blocks that Lucene provides. Table 1 shows some of the built-in analyzers and the way they process data.

Table 1. Built-in Lucene analyzers
Analyzer            Operations on text data
WhitespaceAnalyzer  Splits tokens at whitespace
SimpleAnalyzer      Splits text at non-letter characters and converts it to lowercase
StopAnalyzer        Removes stop words (words not useful for retrieval) and converts text to lowercase
StandardAnalyzer    Tokenizes text based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, alphanumerics, and more
                    Converts text to lowercase
                    Removes stop words

Core indexing classes
Directory
An abstract class that represents the location where index files are stored. There are two commonly used subclasses:
  • FSDirectory: an implementation of Directory that stores indexes in the actual file system. This is useful for large indexes.
  • RAMDirectory: an implementation that stores all indexes in memory. It is suitable for smaller indexes that can be fully loaded in memory and destroyed when the application terminates. Because the index is held in memory, it is comparatively fast.
Analyzer
As discussed, analyzers are responsible for preprocessing text data and converting it into tokens stored in the index. IndexWriter accepts an analyzer that is used to tokenize data before it is indexed. To index text correctly, use an analyzer that is appropriate for the language of the text.

The default analyzers work well for English. The Lucene sandbox contains several other analyzers, including those for Chinese, Japanese, and Korean.

IndexDeletionPolicy
An interface used to implement a policy that customizes the deletion of stale commits from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which keeps only the most recent commit and immediately removes all prior commits after a new commit is done.
IndexWriter
A class that either creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or an existing index is opened. It provides methods to add, delete, or update documents in the index.

Changes made to the index are initially buffered in memory and periodically flushed to the index directory. IndexWriter exposes several settings that control how indexes are buffered in memory and written to disk. Changes made to the index are not visible to IndexReader unless the commit or close method of IndexWriter is called. IndexWriter creates a lock file for the directory to prevent index corruption from simultaneous index updates. IndexWriter also lets you specify an optional index deletion policy.

List 1. Use Lucene IndexWriter
    // Create an instance of Directory where index files will be stored
    Directory fsDirectory = FSDirectory.getDirectory(indexDirectory);
    /* Create an instance of the analyzer, which will be used to tokenize
       the input data */
    Analyzer standardAnalyzer = new StandardAnalyzer();
    // Create a new index
    boolean create = true;
    // Create the instance of the deletion policy
    IndexDeletionPolicy deletionPolicy = new KeepOnlyLastCommitDeletionPolicy();
    indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create,
        deletionPolicy, IndexWriter.MaxFieldLength.UNLIMITED);
Add data to index

Adding text data to an index involves two classes.

Field represents a piece of data that is queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field needs to be indexed or analyzed and whether its value needs to be stored. These options can be passed when a field instance is created. The table below shows the details of the Field metadata options.

Table 2. Details of Field metadata options
Option                    Description
Field.Store.YES           Stores the field value. Suitable for fields displayed with search results, such as file paths and URLs.
Field.Store.NO            Does not store the field value. Suitable for data such as the body of an e-mail message.
Field.Index.NO            Suitable for fields that are not searched and are only stored, such as file paths.
Field.Index.ANALYZED      Used for fields that are indexed and analyzed, such as the body and subject of an e-mail message.
Field.Index.NOT_ANALYZED  Used for fields that are indexed but not analyzed. It preserves the field's original value in its entirety, as with dates and personal names.

Document is a collection of fields. Lucene also supports boosting documents and fields, which is useful when you want to give more importance to some of the indexed data. Indexing a text file involves wrapping the text data in fields, creating a document, populating it with the fields, and adding the document to the index using IndexWriter.

List 2 shows an example of adding data to an index.

List 2. Add data to the index
    /* Step 1. Prepare the data for indexing. Extract the data. */
    String sender = properties.getProperty("sender");
    String date = properties.getProperty("date");
    String subject = properties.getProperty("subject");
    String message = properties.getProperty("message");
    String emaildoc = file.getAbsolutePath();

    /* Step 2. Wrap the data in Fields and add them to a Document */
    Field senderField =
        new Field("sender", sender, Field.Store.YES, Field.Index.NOT_ANALYZED);
    Field emaildatefield =
        new Field("date", date, Field.Store.NO, Field.Index.NOT_ANALYZED);
    Field subjectField =
        new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED);
    Field messagefield =
        new Field("message", message, Field.Store.NO, Field.Index.ANALYZED);
    Field emailDocField =
        new Field("emailDoc", emaildoc, Field.Store.YES, Field.Index.NO);

    Document doc = new Document();
    // Add these fields to a Lucene Document
    doc.add(senderField);
    doc.add(emaildatefield);
    doc.add(subjectField);
    doc.add(messagefield);
    doc.add(emailDocField);

    // Step 3. Add this document to the Lucene index
    indexWriter.addDocument(doc);

Search index data

Searching is the process of looking up words in the index and finding the documents that contain those words. Building search capabilities with Lucene's search API is straightforward. This section discusses the primary classes of the Lucene search API.

Searcher

Searcher is an abstract base class that has various overloaded search methods. IndexSearcher is a commonly used subclass that lets you search indexes stored in a given directory. The search method returns a collection of documents ranked by score. Lucene calculates a score for each document that matches a given query. IndexSearcher is thread-safe; a single instance can be used by multiple threads concurrently.

Term

Term is the most fundamental unit of searching. It is composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but there they are created internally by Lucene.

Query and subclasses

Query is an abstract base class for queries. Searching for a specified word or phrase involves wrapping it in a term, adding the term to a query object, and passing the query object to the search method of IndexSearcher.

Lucene comes with various types of concrete query implementations, such as TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, and SpanQuery. The sections below discuss the primary query classes of Lucene's query API.

TermQuery
The most basic query type for searching an index. A TermQuery is constructed from a single term. Term values are matched exactly, so matching is case-sensitive at the term level; whether a search behaves that way in practice depends on how the analyzer normalized the text at indexing time. Note that the terms passed for searching should be consistent with the terms produced by the analysis of documents, because analyzers perform many operations on the original text before the index is built.

For example, consider the e-mail subject "Job openings for Java Professionals at Bangalore". Assume you indexed it using StandardAnalyzer. If you search for "Java" using TermQuery, nothing is returned, because StandardAnalyzer normalized this text and converted it to lowercase. If you search for the lowercase word "java", it returns all the e-mail messages that contain this word in the subject field.

List 3. Search using TermQuery
    // Search mails having the word "java" in the subject field
    Searcher indexSearcher = new IndexSearcher(indexDirectory);
    Term term = new Term("subject", "java");
    Query termQuery = new TermQuery(term);
    TopDocs topDocs = indexSearcher.search(termQuery, 10);
RangeQuery
You can search within a range using RangeQuery. All the terms in the index are arranged lexicographically. Lucene's RangeQuery lets you search for terms within a range. The range is specified with a starting term and an ending term, which may both be either included or excluded.

List 4. Search within a range
    /* RangeQuery example: Search mails from 01/06/2009 to 06/06/2009,
       both inclusive */
    Term begin = new Term("date", "20090601");
    Term end = new Term("date", "20090606");
    Query query = new RangeQuery(begin, end, true);
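The reason a date range works on plain strings is that terms written as yyyyMMdd sort lexicographically in the same order as the dates they represent. The sketch below shows that idea with a sorted set from the JDK; it is not how RangeQuery is implemented, and TermRangeSketch and termsInRange are hypothetical names.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Range lookup over lexicographically sorted terms (illustration only).
class TermRangeSketch {
    // Returns the terms between lower and upper; 'inclusive' applies to both ends.
    static NavigableSet<String> termsInRange(NavigableSet<String> terms,
                                             String lower, String upper,
                                             boolean inclusive) {
        return terms.subSet(lower, inclusive, upper, inclusive);
    }

    public static void main(String[] args) {
        NavigableSet<String> dates = new TreeSet<>(Arrays.asList(
                "20090531", "20090601", "20090603", "20090606", "20090610"));
        // prints [20090601, 20090603, 20090606]
        System.out.println(termsInRange(dates, "20090601", "20090606", true));
        // prints [20090603]
        System.out.println(termsInRange(dates, "20090601", "20090606", false));
    }
}
```

A format such as dd/MM/yyyy would not sort chronologically as text, which is why the listing above stores dates as 20090601 rather than 01/06/2009.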
PrefixQuery
You can search by prefix using PrefixQuery, which constructs a query that matches documents containing terms that start with a specified prefix.

List 5. Search using PrefixQuery
    // Search mails having sender field prefixed by the word 'job'
    PrefixQuery query = new PrefixQuery(new Term("sender", "job"));
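Since terms are kept in sorted order, all terms that share a prefix sit next to each other, so a prefix lookup can jump to the first candidate and stop at the first term that no longer matches. The following plain-Java sketch illustrates this under that assumption; PrefixSketch and the sample addresses are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Prefix lookup over sorted terms (illustration only, not Lucene code).
class PrefixSketch {
    static List<String> withPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the block of matching terms
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> senders = new TreeSet<>(Arrays.asList(
                "hr@example.com", "job-alerts@example.com",
                "jobs@example.com", "john@example.com"));
        // prints [job-alerts@example.com, jobs@example.com]
        System.out.println(withPrefix(senders, "job"));
    }
}
```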
BooleanQuery
You can construct powerful queries by combining any number of query objects using BooleanQuery. Each query is added together with a clause that indicates whether the query should occur, must occur, or must not occur. By default, the maximum number of clauses in a BooleanQuery is 1,024. You can change the maximum by calling the setMaxClauseCount method.

List 6. Search using BooleanQuery
    // Search mails that have both 'java' and 'bangalore' in the subject field
    Query query1 = new TermQuery(new Term("subject", "java"));
    Query query2 = new TermQuery(new Term("subject", "bangalore"));
    BooleanQuery query = new BooleanQuery();
    query.add(query1, BooleanClause.Occur.MUST);
    query.add(query2, BooleanClause.Occur.MUST);
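One way to picture the three clause types is as set operations on the document ids matched by each sub-query. The sketch below is a simplification in plain Java: it ignores scoring entirely, and the class BooleanCombineSketch and its method names are hypothetical rather than part of Lucene.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Boolean clauses pictured as set operations on matching document ids.
class BooleanCombineSketch {
    // MUST + MUST: only documents matched by both queries (intersection).
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // SHOULD + SHOULD: documents matched by either query (union).
    static Set<Integer> should(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // MUST + MUST_NOT: documents matched by a but not by b (difference).
    static Set<Integer> mustNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> java = new TreeSet<>(Arrays.asList(0, 1, 4));
        Set<Integer> bangalore = new TreeSet<>(Arrays.asList(1, 2, 4));
        System.out.println(must(java, bangalore));    // prints [1, 4]
        System.out.println(should(java, bangalore));  // prints [0, 1, 2, 4]
        System.out.println(mustNot(java, bangalore)); // prints [0]
    }
}
```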
PhraseQuery
You can search by phrase using PhraseQuery. A PhraseQuery matches documents that contain a particular sequence of terms. It uses the position information of the terms stored in the index. The allowed distance between terms that are still considered a match is called slop. By default the slop is zero, and it can be changed by calling the setSlop method. PhraseQuery also supports multi-term phrases.

List 7. Search using PhraseQuery
    /* PhraseQuery example: Search mails that have the phrase 'job opening j2ee'
       in the subject field. */
    PhraseQuery query = new PhraseQuery();
    query.setSlop(1);
    query.add(new Term("subject", "job"));
    query.add(new Term("subject", "opening"));
    query.add(new Term("subject", "j2ee"));
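The effect of slop can be illustrated with a deliberately simplified model in plain Java. The sketch shifts each matched position by the term's place in the phrase and accepts the match when the spread of the shifted positions is at most the slop. This approximates the idea of positional matching and is not a reproduction of Lucene's sloppy phrase scoring; PhraseMatchSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

// Simplified positional phrase matching with slop (illustration only).
class PhraseMatchSketch {
    // tokens: analyzed document terms in order; phrase: query terms in order.
    static boolean matches(List<String> tokens, List<String> phrase, int slop) {
        return search(tokens, phrase, slop, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Tries every position of phrase term i, tracking min and max of (position - i).
    private static boolean search(List<String> tokens, List<String> phrase,
                                  int slop, int i, int min, int max) {
        if (i == phrase.size()) {
            return true; // every phrase term was placed within the allowed spread
        }
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (!tokens.get(pos).equals(phrase.get(i))) {
                continue;
            }
            int shifted = pos - i; // equal for all terms when the phrase is exact
            int newMin = Math.min(min, shifted);
            int newMax = Math.max(max, shifted);
            if (newMax - newMin <= slop
                    && search(tokens, phrase, slop, i + 1, newMin, newMax)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> subject = Arrays.asList("job", "opening", "for", "j2ee");
        List<String> phrase = Arrays.asList("job", "opening", "j2ee");
        System.out.println(matches(subject, phrase, 0)); // prints false
        System.out.println(matches(subject, phrase, 1)); // prints true
    }
}
```

With a slop of zero the extra word "for" breaks the phrase; allowing a slop of one tolerates that single-position gap.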
WildcardQuery
A WildcardQuery implements a wildcard search, which lets you search with patterns such as arch* (to find documents containing words such as architect and architecture). Two standard wildcards are used:
  • * matches zero or more characters
  • ? matches exactly one character
Performance may degrade if you search with a pattern that begins with a wildcard, because all the terms in the index must be examined to find matching documents.

List 8. Search using WildcardQuery
    // Search for 'arch*' to find e-mail messages that have the word 'architect'
    // in the subject field
    Query query = new WildcardQuery(new Term("subject", "arch*"));
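The meaning of the two wildcards can be checked with a small plain-Java sketch that translates a wildcard pattern into a regular expression. This is only one way to express the matching rules and is not how Lucene evaluates a WildcardQuery; WildcardSketch is a hypothetical helper.

```java
import java.util.regex.Pattern;

// Wildcard matching expressed through java.util.regex (illustration only).
class WildcardSketch {
    // '*' becomes ".*" (zero or more characters), '?' becomes "." (exactly one);
    // every other character is quoted so it matches literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("arch*", "architecture")); // prints true
        System.out.println(matches("arch*", "march"));        // prints false
        System.out.println(matches("te?t", "test"));          // prints true
        System.out.println(matches("te?t", "tet"));           // prints false
    }
}
```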
FuzzyQuery
You can search for similar terms with FuzzyQuery, which matches words that are similar to the specified word. The similarity measurement is based on the Levenshtein (edit distance) algorithm. In List 9, FuzzyQuery is used to find the closest match for the misspelled word "admnistrtor", even though this word is not in the index.

List 9. Search using FuzzyQuery
    /* Search for emails that have a word similar to 'admnistrtor' in the
       subject field. Note we have misspelled admnistrtor here. */
    Query query = new FuzzyQuery(new Term("subject", "admnistrtor"));
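For reference, the Levenshtein distance between two words is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. Below is a minimal textbook implementation in plain Java; it shows the measure only and makes no claim about how FuzzyQuery turns the distance into a similarity score. LevenshteinSketch is a hypothetical class name.

```java
// Classic dynamic-programming Levenshtein distance using two rows.
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b: j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn a's prefix into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Two letters are missing from the misspelled word, so this prints 2.
        System.out.println(distance("admnistrtor", "administrator"));
    }
}
```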
QueryParser
QueryParser is useful for parsing query strings entered by people. You can use it to parse a user-entered query expression into a Lucene query object, which can then be passed to the search method of IndexSearcher. It can parse rich query expressions. Internally, QueryParser converts the query string into one of the concrete query subclasses. Special characters such as * and ? must be escaped with a backslash (\). You can construct Boolean queries textually using the operators AND, OR, and NOT.

List 10. Search for a user-entered query expression
    QueryParser queryParser = new QueryParser("subject", new StandardAnalyzer());
    // Search for emails that contain the words 'job openings' and '.net' and 'pune'
    Query query = queryParser.parse("job openings AND .net AND pune");

Show search results

IndexSearcher returns an array of references to ranked search results, that is, the documents that match a given query. You decide how many of the top search results to retrieve by specifying the number in the search method of IndexSearcher. Customized paging can be built on top of this. You can add a custom Web application or desktop application to display the search results. The primary classes involved in retrieving search results are ScoreDoc and TopDocs.

ScoreDoc
A simple pointer to a document contained in the search results. It encapsulates the position of the document in the index and the score computed by Lucene.
TopDocs
Encapsulates the total number of search results and an array of ScoreDoc.

The following code snippet shows how to retrieve documents contained in search results.

List 11. Display search results
    /* The first parameter is the query to be executed and the
       second parameter indicates the number of search results to fetch */
    TopDocs topDocs = indexSearcher.search(query, 20);
    System.out.println("Total hits " + topDocs.totalHits);

    // Get an array of references to matched documents
    ScoreDoc[] scoreDosArray = topDocs.scoreDocs;
    for (ScoreDoc scoredoc : scoreDosArray) {
        // Retrieve the matched document and show relevant details
        Document doc = indexSearcher.doc(scoredoc.doc);
        System.out.println("\nSender: " + doc.getField("sender").stringValue());
        System.out.println("Subject: " + doc.getField("subject").stringValue());
        System.out.println("Email file location: "
            + doc.getField("emailDoc").stringValue());
    }
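Asking for only the top 20 hits means the full result set never has to be sorted. A common way to keep the best n items from a stream is a small min-heap, sketched below in plain Java. This is an illustration of the general technique and not a description of Lucene's internals; TopHitsSketch and its Hit class are hypothetical stand-ins for the document number and score that ScoreDoc carries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Keeping only the n best-scoring hits with a bounded min-heap (illustration only).
class TopHitsSketch {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static List<Hit> top(List<Hit> hits, int n) {
        Comparator<Hit> byScore = (x, y) -> Float.compare(x.score, y.score);
        // The heap root is always the weakest hit that is currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (Hit hit : hits) {
            heap.add(hit);
            if (heap.size() > n) {
                heap.poll(); // discard the lowest-scoring hit
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(byScore.reversed()); // best score first
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit(0, 0.2f), new Hit(1, 0.9f),
                new Hit(2, 0.5f), new Hit(3, 0.7f));
        for (Hit hit : top(hits, 2)) {
            System.out.println(hit.doc + " " + hit.score);
        }
    }
}
```

Paging can be layered on the same idea: to show page p with k results per page, request the top p * k hits and display only the last k of them.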
Basic index operations

Basic index operations include removing and boosting documents.

Remove documents from an index

Applications often need to update the index with the latest data and remove older data. For example, in a Web search engine the index must be updated regularly, because new Web pages are added and pages that no longer exist need to be removed. Lucene provides the IndexReader API, which lets you perform these operations on an index.

IndexReader is an abstract class that provides various methods to access the index. Internally, Lucene refers to documents by document numbers, which can change as documents are added to or deleted from the index. The document number is used to access a document in the index. IndexReader cannot be used to update an index in a directory for which an IndexWriter is already open. IndexReader always searches the snapshot of the index taken when it was opened. Changes to the index are not visible until the IndexReader is re-opened, so applications that use Lucene must re-open their IndexReader to see the latest index updates.

List 12. Delete documents from the index
    // Delete all the mails from the index received in May 2009
    IndexReader indexReader = IndexReader.open(indexDirectory);
    indexReader.deleteDocuments(new Term("month", "05"));
    // Close associated index files and save deletions to disk
    indexReader.close();

Boosting documents and fields

Sometimes you may want to give more importance to some of the indexed data. You can do so by setting a boost factor for a document or a field. By default, all documents and fields have a boost factor of 1.0.

List 13. Boost fields
    if (subject.toLowerCase().indexOf("pune") != -1) {
        // Display search results that contain pune in their subject first
        // by setting a boost factor
        subjectField.setBoost(2.2F);
    }
    // Display search results that contain 'job' in their sender email address
    if (sender.toLowerCase().indexOf("job") != -1) {
        luceneDocument.setBoost(2.1F);
    }

Extending search

Lucene provides an advanced feature called sorting. You can sort search results by fields that indicate the relative position of the documents in the index. A field used for sorting must be indexed but not tokenized. Four kinds of term values may be placed in sorting fields: integers, longs, floats, and strings.

Search results can also be sorted by index order. By default, Lucene sorts results by decreasing relevance, that is, by computed score. The sort order can also be reversed.

List 14. Sort search results
    /* Search mails having the word 'job' in the subject and return results
       sorted by the sender's email in descending order. */
    SortField sortField = new SortField("sender", true);
    Sort sortBySender = new Sort(sortField);
    WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
    TopFieldDocs topFieldDocs =
        indexSearcher.search(query, null, 20, sortBySender);
    // Sorting by index order
    topFieldDocs = indexSearcher.search(query, null, 20, Sort.INDEXORDER);
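Sorting by a field instead of by score amounts to ordering the hits with a comparator on that field's value, optionally reversed. The plain-Java sketch below shows the equivalent operation on an in-memory list; SortHitsSketch and its Mail class are hypothetical and stand in for matched documents with a single untokenized sender value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Ordering hits by a field value instead of by relevance (illustration only).
class SortHitsSketch {
    static final class Mail {
        final String sender;
        final String subject;

        Mail(String sender, String subject) {
            this.sender = sender;
            this.subject = subject;
        }
    }

    // reverse = true gives descending order, like the second SortField argument.
    static List<Mail> sortBySender(List<Mail> hits, boolean reverse) {
        Comparator<Mail> bySender = (x, y) -> x.sender.compareTo(y.sender);
        List<Mail> sorted = new ArrayList<>(hits); // leave the input untouched
        sorted.sort(reverse ? bySender.reversed() : bySender);
        return sorted;
    }

    public static void main(String[] args) {
        List<Mail> hits = Arrays.asList(
                new Mail("jobs@example.com", "job fair"),
                new Mail("alice@example.com", "job referral"),
                new Mail("zed@example.com", "jobs in pune"));
        for (Mail mail : sortBySender(hits, true)) {
            System.out.println(mail.sender);
        }
    }
}
```

The comparison only works as expected when each document has exactly one value for the sort field, which is why the field must be indexed without being tokenized.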

Filtering is a process that constrains the search space and allows only a subset of documents to be considered for search hits. You can use this feature to implement search within search results, or to implement security on top of search results. Lucene comes with various built-in filters, such as BooleanFilter, CachingWrapperFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, RemoteCachingWrapperFilter, and SpanFilter. A Filter can be passed to the search method of IndexSearcher to restrict the results to documents that match the filter criteria.

List 15. Filter search results
    /* Filter the results to show only mails that have the sender field
       prefixed with 'jobs' */
    Term prefix = new Term("sender", "jobs");
    Filter prefixFilter = new PrefixFilter(prefix);
    WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
    indexSearcher.search(query, prefixFilter, 20);

Conclusion

Lucene is a very popular open source search library from Apache. It gives applications powerful indexing and searching capabilities through a simple, easy-to-use API that requires only a basic understanding of indexing and searching principles. In this article, you learned about the Lucene architecture and its core APIs.

Lucene powers search for many well-known Web sites and organizations. It has also been ported to many other programming languages. Lucene has a large and active technical user community. If you need an easy-to-use, scalable, high-performance open source search library, Apache Lucene is an excellent choice.

Download
Description         Name                                          Size
Lucene sample code  os-apache-lucenesearch-SampleApplication.zip  755KB
