http://blog.sina.com.cn/s/blog_5de48f8b0100dple.html
Lucene is not a ready-made program like a file search tool, a web crawler, or a website search engine. It is a software library and development kit rather than a full-featured search application: it focuses only on text indexing and searching, and lets you add indexing and search capabilities to your own applications. Many applications build their search on Lucene, for example the search function of the Eclipse help system.
Lucene uses an inverted index. An inverted index maintains a table of words/phrases; for each entry in the table, a list records which documents contain that word/phrase. With this structure, the results for a query can be found quickly.
Once documents have been indexed, you can search the index. The search engine first parses the search keywords, then looks them up in the index, and finally returns the documents associated with the keywords the user entered.
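The inverted-index idea above can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's actual data structure; whitespace tokenization and lowercase normalization are assumed:

```java
import java.util.*;

// Minimal sketch of an inverted index: term -> list of document ids.
// Lucene's real implementation is far more elaborate (positions,
// compressed postings, skip lists, etc.).
public class InvertedIndexDemo {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    public int addDocument(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // append the doc id once, even if the term repeats in the doc
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    // Look up the posting list: which documents contain this term?
    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexDemo index = new InvertedIndexDemo();
        index.addDocument("Lucene is a search library");
        index.addDocument("Eclipse help uses Lucene");
        System.out.println(index.search("Lucene")); // [0, 1]
    }
}
```

Because the per-term lists are precomputed at indexing time, answering a query is a single map lookup instead of a scan over all documents.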
Today, in the Chuanzhi Podcast course, instructor Tang taught us how to implement a simple Lucene search engine that can search a large number of documents for different needs. The following is my summary.
--------------------------------------------------------------------------------
1. Prepare the environment: add the jar packages
lucene-core-2.4.0.jar (core);
lucene-analyzers-2.4.0.jar (analyzers);
lucene-highlighter-2.4.0.jar (highlighting);
--------------------------------------------------------------------------------
2. Construct an IndexWriter. IndexWriter is the core class Lucene uses to create indexes. Use the constructor IndexWriter(Directory d, Analyzer a, MaxFieldLength mfl); if the index does not exist yet, it is created.
* Parameter description
<1> Directory represents the storage location of a Lucene index. It is an abstract class with two commonly used implementations: FSDirectory, an index stored in the file system, and RAMDirectory, an index held in memory.
<2> Analyzer: before a document is indexed, its content must first be tokenized, and that is the Analyzer's job. Analyzer is an abstract class with multiple implementations; choose the one appropriate for your language and application. The Analyzer hands the tokenized content to the IndexWriter to build the index.
<3> MaxFieldLength: limits the size of a field, letting you systematically truncate very large field values. With a value of 10000, only the first 10000 terms (keywords) of each field are indexed; everything beyond that is not indexed by Lucene and cannot be searched.
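To make the truncation behavior concrete, here is a plain-Java sketch. It assumes simple whitespace tokenization, and the method name indexedTerms is illustrative, not a Lucene API:

```java
import java.util.*;

// Sketch of what maxFieldLength implies: only the first N terms of a
// field are kept, so anything beyond the cutoff can never match a query.
public class MaxFieldLengthDemo {
    public static List<String> indexedTerms(String fieldValue, int maxFieldLength) {
        String[] terms = fieldValue.toLowerCase().split("\\s+");
        int keep = Math.min(terms.length, maxFieldLength);
        return new ArrayList<>(Arrays.asList(terms).subList(0, keep));
    }

    public static void main(String[] args) {
        List<String> kept = indexedTerms("alpha beta gamma delta", 2);
        System.out.println(kept);                   // [alpha, beta]
        System.out.println(kept.contains("delta")); // false: truncated away
    }
}
```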
--------------------------------------------------------------------------------
3. Create the index with IndexWriter.addDocument(Document doc).
* Parameter description
<1> Document describes the structure of a Lucene document. Any data to be indexed must first be converted into a Document object. A Document is the basic unit of indexing and searching, and it is a collection of Fields.
<2> Field: an element of a document that describes one of its attributes. For example, the title and body of an email can each be described by a Field object. A Field consists of a name and a value; the value accepts only strings (other types must be converted to strings first). When constructing a Field, specify its Store and Index settings.
Field.Store: specifies whether, and how, the field value is stored.
Store.NO: not stored.
Store.YES: stored.
Store.COMPRESS: stored compressed (useful when the value is large, but weigh the efficiency cost).
Field.Index: specifies whether, and how, the field is indexed.
Index.NO: not indexed (an unindexed field cannot be searched).
Index.ANALYZED (TOKENIZED in earlier versions): tokenized, then indexed.
Index.NOT_ANALYZED (UN_TOKENIZED in earlier versions): indexed without tokenization (the whole field value becomes a single term).
Note: after finishing index operations, you must call IndexWriter.close().
--------------------------------------------------------------------------------
4. Delete from the index:
IndexWriter.deleteDocuments(Term term);
All documents in the index that contain the given term are deleted. Term is the basic unit of search: it represents a keyword within a particular field.
--------------------------------------------------------------------------------
5. Update the index:
IndexWriter.updateDocument(Term term, Document doc);
This actually deletes first and then re-adds: if several documents match the term, only the single updated document remains afterwards.
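The delete-then-add semantics can be illustrated with a toy map-based "index". All names here are hypothetical; this is not Lucene code:

```java
import java.util.*;

// Sketch of updateDocument(term, doc) semantics: first delete every
// document that matches the term, then add the single new document.
public class UpdateDemo {
    // id-term -> documents carrying that term (duplicates are possible)
    private final Map<String, List<String>> byTerm = new HashMap<>();

    public void add(String idTerm, String doc) {
        byTerm.computeIfAbsent(idTerm, k -> new ArrayList<>()).add(doc);
    }

    // update = deleteDocuments(term) + addDocument(doc):
    // even if several docs matched, exactly one remains afterwards
    public void update(String idTerm, String doc) {
        byTerm.remove(idTerm);
        add(idTerm, doc);
    }

    public List<String> matching(String idTerm) {
        return byTerm.getOrDefault(idTerm, Collections.emptyList());
    }

    public static void main(String[] args) {
        UpdateDemo index = new UpdateDemo();
        index.add("id:1", "old version");
        index.add("id:1", "stray duplicate");
        index.update("id:1", "new version");
        System.out.println(index.matching("id:1")); // [new version]
    }
}
```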
--------------------------------------------------------------------------------
6. Search:
Use the IndexSearcher class. The search method is:
IndexSearcher.search(Query query, Filter filter, int n);
Parameter description:
<1> Query: the query object. The query string entered by the user is wrapped into a Query that Lucene can process.
<2> Filter: used to filter the search results.
<3> The third parameter (int) is the maximum number of documents to return.
The result is a TopDocs object; TopDocs.scoreDocs holds the hits.
ScoreDoc.doc gives a document's internal number.
IndexSearcher.doc(int) retrieves the corresponding Document by that number.
A Query can be produced by having a QueryParser parse the query string. The constructor is QueryParser(String defaultFieldName, Analyzer a): the first argument is the default field to query, and the second is the analyzer to use (it must be of the same kind as the analyzer used when building the index, or matches may not be found). Call the parse(String) method to parse the query text.
Related code:

List<Document> docs = new ArrayList<Document>();
IndexSearcher indexSearcher = null;
try {
    indexSearcher = new IndexSearcher(dir);
    Filter filter = null;
    int nDocs = 10000;
    TopDocs topDocs = indexSearcher.search(query, filter, nDocs);
    System.out.println("total [" + topDocs.totalHits + "] matching records");
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        ScoreDoc scoreDoc = topDocs.scoreDocs[i];
        int docNum = scoreDoc.doc;                // internal document number in the index
        Document doc = indexSearcher.doc(docNum); // retrieve the document by its number
        docs.add(doc);
    }
    return new SearchResult(topDocs.totalHits, docs);
} catch (Exception e) {
    throw new RuntimeException(e);
} finally {
    try {
        indexSearcher.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
--------------------------------------------------------------------------------
7. As an exercise, implement the IndexDao's add, delete, modify, and query methods; the implementation must pass the unit tests in LuceneIndexDaoTest. The test code is not listed one by one here because there is a lot of it.
--------------------------------------------------------------------------------
8. Some important classes and terms:
<1> Directory:
a) FSDirectory: stores the index on disk.
b) RAMDirectory: keeps the index in memory. Very fast, but the in-memory index is lost when the JVM exits. Before the JVM exits, you can transfer the in-memory index to the file system through another IndexWriter built on an FSDirectory.
The corresponding API is IndexWriter.addIndexesNoOptimize(Directory[]). Note that this call must come after the RAMDirectory's IndexWriter has been closed, so that every document has actually entered the RAMDirectory. A RAMDirectory can be created with the no-argument constructor, or with the constructor that takes a file path, which loads the on-disk index at that path into memory.
<2> Relevance sorting: by default Lucene sorts search results by relevance, that is, by document score. Lucene has a scoring mechanism that evaluates each hit against certain criteria and orders the results by score.
a) A document's score depends on the keywords the user entered and is computed at query time; it is influenced by where the keywords occur in the document and how often.
b) Boost can be used to influence the ordering of Lucene's results. Setting a document boost changes the document's weight and thus the order of query results: Document.setBoost(float). The default is 1.0f; the larger the value, the higher the score.
c) A boost can also be specified per field at query time. When the same term appears in different fields with different boosts, the documents containing it score differently.
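A small sketch of how a document boost skews ranking. This is a simplification (Lucene's real scoring formula has many more factors) and the raw scores below are made up:

```java
import java.util.*;

// Toy ranking: final score = raw relevance score * boost (default 1.0f),
// so a doc with a smaller raw score can still outrank one with a larger.
public class BoostDemo {
    // each value is {rawScore, boost}; returns doc names, best first
    public static List<String> rank(Map<String, float[]> rawScoreAndBoost) {
        List<String> names = new ArrayList<>(rawScoreAndBoost.keySet());
        names.sort((a, b) -> Float.compare(
                rawScoreAndBoost.get(b)[0] * rawScoreAndBoost.get(b)[1],
                rawScoreAndBoost.get(a)[0] * rawScoreAndBoost.get(a)[1]));
        return names;
    }

    public static void main(String[] args) {
        Map<String, float[]> docs = new LinkedHashMap<>();
        docs.put("docA", new float[] {0.8f, 1.0f}); // default boost
        docs.put("docB", new float[] {0.5f, 2.0f}); // boosted: 0.5 * 2.0 = 1.0
        System.out.println(rank(docs)); // [docB, docA]
    }
}
```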
<3> Analyzer: a tokenizer that splits a text resource, by some rule, into the smallest units (keywords) that can be indexed.
For English the split is easy: break on punctuation and whitespace. Chinese word segmentation is much harder; for example, "the People's Republic of China" can be segmented into "China", "people", and "republic", but should not produce "Chinese [person]", which does not fit the semantics.
Common Chinese word segmentation approaches:
a) Single-character segmentation: split the text one Chinese character at a time, so a sentence like "we are Chinese" yields one token per character. (StandardAnalyzer)
b) Bigram segmentation: split into overlapping two-character tokens, so "we are Chinese" yields every pair of adjacent characters. (CJKAnalyzer)
c) Dictionary-based segmentation: build candidate words by some algorithm and match them against a dictionary; whatever matches is split out as a word. This is generally considered the ideal approach for Chinese: "we are Chinese" then yields "we" \ "are" \ "Chinese".
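The first two approaches are easy to sketch in plain Java. The class and method names are illustrative, and dictionary-based segmentation is omitted because it needs a word list:

```java
import java.util.*;

// Sketch of the first two segmentation strategies: one token per
// character (StandardAnalyzer's effective behavior for CJK text) and
// overlapping two-character tokens (CJKAnalyzer's bigram approach).
public class CjkTokenizerDemo {
    public static List<String> singleChars(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(singleChars("中国人")); // [中, 国, 人]
        System.out.println(bigrams("中国人"));     // [中国, 国人]
    }
}
```

Bigram segmentation produces more tokens than dictionary segmentation, but it needs no dictionary and still lets multi-character words be found as phrases of adjacent bigrams.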
<4> Highlighter: highlights the matched keywords.
It has three main parts:
1) Formatter: use SimpleHTMLFormatter, specifying the highlight prefix and suffix in the constructor.
2) Scorer: use QueryScorer.
3) Fragmenter: use SimpleFragmenter, specifying in the constructor the length of the fragment that surrounds the keyword.
Call the getBestFragment(Analyzer a, String fieldName, String text) method to produce the highlighted text. (If the highlighted field contains no keyword, null is returned.)
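What the three parts cooperate to produce can be approximated with plain string operations. This is an illustration only; bestFragment is a hypothetical name, not Lucene's API:

```java
// Toy highlighter: pick a fragment of fixed length around the first hit
// (the Fragmenter's job), then wrap each occurrence of the keyword in a
// prefix and suffix (the Formatter's job, like SimpleHTMLFormatter's
// default <B>...</B>). Returns null when the keyword is absent, mirroring
// getBestFragment.
public class HighlightDemo {
    public static String bestFragment(String text, String keyword,
                                      int fragSize, String pre, String post) {
        int hit = text.toLowerCase().indexOf(keyword.toLowerCase());
        if (hit < 0) {
            return null; // no keyword in the field -> null
        }
        int start = Math.max(0, hit - (fragSize - keyword.length()) / 2);
        int end = Math.min(text.length(), start + fragSize);
        String fragment = text.substring(start, end);
        return fragment.replaceAll("(?i)" + java.util.regex.Pattern.quote(keyword),
                                   pre + "$0" + post);
    }

    public static void main(String[] args) {
        System.out.println(bestFragment("Lucene is a search library",
                                        "search", 20, "<B>", "</B>"));
    }
}
```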
<5> Query: can be obtained by parsing a query string with QueryParser, or constructed directly through the API.
QueryParser is not case-sensitive. Common queries are as follows:
1) TermQuery: query by term (keyword). The term's value should be a final indexed keyword, and English letters must be lowercase.
TermQuery(Term t);
Syntax: propertyName:keyword
2) RangeQuery: range query.
RangeQuery(Term lowerTerm, Term upperTerm, boolean inclusive);
Syntax: propertyName:[lower TO upper] (including lower and upper)
Syntax: propertyName:{lower TO upper} (excluding lower and upper)
3) PrefixQuery: prefix query.
PrefixQuery(Term prefix);
Syntax: propertyName:prefix*
4) WildcardQuery: wildcard query. "?" matches exactly one character and "*" matches zero or more characters. (A wildcard may not appear in the first position.)
WildcardQuery(Term term);
Syntax: propertyName:chars*chars?chars
5) MultiFieldQueryParser: queries over multiple fields at once.
QueryParser queryParser = new MultiFieldQueryParser(String[] fields, Analyzer a);
Query query = queryParser.parse(String queryText);
6) BooleanQuery: a compound query; other queries can be added to it together with their logical relationship.
TermQuery q1 = new TermQuery(new Term("title", "javadoc"));
TermQuery q2 = new TermQuery(new Term("content", "term"));
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(q1, Occur.MUST);
boolQuery.add(q2, Occur.MUST_NOT);
Occur.MUST: the clause must match.
Occur.MUST_NOT: the clause must not match.
Occur.SHOULD: a single SHOULD clause must match; several SHOULD clauses are in an OR relationship.
Syntax: + - AND NOT OR (the operators must be uppercase).
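The three Occur modes can be pictured as set algebra over per-clause result sets. This is a simplification that ignores scoring and is not Lucene code:

```java
import java.util.*;

// Sketch of how MUST / SHOULD / MUST_NOT combine per-clause result sets:
// MUST clauses intersect (AND), SHOULD clauses alone union (OR), and
// MUST_NOT clauses subtract (NOT). With a MUST present, SHOULD clauses
// would only influence scoring, which this toy ignores.
public class BooleanCombineDemo {
    public static Set<Integer> combine(List<Set<Integer>> must,
                                       List<Set<Integer>> should,
                                       List<Set<Integer>> mustNot) {
        Set<Integer> result;
        if (!must.isEmpty()) {
            result = new TreeSet<>(must.get(0));
            for (Set<Integer> s : must) result.retainAll(s);  // AND
        } else {
            result = new TreeSet<>();
            for (Set<Integer> s : should) result.addAll(s);   // OR
        }
        for (Set<Integer> s : mustNot) result.removeAll(s);   // NOT
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> title = new TreeSet<>(Arrays.asList(1, 2, 3));  // docs matching q1
        Set<Integer> content = new TreeSet<>(Arrays.asList(3, 4));   // docs matching q2
        // q1 MUST, q2 MUST_NOT -> {1, 2}
        System.out.println(combine(Arrays.asList(title),
                                   Collections.<Set<Integer>>emptyList(),
                                   Arrays.asList(content)));
    }
}
```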