http://gaowenming.iteye.com/blog/937963
Create an index
To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. Their roles are described below.
Document
A Document describes one unit of content to be indexed, such as an HTML page, an email, or a text file. A Document object consists of multiple Field objects. You can think of a Document as a record in a database table, with each Field as one column of that record.
Field
The Field object is used to describe a certain attribute of a document. For example, the title and content of an email can be described by two Field objects.
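As a rough mental model, the record-and-column analogy can be sketched in plain Java. This is not Lucene's own Document or Field class; the names ToyField, ToyDocument, and ToyDocumentDemo are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these are NOT Lucene's Document and Field classes.
final class ToyField {
    final String name;
    final String value;

    ToyField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

final class ToyDocument {
    private final List<ToyField> fields = new ArrayList<ToyField>();

    // Append a field, much like filling in one column of a record.
    void add(ToyField field) {
        fields.add(field);
    }

    // Return the value of the first field with the given name, or null if absent.
    String get(String name) {
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }
}

class ToyDocumentDemo {
    public static void main(String[] args) {
        // An email described by two fields: title and content.
        ToyDocument email = new ToyDocument();
        email.add(new ToyField("title", "Meeting notes"));
        email.add(new ToyField("content", "Hello Beijing"));
        System.out.println(email.get("title")); // prints "Meeting notes"
    }
}
```

In Lucene itself each Field also carries storage and indexing options, which this sketch leaves out.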
Analyzer
Before a document is indexed, its content must first be split into terms (word segmentation). This is the job of the Analyzer. Analyzer is an abstract class with multiple implementations; choose the one that suits the language and application. The Analyzer hands the segmented content to the IndexWriter, which builds the index.
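To make the idea of word segmentation concrete, here is a minimal sketch in plain Java. It is not a Lucene Analyzer, and the name ToyAnalyzer is hypothetical. It assumes text whose words are separated by spaces or punctuation; Chinese text has no such separators, which is why a dictionary-based segmenter such as IK is used for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy tokenizer only: NOT a Lucene Analyzer implementation.
class ToyAnalyzer {
    // Split on anything that is not a letter or digit, then lower-case each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, Beijing!")); // prints [hello, beijing]
    }
}
```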
IndexWriter
IndexWriter is the core class Lucene uses to create an index. Its job is to add Document objects to the index.
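Conceptually, adding a document to an index means recording, for each term, which documents contain it (an inverted index). The sketch below shows that idea in plain Java under simplifying assumptions: whitespace tokenization, in-memory storage, and no scoring. ToyIndexWriter and its methods are hypothetical names, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index only: NOT Lucene's IndexWriter.
class ToyIndexWriter {
    // term -> IDs of the documents that contain it, in insertion order
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();
    private int nextDocId = 0;

    // Tokenize on whitespace, record every term, and return the assigned document ID.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(term, ids);
            }
            // Record each document at most once per term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    // Return the IDs of the documents containing the term (empty list if none).
    List<Integer> docsFor(String term) {
        List<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Integer>() : ids;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument("Hello Beijing");
        writer.addDocument("Hello world");
        System.out.println(writer.docsFor("hello")); // prints [0, 1]
    }
}
```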
Directory
Directory represents the storage location of a Lucene index. It is an abstract class with two commonly used implementations: FSDirectory, which stores the index in the file system, and RAMDirectory, which keeps the index in memory.
With these classes in mind, we can build an index. The example below tokenizes a sample string with the IK analyzer and then adds one document to an index stored on disk.
Example:
Java code
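    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    // Members of the example class (Lucene 3.x-era API with IK Analyzer).

    // IK tokenizer
    private Analyzer analyzer = new IKAnalyzer(false);
    private Document document;
    private IndexWriter writer;
    private static File indexFile = new File("D:\\Index");

    // The method name is illustrative; the original shows only the method body.
    public void createIndex() throws IOException {
        // Word segmentation: print each term the analyzer produces.
        TokenStream tokenStream = analyzer.reusableTokenStream("text",
                new StringReader("People's Republic of China"));
        TermAttribute term = tokenStream.getAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(term.term());
        }

        // Create an index
        /**
         * FSDirectory.open(indexFile): the path where the index files are stored.
         * analyzer: the tokenizer applied to analyzed fields.
         * true creates a new index; false appends to an index that already exists.
         * IndexWriter.MaxFieldLength.LIMITED caps the number of terms indexed per field.
         * For example, new MaxFieldLength(2) indexes only the first two terms of a field.
         * Generally IndexWriter.MaxFieldLength.LIMITED is used.
         */
        writer = new IndexWriter(FSDirectory.open(indexFile), analyzer, false,
                IndexWriter.MaxFieldLength.LIMITED);
        document = new Document();
        /**
         * Create a Field and add it to the document.
         * name: the field name ("content"); value: the field value ("Hello Beijing");
         * store: whether the value is stored; index: how it is indexed (ANALYZED = tokenized).
         */
        document.add(new Field("content", "Hello Beijing",
                Field.Store.YES, Field.Index.ANALYZED));
        // Sorting on "ID" later needs one term per document; Field.Index.NOT_ANALYZED would guarantee that.
        document.add(new Field("ID", "3", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(document);
        writer.close();
    }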
TokenStream holds the output of word segmentation. Calling incrementToken() repeatedly steps through the stream and yields each term in turn.
Search documents
Searching with Lucene is as convenient as creating an index. In the section above we built an index; now we want to find the documents in it that contain a keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits (removed in Lucene 3.0 in favor of TopDocs, which the example below uses). The roles of the main classes are described below.
Query
Query is an abstract class with multiple implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the query string entered by the user in a form that Lucene can execute.
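The relationship between the abstract query type and its implementations can be pictured with a small plain-Java sketch. It is only an analogy: ToyQuery, ToyTermQuery, ToyPrefixQuery, and ToyAndQuery are hypothetical names, and matching against a set of terms is a simplification of what Lucene does against an index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: NOT Lucene's Query hierarchy.
interface ToyQuery {
    boolean matches(Set<String> docTerms);
}

// Matches documents containing exactly this term (compare TermQuery).
final class ToyTermQuery implements ToyQuery {
    private final String term;

    ToyTermQuery(String term) {
        this.term = term;
    }

    public boolean matches(Set<String> docTerms) {
        return docTerms.contains(term);
    }
}

// Matches documents containing any term with this prefix (compare PrefixQuery).
final class ToyPrefixQuery implements ToyQuery {
    private final String prefix;

    ToyPrefixQuery(String prefix) {
        this.prefix = prefix;
    }

    public boolean matches(Set<String> docTerms) {
        for (String t : docTerms) {
            if (t.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}

// Matches only if every clause matches (compare a BooleanQuery whose clauses are all required).
final class ToyAndQuery implements ToyQuery {
    private final ToyQuery[] clauses;

    ToyAndQuery(ToyQuery... clauses) {
        this.clauses = clauses;
    }

    public boolean matches(Set<String> docTerms) {
        for (ToyQuery q : clauses) {
            if (!q.matches(docTerms)) {
                return false;
            }
        }
        return true;
    }
}

class ToyQueryDemo {
    public static void main(String[] args) {
        Set<String> terms = new HashSet<String>(Arrays.asList("hello", "beijing"));
        ToyQuery q = new ToyAndQuery(new ToyTermQuery("hello"), new ToyPrefixQuery("bei"));
        System.out.println(q.matches(terms)); // prints true
    }
}
```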
IndexSearcher
IndexSearcher searches an index that has already been created. It opens the index in read-only mode, so multiple IndexSearcher instances can work on the same index at once.
Sort
To sort results, instantiate a Sort object and pass it as a parameter to the searcher's search method. Sorting is based on the fields of the documents: results can be ordered by the values of one or more fields, in several ways.
A Sort can be specified in two main ways. One is to pass the name of the document field as a string. The other is to pass a SortField object, which wraps the field name together with its type and sort direction.
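The effect of sorting on an integer-typed field, ascending or descending, can be sketched in plain Java as follows. This is an illustration of the behavior, not of how Lucene implements it; ToySort and sortByIntField are hypothetical names, and each hit is modeled as a simple map from field name to stored value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Lucene's Sort or SortField.
class ToySort {
    // Return a copy of hits ordered by an integer-valued field.
    // reverse = false sorts ascending, reverse = true sorts descending.
    static List<Map<String, String>> sortByIntField(
            List<Map<String, String>> hits, String field, boolean reverse) {
        Comparator<Map<String, String>> byField =
                Comparator.comparingInt(h -> Integer.parseInt(h.get(field)));
        List<Map<String, String>> sorted = new ArrayList<>(hits);
        sorted.sort(reverse ? byField.reversed() : byField);
        return sorted;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = Arrays.asList(
                Collections.singletonMap("ID", "3"),
                Collections.singletonMap("ID", "1"),
                Collections.singletonMap("ID", "2"));
        System.out.println(sortByIntField(hits, "ID", false)); // prints [{ID=1}, {ID=2}, {ID=3}]
    }
}
```

Parsing the value as an integer matters: compared as strings, "10" would sort before "2".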
TopDocs
TopDocs holds the search results: the set of documents that match the query conditions.
Highlighter
Highlighter marks up the matched terms. When a document matches the keyword, the matching text in the returned fragment is wrapped in highlighting tags.
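The basic effect of highlighting, wrapping each occurrence of a keyword in a pair of tags, can be sketched in plain Java. This is a simplification and not Lucene's Highlighter: it matches a literal keyword case-insensitively rather than scoring fragments against a query. ToyHighlighter and highlight are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: NOT Lucene's Highlighter.
class ToyHighlighter {
    // Wrap every case-insensitive occurrence of keyword in preTag and postTag.
    static String highlight(String text, String keyword, String preTag, String postTag) {
        if (keyword.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the keyword as literal text, not as a regular expression.
        Pattern pattern = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // quoteReplacement keeps '$' and '\' in the tags or the match from being interpreted.
            matcher.appendReplacement(out,
                    Matcher.quoteReplacement(preTag + matcher.group() + postTag));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Hello Beijing", "beijing", "<b>", "</b>"));
        // prints Hello <b>Beijing</b>
    }
}
```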
Java code
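    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleFragmenter;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
    import org.apache.lucene.store.FSDirectory;
    import org.wltea.analyzer.lucene.IKQueryParser;
    import org.wltea.analyzer.lucene.IKSimilarity;

    // The method name is illustrative; analyzer and indexFile are the members declared in the indexing example.
    // Message is the author's own result class and is not shown in the original.
    public void search() throws Exception {
        /**
         * The Query object is built by IK.
         */
        Query query = IKQueryParser.parse("content", "Beijing");
        // Index searcher
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(indexFile));
        searcher.setSimilarity(new IKSimilarity());
        /**
         * Sort the search results. SortField(field, type, reverse):
         * field: field name; type: SortField type constant; reverse: false = ascending, true = descending.
         */
        Sort sort = new Sort();
        sort.setSort(new SortField("ID", SortField.INT, false));
        TopDocs docs = searcher.search(query, null, 10, sort);
        System.out.println("Number of matched documents: " + docs.totalHits);
        ScoreDoc[] s = docs.scoreDocs;
        SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(
                "<b><font color='red'>", "</font></b>");
        Highlighter highlighter = new Highlighter(simpleHtmlFormatter, new QueryScorer(query));
        // 100 is the size, in characters, of the text fragment returned around the keyword.
        highlighter.setTextFragmenter(new SimpleFragmenter(100));
        List<Message> list = new ArrayList<Message>();
        for (int i = 0; i < s.length; i++) {
            Document d = searcher.doc(s[i].doc);
            System.out.println(d.get("ID"));
            System.out.println(d.get("content"));
            String content = highlighter.getBestFragment(analyzer, "content",
                    d.get("content"));
            System.out.println(content);
            list.add(new Message(d.get("ID"), content));
        }
        System.out.println(list.size());
        searcher.close();
    }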