According to the Lucene home page (http://lucene.apache.org/), the current Lucene release is 4.9.0. The book I am learning from explains version 2.1, and the code examples here are written against version 3.0.2.
Because of the version differences, some methods are used differently, but the basics are the same.
JAR package used by the source code: Lucene version 3.0.2
References:
1. company internal training materials
2. Tian nun, an authoritative Lucene search engine development classic.
Lucene is easy to use. You can learn it by patiently reading the material above, along with the source code.
I. Basic Indexing Methods
The basic architecture and principles of open-source search engines are all similar, and Lucene is no exception. Building a search engine with it comes down to solving four basic problems: crawling data, parsing data, creating indexes, and performing searches.
1. Understand the process of creating an index
The reference book uses a metaphor for creating an index:
Creating an index is like writing a collection of articles. The collection contains many articles, and each article has a title, content, name, and time of writing.
We write the collection like this: write each article first, then assemble the articles.
First, put together the title, content, and writing time of an article to produce that article.
Then add each article to the book, and the collection is written.
In the structure of the collection (the book, then its articles, then the information in each article), going from left to right corresponds to reading the collection: open the book, then read the articles in it. Going from right to left corresponds to writing the collection.
The process of creating an index is as follows:
(1) Create an IndexWriter, which is equivalent to the framework of a book.
(2) Create a Document object, which is equivalent to an article.
(3) Create a Field object, which is equivalent to one piece of information in an article (title, body, and so on).
(4) Add the Field to the Document.
(5) Add the Document to the IndexWriter.
(6) Close the IndexWriter.
The index structure has the same shape as the collection: from left to right it is read (that is, searched), and from right to left it is created:
Follow these steps to create an index:
(1) Create Fields to package the different pieces of information in an article.
(2) Organize multiple Fields into a Document, which packages one article.
(3) Organize multiple Documents into one IndexWriter, that is, assemble multiple articles to form an index.
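The analogy can be sketched in plain Java before touching Lucene itself. The classes below (Article, Book) and the sample contents are hypothetical and only mirror the three steps above. They are not Lucene classes, and the linear scan used for reading is much simpler than Lucene's real inverted index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the book analogy; these classes are hypothetical and are NOT Lucene's.
public class CollectionSketch {

    // An "article": a set of named pieces of information (title, content, ...).
    static class Article {
        final Map<String, String> fields = new LinkedHashMap<String, String>();

        // Step (1): package one piece of information as a named field.
        Article add(String name, String value) {
            fields.put(name, value);
            return this;
        }
    }

    // The "book": an ordered list of articles.
    static class Book {
        final List<Article> articles = new ArrayList<Article>();

        // Step (3): assemble articles into the book.
        void add(Article article) {
            articles.add(article);
        }

        // Reading direction: open the book, then look through each article.
        List<Article> findByFieldContaining(String name, String text) {
            List<Article> hits = new ArrayList<Article>();
            for (Article a : articles) {
                String value = a.fields.get(name);
                if (value != null && value.contains(text)) {
                    hits.add(a);
                }
            }
            return hits;
        }
    }

    static Book buildSampleBook() {
        Book book = new Book();
        // Step (2): several fields together make up one article.
        book.add(new Article().add("title", "Steel world").add("content", "An article about steel."));
        book.add(new Article().add("title", "Iron warrior").add("content", "An article about iron."));
        return book;
    }

    public static void main(String[] args) {
        Book book = buildSampleBook();
        for (Article a : book.findByFieldContaining("title", "Steel")) {
            System.out.println(a.fields.get("title"));
        }
    }
}
```

In Lucene the same roles are played by Field, Document, and IndexWriter.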
The next three sections walk through these basic steps one at a time.
2. Create a field
There are many ways to create a Field. The most common one is:
Field field = new Field(field name, field content, storage method, indexing method);
The four parameters have the following meanings:
(1) The field name is the name of the field, similar to a column name in a database table.
(2) The field content is the content of the field, similar to a column value in a database table.
(3) There are three storage methods: no storage (Field.Store.NO), full storage (Field.Store.YES), and compressed storage (Field.Store.COMPRESS). Field.Store.COMPRESS comes from the 2.1 book and is no longer available in version 3.0.2.
In general, you can use full storage if you are not worried about the index becoming too large. For performance, however, the smaller the index files the better. So if the field content is small, store it in full.
If there is a lot of field content, such as body text, leave it unstored or store it compressed.
(4) There are four indexing methods:
No index (Field.Index.NO); index without analysis and without norms (Field.Index.NO_NORMS); index without word segmentation (Field.Index.UN_TOKENIZED); and word segmentation followed by indexing (Field.Index.TOKENIZED). These are the 2.1 names from the book. In version 3.0.2 the last three are called Field.Index.NOT_ANALYZED_NO_NORMS, Field.Index.NOT_ANALYZED, and Field.Index.ANALYZED.
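As a rough illustration of how the storage and indexing parameters combine, here is a minimal sketch using the version 3.0.2 constant names. The field names "content" and "id" and the sample values are made up for this example, and which combination suits which field is my own assumption, not a rule from the book.

```java
import org.apache.lucene.document.Field;

public class FieldOptionsSketch {

    // Short value: store it in full and index it after analysis (word segmentation).
    static Field titleField(String title) {
        return new Field("bookname", title, Field.Store.YES, Field.Index.ANALYZED);
    }

    // Long value: index it for search, but do not keep the original text in the index.
    // The field name "content" is hypothetical.
    static Field bodyField(String body) {
        return new Field("content", body, Field.Store.NO, Field.Index.ANALYZED);
    }

    // Identifier-like value: store it and index it as one single term, without analysis.
    // The field name "id" is hypothetical.
    static Field idField(String id) {
        return new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED);
    }

    public static void main(String[] args) {
        Field[] fields = { titleField("Steel world"), bodyField("a long body text"), idField("book-7") };
        for (Field f : fields) {
            System.out.println(f.name() + ": stored=" + f.isStored()
                    + ", indexed=" + f.isIndexed() + ", tokenized=" + f.isTokenized());
        }
    }
}
```

Printing the flags is a quick way to confirm that a field was created with the options you intended.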
3. Create a document
The document creation method is as follows:
Document doc = new Document();
This creates an empty Document without any Field.
To add a Field to the Document, just call the add method, for example: doc.add(field);
Call it repeatedly to add multiple Fields to one Document.
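A small sketch of a Document holding more than one Field, assuming the version 3.0.2 API. The "author" field and the sample values are hypothetical and only there to show repeated calls to add().

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocumentSketch {

    // Builds one Document with two Fields; the "author" field is made up for this example.
    static Document bookDocument(String name, String author) {
        Document doc = new Document();
        doc.add(new Field("bookname", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("author", author, Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = bookDocument("Steel world", "unknown");
        // Number of Fields added to the Document.
        System.out.println(doc.getFields().size());
        // Value of the first Field named "bookname".
        System.out.println(doc.get("bookname"));
    }
}
```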
4. Create an IndexWriter
There are also many ways to create an IndexWriter. A common one is:
    File indexDir = new File("E:\\Index");
    Directory dir = new SimpleFSDirectory(indexDir);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
Introduction to the parameters:
(1) Directory (the directory type) is an abstract class.
Its direct subclasses differ between versions. In version 3.0.2 they include FSDirectory, RAMDirectory, FileSwitchDirectory, and CompoundFileReader. These subclasses represent different directory types.
FSDirectory: a path in the file system. The index is written directly to disk.
RAMDirectory: a region in memory. Its content disappears when the VM exits, so to keep the index you need to transfer the content of the RAMDirectory to an FSDirectory.
FileSwitchDirectory: a directory that can read files from two different directories at the same time. It is a proxy class.
CompoundFileReader: reads compound files. It can only read files with the .cfs extension (CompoundFileWriter is used to write them). CompoundFileReader is referenced only in SegmentReader.
FSDirectory comes in three kinds:
A. SimpleFSDirectory, typically used on Windows
B. NIOFSDirectory, which uses NIO, typically used on Linux
C. MMapDirectory, a memory-mapped directory
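One possible way to move an in-memory index to disk is sketched below, assuming the version 3.0.2 API. It builds a one-document index in a RAMDirectory and merges it into an on-disk index with addIndexesNoOptimize(). The temporary directory name and the sample document are made up for this example. FSDirectory.open() is used here instead of naming a concrete subclass, so that Lucene picks an FSDirectory implementation for the current platform.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RamToDiskSketch {

    // Builds a one-document index in memory, then merges it into an on-disk index.
    // Returns the number of documents in the on-disk index.
    static int indexInMemoryThenPersist(File indexDir) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        Directory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        doc.add(new Field("bookname", "Steel world", Field.Store.YES, Field.Index.ANALYZED));
        ramWriter.addDocument(doc);
        ramWriter.close(); // commit the in-memory index before merging it

        // Let Lucene choose the FSDirectory implementation for this platform.
        Directory fsDir = FSDirectory.open(indexDir);
        IndexWriter fsWriter = new IndexWriter(fsDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
        try {
            fsWriter.addIndexesNoOptimize(ramDir); // merge the RAM index into the disk index
            return fsWriter.numDocs();
        } finally {
            fsWriter.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location under the system temp directory.
        File indexDir = new File(System.getProperty("java.io.tmpdir"), "lucene-ram-to-disk-sketch");
        System.out.println(indexInMemoryThenPersist(indexDir));
    }
}
```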
(2) Analyzer is an abstract class.
It analyzes the various input data sources, which includes filtering and word segmentation.
An analyzer performs lexical analysis, and there are analyzers for English and for Chinese. Choose an appropriate analyzer based on the files to be indexed. Commonly used ones are StandardAnalyzer (the standard analyzer), CJKAnalyzer (binary segmentation), and ChineseAnalyzer (the Chinese analyzer).
You can also write your own analyzer to handle different languages (not that I know how to do that yet).
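To give an idea of what binary segmentation means, here is a toy sketch in plain Java. It is not the CJKAnalyzer implementation, only an illustration of the idea of cutting text into overlapping two-character tokens. The sample string in main is the four Chinese characters 中华人民, written as Unicode escapes so the file stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of binary (bigram) segmentation; NOT Lucene's CJKAnalyzer implementation.
public class BigramSketch {

    // Splits text into overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character becomes a single token
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four Chinese characters, given as Unicode escapes.
        for (String token : bigrams("\u4e2d\u534e\u4eba\u6c11")) {
            System.out.println(token);
        }
    }
}
```

For a four-character input this yields three overlapping tokens, so a query for any adjacent pair of characters can match without a dictionary.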
After the IndexWriter is created, use addDocument() to put a Document object into it: writer.addDocument(doc);
You can add multiple documents this way.
Finally, call the close() method to close the index writer: writer.close();
At this point, the procedure for creating an index is complete. Next, let's look at an example:
II. Create a simple index example:
    import java.io.File;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneMainProcess {

        public static void main(String[] args) {
            createLuceneIndex();
        }

        public static void createLuceneIndex() {
            try {
                File indexDir = new File("E:\\Index");
                Directory dir = new SimpleFSDirectory(indexDir);
                Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
                // MaxFieldLength is the maximum number of terms taken from a field when a
                // document is indexed. For LIMITED this value is 10000.
                IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

                // Create eight documents.
                Document doc1 = new Document();
                Document doc2 = new Document();
                Document doc3 = new Document();
                Document doc4 = new Document();
                Document doc5 = new Document();
                Document doc6 = new Document();
                Document doc7 = new Document();
                Document doc8 = new Document();

                // One stored, analyzed "bookname" field per document.
                Field f1 = new Field("bookname", "How steel is made", Field.Store.YES, Field.Index.ANALYZED);
                Field f2 = new Field("bookname", "Heroes", Field.Store.YES, Field.Index.ANALYZED);
                Field f3 = new Field("bookname", "Awesome women and dogs", Field.Store.YES, Field.Index.ANALYZED);
                Field f4 = new Field("bookname", "Women are made of water", Field.Store.YES, Field.Index.ANALYZED);
                Field f5 = new Field("bookname", "my brother and daughter", Field.Store.YES, Field.Index.ANALYZED);
                Field f6 = new Field("bookname", "White Haired Girl", Field.Store.YES, Field.Index.ANALYZED);
                Field f7 = new Field("bookname", "Steel world", Field.Store.YES, Field.Index.ANALYZED);
                Field f8 = new Field("bookname", "Iron warrior", Field.Store.YES, Field.Index.ANALYZED);

                doc1.add(f1);
                doc2.add(f2);
                doc3.add(f3);
                doc4.add(f4);
                doc5.add(f5);
                doc6.add(f6);
                doc7.add(f7);
                doc8.add(f8);

                indexWriter.addDocument(doc1);
                indexWriter.addDocument(doc2);
                indexWriter.addDocument(doc3);
                indexWriter.addDocument(doc4);
                indexWriter.addDocument(doc5);
                indexWriter.addDocument(doc6);
                indexWriter.addDocument(doc7);
                indexWriter.addDocument(doc8);

                // Optimizing the index improves search speed, but it costs memory, disk
                // space, and time, so optimize only when appropriate, not all the time.
                indexWriter.optimize();
                // Close the index writer. Otherwise the index data stays in the buffer and is
                // not written to disk, and the directory lock may not be released.
                indexWriter.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
After running it, we find an Index directory on the E: drive, which contains the index files we created.
At this point, the index has been created.
Tomorrow we will go through reading the index, and then we will be able to check directly whether the index created above is correct.