3. Effect of Field Configuration
Indexing data takes very little code: two methods are enough. The simplest and most useful class in the indexing process is Field. Next, let's look at the effect of its various settings.
Code 3.1:
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// The two methods below belong to the indexing class.

/// <summary>
/// Index data
/// </summary>
private void Index()
{
    Analyzer analyzer = new StandardAnalyzer();
    // true: create a new index, overwriting any existing one
    IndexWriter writer = new IndexWriter("IndexDirectory", analyzer, true);
    AddDocument(writer, "My motherland", "English Words");
    AddDocument(writer, "Long live Motherland", "English syntax");
    AddDocument(writer, "Motherland", "English Unit");
    AddDocument(writer, "People", "Word Test");
    writer.Optimize();
    writer.Close();
}

/// <summary>
/// Prepare data for the index
/// </summary>
/// <param name="writer">Index writer instance</param>
/// <param name="title">Title to be indexed</param>
/// <param name="content">Content to be indexed</param>
void AddDocument(IndexWriter writer, string title, string content)
{
    Document document = new Document();
    document.Add(new Field("Title", title, Field.Store.YES, Field.Index.TOKENIZED));
    document.Add(new Field("Content", content, Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(document);
}
Code 3.1 is the index preparation process. What happens after it runs? To find out we need a tool. Luke (lukeall) is a Lucene index management tool for the Java platform, and I have implemented a simple .NET version called NLuke. For more information, see the NLuke version update information. The following sections use this tool to analyze the index.
Now we can adjust the parameters passed to the Field constructor in the AddDocument method and see how each change affects the index. The field for Title is used as the example.
3.1 Field.Store option
This option has three values, which are analyzed below.
3.1.1 Field.Store.YES
This is the value used in Code 3.1. Create the index with this option and open it in NLuke: the Title field has eight terms. Switch to the document area and you will see that each document's Title has content. This option means "store the value", so this is the expected result.
3.1.2 Field.Store.NO
Title still has eight terms, but the documents no longer contain the field. In other words, you can still search on this field, but in the search result hits you cannot use the Get method of the Document instance to retrieve the field's content. The field content is simply not stored.
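The behaviour of Field.Store.NO can be checked in code as well as in NLuke. The following is a minimal sketch, assuming the Lucene.Net 2.x API used elsewhere in this article (Hits, Field.Index.TOKENIZED); the class and method names are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class StoreNoSketch
{
    // Hypothetical helper: indexes one document whose Title is searchable
    // but not stored, then returns what the first hit yields for each field.
    public static string[] FieldsOfFirstHit(string indexPath)
    {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        Document document = new Document();
        document.Add(new Field("Title", "Long live Motherland", Field.Store.NO, Field.Index.TOKENIZED));
        document.Add(new Field("Content", "English syntax", Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(document);
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(indexPath);
        try
        {
            // StandardAnalyzer lowercases tokens, hence the lowercase term.
            Hits hits = searcher.Search(new TermQuery(new Term("Title", "motherland")));
            Document hit = hits.Doc(0);
            // Title matched the query but was never stored, so Get("Title")
            // is expected to be null; Content was stored and comes back.
            return new string[] { hit.Get("Title"), hit.Get("Content") };
        }
        finally
        {
            searcher.Close();
        }
    }
}
```

The first element of the returned array should be null and the second should be the stored content, which matches what NLuke shows in the document area.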
3.1.3 Field.Store.COMPRESS
Setting the option to COMPRESS raises an error with the message "compression support not configured". This is a configuration error, thrown by the CheckCompressionSupport method in SupportClass. That method reads a configuration file and creates an instance of the class named there; the class must implement the SupportClass.CompressionSupport.ICompressionAdapter interface. Lucene.Net has a built-in SharpZipLibAdapter, but it is only available when the SHARP_ZIP_LIB compilation symbol is defined. To see the effect, add the SHARP_ZIP_LIB symbol to the project, add an App.config configuration file to the program, and in it add the key Lucene.Net.CompressionLib.class with the value SharpZipLibAdapter. Then download ICSharpCode.SharpZipLib.dll, which contains the actual compression algorithm: http://sourceforge.net/project/downloading.php?groupname=sharpdevelop&filename=sharpziplib_0855_bin.zip&use_mirror=nchc
Reference ICSharpCode.SharpZipLib.dll in the project and the COMPRESS option becomes usable. The observed effect is the same as that of YES.
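As an illustration of the configuration step, an App.config along the following lines may work. This is a sketch only: the key and value are the ones named in the text, while placing them in the appSettings section is an assumption about where Lucene.Net looks them up.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <appSettings>
    <!-- Key and value as given in the text. Assumption: some Lucene.Net
         builds may expect a fully qualified type name as the value. -->
    <add key="Lucene.Net.CompressionLib.class" value="SharpZipLibAdapter" />
  </appSettings>
</configuration>
```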
3.1.4 Comparison of results
Field.Store.YES: the generated index is 627 bytes
Field.Store.COMPRESS: 661 bytes
Field.Store.NO: 579 bytes
Field.Store.COMPRESS takes up the most space, which is the opposite of what one would expect. That is because the text we indexed is too small, so the overhead of compression outweighs the savings. Try indexing more content and compare the sizes again.
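The text does not say how the sizes above were measured. One plausible way, shown here as a sketch with a hypothetical helper name, is to add up the sizes of the files in the index directory after each run.

```csharp
using System.IO;

public static class IndexSizeSketch
{
    // Sums the sizes of all files directly inside the index directory.
    // Lucene keeps its index files flat in one directory, so there is
    // no need to recurse into subdirectories here.
    public static long InBytes(string indexPath)
    {
        long total = 0;
        foreach (string file in Directory.GetFiles(indexPath))
        {
            total += new FileInfo(file).Length;
        }
        return total;
    }
}
```

Calling IndexSizeSketch.InBytes("IndexDirectory") after rebuilding the index with each Field.Store value gives numbers that can be compared directly.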
3.2 Field.Index option
Now fix Field.Store at Field.Store.YES and look at the effect of Field.Index.
3.2.1 Field.Index.TOKENIZED
This option controls word segmentation (tokenization). TOKENIZED means the value is segmented by the analyzer before it is indexed. After running, Title has 8 terms, as expected.
3.2.2 Field.Index.UN_TOKENIZED
After running there are only four terms, and each term is the value exactly as it was written, identical to the complete stored content.
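Because an UN_TOKENIZED value is indexed as one single term, a query has to supply the whole value to match it. The sketch below illustrates this under the assumption of the Lucene.Net 2.x API; the class and method names are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class UnTokenizedSketch
{
    // Indexes one UN_TOKENIZED title and returns two hit counts:
    // [0] for a query on the whole value, [1] for a query on a single word.
    public static int[] CountHits(string indexPath)
    {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        Document document = new Document();
        document.Add(new Field("Title", "Long live Motherland", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(document);
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(indexPath);
        try
        {
            // The whole value, with its original casing, is the only term.
            int whole = searcher.Search(new TermQuery(new Term("Title", "Long live Motherland"))).Length();
            // A single word is not a term of this field, so nothing matches.
            int word = searcher.Search(new TermQuery(new Term("Title", "motherland"))).Length();
            return new int[] { whole, word };
        }
        finally
        {
            searcher.Close();
        }
    }
}
```

With Field.Index.TOKENIZED instead, the situation would be reversed: the single-word query would match and the whole-value term query would not.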
3.2.3 Field.Index.NO
As expected, the Title field has no terms at all.
3.2.4 Field.Index.NO_NORMS
At first glance the effect is the same as that of Field.Index.UN_TOKENIZED, but this option also drops the field's norms, the additional information behind index-time boosting and field-length normalization. This reduces the space used. There is one condition: NO_NORMS must be used for every instance of the field from the very beginning, otherwise it has no effect. For example, if four documents are indexed without NO_NORMS and the next two use NO_NORMS, the setting has no effect on those two documents, because the field already has norms.
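The "all or nothing" condition can be checked with IndexReader.HasNorms. This is a sketch under the assumption that the Lucene.Net 2.x build in use exposes HasNorms(string) on IndexReader; the helper itself is hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class NoNormsSketch
{
    // Indexes one document per entry in indexOptions, each with a Title
    // field using that Field.Index value, and reports whether the
    // resulting index holds norms for Title.
    public static bool TitleHasNorms(string indexPath, params Field.Index[] indexOptions)
    {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        foreach (Field.Index option in indexOptions)
        {
            Document document = new Document();
            document.Add(new Field("Title", "Motherland", Field.Store.YES, option));
            writer.AddDocument(document);
        }
        writer.Close();

        IndexReader reader = IndexReader.Open(indexPath);
        try
        {
            return reader.HasNorms("Title");
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Passing only Field.Index.NO_NORMS values should report no norms, while mixing in a single Field.Index.UN_TOKENIZED value should report that norms are present again.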
3.2.5 Effect analysis
Cases 1, 2, and 4 differ from one another, but in all three the field can be searched. In case 3, where the option is NO, it cannot be searched. In case 1 the value is segmented, so it can be searched by individual words and the results can be sorted. In cases 2 and 4 the value is not segmented, so only the whole value can be matched, and in case 4 the results cannot be sorted (they cannot be ordered by how frequently the word occurs).
We can also see from the above that setting both Field.Store and Field.Index to NO amounts to not adding the field at all; Lucene's Field constructor in fact rejects this combination with an argument exception. Field.Store is for retrieving the complete data, while Field.Index is for searching. In the extreme case you can set Field.Store to NO and keep the field indexed so that it stays searchable. The data itself is then fetched from the data source (such as a database) through some association between index and source, which effectively reduces the load on Lucene.
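The extreme case described above can be sketched as follows. The dictionary stands in for the real data source, the Id field is the association between index and source, and all names are hypothetical; the Lucene.Net 2.x API is assumed.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class ExternalStoreSketch
{
    // 'database' maps a record id to its title and plays the role of the
    // external data source. Returns the titles of all records that match.
    public static List<string> Search(string indexPath, IDictionary<string, string> database, string word)
    {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        foreach (KeyValuePair<string, string> record in database)
        {
            Document document = new Document();
            // Stored only: the key that links a hit back to the data source.
            document.Add(new Field("Id", record.Key, Field.Store.YES, Field.Index.NO));
            // Indexed only: searchable, but the text is not kept in the index.
            document.Add(new Field("Title", record.Value, Field.Store.NO, Field.Index.TOKENIZED));
            writer.AddDocument(document);
        }
        writer.Close();

        List<string> titles = new List<string>();
        IndexSearcher searcher = new IndexSearcher(indexPath);
        try
        {
            Hits hits = searcher.Search(new TermQuery(new Term("Title", word)));
            for (int i = 0; i < hits.Length(); i++)
            {
                string id = hits.Doc(i).Get("Id");
                titles.Add(database[id]); // fetch the full data from the source
            }
        }
        finally
        {
            searcher.Close();
        }
        return titles;
    }
}
```

The index stays small because it holds only the terms and the key, and the complete record always comes from the source that owns it.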
3.3 Field.TermVector
NLuke does not yet show the Field.TermVector option, but you can inspect it with code of your own.
Code 3.3.5.1
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using NUnit.Framework;

[Test]
public void TermVectorTest()
{
    IndexReader reader = IndexReader.Open("IndexDirectory");
    int numDoc = reader.NumDocs();
    for (int i = 0; i < numDoc; i++)
    {
        Console.WriteLine("DOC: #" + i + " ----------------------------");
        Document doc = reader.Document(i);
        // GetField returns null if Title was indexed with Field.Store.NO
        Field field = doc.GetField("Title");
        Console.WriteLine("Indexed: " + field.IsIndexed());
        Console.WriteLine("Stored: " + field.IsStored());
        Console.WriteLine("Term vector positions stored: " + field.IsStorePositionWithTermVector());
        Console.WriteLine("Term vector offsets stored: " + field.IsStoreOffsetWithTermVector());
        Console.WriteLine("Term vector stored: " + field.IsTermVectorStored());
        Console.WriteLine("Tokenized: " + field.IsTokenized());
        Console.WriteLine("--------------------------------------------");
    }
    reader.Close();
}
After setting Field.TermVector, you can use Code 3.3.5.1 to check the effect. Try it for yourself.
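For reference, Field.TermVector is set through the five-argument Field constructor. The sketch below assumes the Lucene.Net 2.x API, and the helper name is hypothetical; it builds the Title field with positions and offsets recorded in the term vector.

```csharp
using Lucene.Net.Documents;

public static class TermVectorSketch
{
    // Builds a Title field that stores a term vector including both
    // token positions and character offsets.
    public static Field TitleWithTermVector(string title)
    {
        return new Field("Title", title,
                         Field.Store.YES,
                         Field.Index.TOKENIZED,
                         Field.TermVector.WITH_POSITIONS_OFFSETS);
    }
}
```

Using this field in AddDocument in place of the original Title field and rebuilding the index should make the three term vector lines in the test output change accordingly. The other values are Field.TermVector.NO, YES, WITH_POSITIONS and WITH_OFFSETS.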