Reprinted. Article source: http://blog.csdn.net/akunshenjk/archive/2008/03/28/2226040.aspx
About a year ago I first came across Lucene.net, whose current version at the time was 1.9.1.004. While I was learning the framework, it was announced that it would be commercialized and that later versions might not be released. Later, for reasons I do not know, a new open-source version (2.0.0.004) appeared. I first took up Lucene.net, also about a year ago, in order to improve the efficiency of our website search. Because Lucene was ported from the Java open-source framework, most of the related material was written for Java, and the documentation was basically all in English, so it was hard to learn at the time. After two weeks of hard work I finally figured out the temperament of Lucene.net and wrote the corresponding program. However, in less than a week we found that the program did not improve search efficiency, and my colleagues decided to drop Lucene.net and use other methods.
Recently the data volume of our websites has grown larger and larger, and search efficiency has become a big problem, so I came back to Lucene.net. This time, of course, I am using the new version (2.0.0.004), and it turns out to differ from 1.9.1.004 in some ways. My notes from learning it are as follows:
1. Create an index: IndexWriter
Its constructors:
public IndexWriter(String path, Analyzer a, bool create);
public IndexWriter(FileInfo path, Analyzer a, bool create);
public IndexWriter(Directory d, Analyzer a, bool create);
We usually use the first constructor. String path is the index directory, for example D:\Search\index. Analyzer a is the analyzer used to split text into terms, for example new StandardAnalyzer(). The last parameter indicates whether to create the index from scratch: use true the first time the index is created (true overwrites any existing index); otherwise false is generally used.
Therefore, an index can be written as follows. Create a new document:
Document doc = new Document();
// If the index already exists, open it directly; otherwise create it and then open it.
// this.m_strIndexPath is the index directory.
IndexWriter m_writer = new IndexWriter(this.m_strIndexPath, new Lucene.Net.Analysis.Standard.StandardAnalyzer(), !System.IO.File.Exists(System.IO.Path.Combine(this.m_strIndexPath, "segments")));
2. Create a document: Document (Lucene.Net.Documents)
Constructor:
public Document();
The main methods are as follows:
Method | Description
Add() | Adds a field to the document
Get() | Returns the stored value of the named field
GetBoost() | Gets the document's boost (weight)
SetBoost() | Sets the document's boost (weight)
GetField() | Returns the field with the given name
RemoveField() | Removes the field with the given name from the document
3. Fields: Field
Constructors:
public Field(String name, byte[] value_Renamed, Store store);
public Field(String name, TextReader reader);
public Field(String name, TextReader reader, TermVector termVector);
public Field(String name, String value_Renamed, Store store, Index index);
public Field(String name, String value_Renamed, Store store, Index index, TermVector termVector);
String name is the field name, and String value_Renamed is the field's value. Store takes the values YES and NO: with YES the value is stored in the index; with NO it is not. Index takes the four values listed below, and choosing among them well is a great help when building and optimizing an index.
Value | Description
NO | The field value is not indexed, so the field cannot be searched.
NO_NORMS | The field value is indexed without an analyzer and without storing norms. The advantage is that it takes up less space.
TOKENIZED | The field value is indexed and can be searched. Before it is written to the index, an analyzer splits the text into tokens and, where possible, normalizes it. This is very useful for ordinary text.
UN_TOKENIZED | The field value is indexed without an analyzer, so it can still be searched. Because no analyzer is used, the value is stored as a single term, which is very useful for unique IDs such as product numbers.
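To tie sections 1 to 3 together, here is a minimal sketch, assuming Lucene.Net 2.0. It builds one document and writes it to an index. The class ProductIndexer and its methods are hypothetical names for illustration. The field names are borrowed from the search example in section 4, and the Index option chosen for each field is my assumption, not something the original program is known to use.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper; the class and method names are illustrative only.
public static class ProductIndexer
{
    // Builds one document. The Index option per field is an assumption.
    public static Document BuildDocument(string id, string name, string htmlPath)
    {
        Document doc = new Document();
        // Unique ID: stored, and indexed as a single term without analysis.
        doc.Add(new Field("ID", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Ordinary text: stored, and analyzed so individual words can be searched.
        doc.Add(new Field("P_name", name, Field.Store.YES, Field.Index.TOKENIZED));
        // Display-only value: stored but not indexed, so it cannot be searched.
        doc.Add(new Field("Htmlpath", htmlPath, Field.Store.YES, Field.Index.NO));
        return doc;
    }

    // Adds one document to the index in indexPath, creating the index if needed.
    public static void AddProduct(string indexPath, Document doc)
    {
        // Same existence check as in section 1: look for the "segments" file.
        bool create = !System.IO.File.Exists(System.IO.Path.Combine(indexPath, "segments"));
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), create);
        try
        {
            writer.AddDocument(doc);
        }
        finally
        {
            writer.Close(); // always close the writer so the index is left usable
        }
    }
}
```

A call would look like ProductIndexer.AddProduct(@"D:\Search\index", ProductIndexer.BuildDocument("1001", "red widget", "/p/1001.html")), where the values are made up.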
4. Search (example)
System.DateTime dt = System.DateTime.Now;
IndexSearcher searcher = new IndexSearcher(@"D:\searchprovide");
string q = Keyword.Text;
// Parse the keyword against the P_name field.
Query query1 = new QueryParser("P_name", new StandardAnalyzer()).Parse(q);
Hits hits = searcher.Search(query1);
TimeSpan ts = System.DateTime.Now.Subtract(dt);
string results = ts.TotalMilliseconds.ToString() + " millisecond(s); found " + hits.Length() + " document(s) that matched query '" + q + "':<br/>";
// Show at most the first 30 hits.
int count = hits.Length();
if (count > 30) count = 30;
for (int i = 0; i < count; i++)
{
    Document doc = hits.Doc(i);
    results += doc.Get("Htmlpath") + "<br/>";
    results += doc.Get("ID") + "<br/>";
}
Label1.Text = results;
searcher.Close();
5. Highlight the query results
Lucene.Net 2.0 comes with the Highlighter.Net plug-in, which can be used to highlight search results.
Example:
using Lucene.Net.Highlight; // import the highlighter namespace
// Other code omitted
QueryParser q = new QueryParser("P_name", analyzer);
Query query1 = q.Parse(Request["Bizkeyword"]);
// query is presumably a BooleanQuery created in the omitted code.
query.Add(query1, BooleanClause.Occur.MUST);
Lucene.Net.Highlight.Formatter fm = new SimpleHTMLFormatter("<span style=\"color: red; font-weight: bold\">", "</span>");
Highlighter highlighter = new Highlighter(fm, new QueryScorer(query1));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
// Intermediate code omitted
// Highlight
string p_name = doc.Get("P_name");
Lucene.Net.Analysis.TokenStream tokenStream = analyzer.TokenStream("P_name", new System.IO.StringReader(p_name));
p_name = highlighter.GetBestFragments(tokenStream, p_name, 0, "...");
6. Real-time Index Update (pseudo)
Lucene.net is a very useful open-source search framework, but unfortunately it does not support concurrent read/write operations: the index cannot be updated during a search, and cannot be searched during an update. This is really fatal for scenarios where data is updated often.
In the Java world, some people say that Compass and Hibernate can be used to update Lucene indexes in real time. I have not tried this, and I have not found any relevant material.
As far as I know, Compass has not yet been ported from Java to .NET. As for Hibernate, there is a .NET version, NHibernate, but it is said that its efficiency is not very high.
To solve this problem I tried many approaches. The solution I ended up with is not perfect, but I think it is still quite useful. My method is summarized as follows:
Take indexing products as an example. First build a complete index and save it in directory A, then copy it to directory B. Directory A is the index directory that serves searches under normal circumstances, while directory B is used mainly for updating the index (deleting, modifying, and adding data). Once an update is complete, immediately tell the search code to switch to directory B, delete the old index in A, and copy the latest index from B to directory A.
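The directory switch described above can be sketched as follows. This is only an illustration using System.IO, not my production code: the class IndexDirectorySwitcher and its members are hypothetical names, the final step of switching searches back to A is an assumption (the text only says A serves searches under normal circumstances), and the copy assumes the index directory contains only files and no subdirectories.

```csharp
using System;
using System.IO;

// Hypothetical helper; the names are illustrative and not part of Lucene.Net.
public class IndexDirectorySwitcher
{
    private readonly string dirA;    // normally serves searches
    private readonly string dirB;    // receives index updates
    private volatile string current; // directory searches should use right now

    public IndexDirectorySwitcher(string dirA, string dirB)
    {
        this.dirA = dirA;
        this.dirB = dirB;
        this.current = dirA;
    }

    // Search code should open its IndexSearcher on this directory.
    public string CurrentSearchDirectory
    {
        get { return current; }
    }

    // Call once the update of directory B has finished.
    public void PublishUpdate()
    {
        current = dirB;                    // 1. searches now go to B
        if (Directory.Exists(dirA))
            Directory.Delete(dirA, true);  // 2. delete the old index in A
        CopyFiles(dirB, dirA);             // 3. copy the latest index from B to A
        current = dirA;                    // 4. assumed: searches return to A
    }

    // Copies the top-level files only (assumes no subdirectories).
    private static void CopyFiles(string source, string target)
    {
        Directory.CreateDirectory(target);
        foreach (string file in Directory.GetFiles(source))
        {
            File.Copy(file, Path.Combine(target, Path.GetFileName(file)), true);
        }
    }
}
```

The sketch leaves out two things a real implementation needs: any searcher still open on directory A has to be closed before A is deleted, and directory B must not be written to while it is being copied.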
The reason I add the word "pseudo" is that this is not truly real-time; there is still a delay. Optimizing the index is time-consuming in Lucene.net, so it is not feasible to update the index every time a single record is inserted. My approach is to update and optimize the index at intervals: when the amount of modified data reaches a certain threshold, I apply all the changes together. This reduces the number of I/O operations.
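The batching idea can be sketched roughly like this. It is a hypothetical helper, not part of Lucene.Net: PendingUpdateBuffer, its threshold parameter, and the flush callback are names I made up for illustration, and the callback stands in for whatever code actually updates and optimizes the index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of Lucene.Net.
public class PendingUpdateBuffer
{
    private readonly int threshold;
    private readonly Action<IList<string>> flush;
    private readonly List<string> pending = new List<string>();
    private readonly object gate = new object();

    public PendingUpdateBuffer(int threshold, Action<IList<string>> flush)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException("threshold");
        if (flush == null) throw new ArgumentNullException("flush");
        this.threshold = threshold;
        this.flush = flush;
    }

    // Records one changed record ID; flushes once the threshold is reached.
    public void Add(string recordId)
    {
        List<string> batch = null;
        lock (gate)
        {
            pending.Add(recordId);
            if (pending.Count >= threshold)
            {
                batch = new List<string>(pending);
                pending.Clear();
            }
        }
        if (batch != null) flush(batch); // run the index update outside the lock
    }

    // Can be called from a timer to flush at intervals regardless of the count.
    public void FlushNow()
    {
        List<string> batch;
        lock (gate)
        {
            if (pending.Count == 0) return;
            batch = new List<string>(pending);
            pending.Clear();
        }
        flush(batch);
    }

    public int PendingCount
    {
        get { lock (gate) { return pending.Count; } }
    }
}
```

With a threshold of, say, 500, the index is touched once per 500 changed records instead of once per record, and a timer calling FlushNow() keeps the delay bounded when changes arrive slowly.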
-------------------------------------------------
PS: Today I found an article: http://blog.csdn.net/poson/archive/2008/03/21/2201880.aspx
That article briefly describes concurrent operations in Lucene.net. If it is correct, then the statement in my article that "the index cannot be updated during a search, and cannot be searched during an update" is wrong. Even so, it is certain that updating the index consumes a lot of CPU resources.