In day-to-day development we often use LIKE to match data, and we all know that LIKE usually triggers a full table scan. As the data volume grows, we get stuck with the database's snail-paced searches and have to look for another way. This is where Lucene comes into its own.
First, let's build a demo: insert 10 million news records into the database, 778 MB in total.
Next, search the news content for records containing the word "popular".
Result: the LIKE query against the database takes 78 s. At that speed the database is unusable for this kind of search.
Now let's see how Lucene handles it. Download Lucene.Net from: http://incubator.apache.org/lucene.net/download.html
The code is as follows:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Diagnostics;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;

namespace First
{
    class Program
    {
        static string path = @"D:\Sample";

        static void Main(string[] args)
        {
            // Create the index
            CreateIndex();

            var watch = Stopwatch.StartNew();

            // Open the index for searching
            IndexSearcher search = new IndexSearcher(path);

            // Query expression parser
            QueryParser query = new QueryParser(string.Empty, new StandardAnalyzer());

            // Parse injects the query condition
            var hits = search.Search(query.Parse("Content:popular"));

            for (int i = 0; i < hits.Length(); i++)
            {
                Console.WriteLine("current content: {0}",
                    hits.Doc(i).Get("Content").Substring(0, 20) + "...");
            }

            watch.Stop();
            Console.WriteLine("search time consumed: {0} ms", watch.ElapsedMilliseconds);
        }

        static void CreateIndex()
        {
            // Create the index library directory
            var directory = FSDirectory.GetDirectory(path, true);

            // Create an index writer; StandardAnalyzer splits the text into terms
            IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer());

            var reader = DbHelperSQL.ExecuteReader("select * from News");
            while (reader.Read())
            {
                // A document is a set of fields, similar to a table row
                Document doc = new Document();

                // The fields to be indexed
                doc.Add(new Field("ID", reader["ID"].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("Title", reader["Title"].ToString(), Field.Store.NO, Field.Index.ANALYZED));
                doc.Add(new Field("Content", reader["Content"].ToString(), Field.Store.YES, Field.Index.ANALYZED));

                indexWriter.AddDocument(doc);
            }
            reader.Close();

            // Optimize and close the index
            indexWriter.Optimize();
            indexWriter.Close();
        }
    }
}
```
Lucene: 448 ms. I was stunned. Of course, this figure does not include the time spent "creating the index"; since the index is built ahead of time, that cost is effectively a one-time constant.
Since this is an introduction, let's briefly walk through how Lucene works. Lucene involves two steps: "indexing" and "searching".
I. Index:
Indexing should be a familiar concept: Lucene splits our content into many terms, uses each term as a key, builds an inverted index, and stores it in the index library.
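To make the idea concrete, here is a minimal, hypothetical sketch of an inverted index in plain C# (this is the concept only, not Lucene's actual data structure): each term maps to the list of document IDs that contain it, so a lookup is a dictionary hit instead of a full scan.

```csharp
using System;
using System.Collections.Generic;

class InvertedIndexDemo
{
    static void Main()
    {
        // Document ID -> content, standing in for rows of the News table
        var docs = new Dictionary<int, string>
        {
            { 1, "popular news today" },
            { 2, "old news archive" },
            { 3, "popular archive" }
        };

        // Term -> sorted list of document IDs containing that term
        var index = new Dictionary<string, List<int>>();
        foreach (var doc in docs)
        {
            foreach (var term in doc.Value.Split(' '))
            {
                if (!index.TryGetValue(term, out var postings))
                    index[term] = postings = new List<int>();
                postings.Add(doc.Key);
            }
        }

        // Lookup: which documents contain "popular"?
        var hits = index["popular"];
        hits.Sort();
        Console.WriteLine(string.Join(",", hits));
    }
}
```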
In the example we can see that the indexing process uses the IndexWriter, FSDirectory, StandardAnalyzer, Document, and Field classes. Here is a brief look at each.
1: IndexWriter
As we can see, this class has an AddDocument method; it is the class that performs the index write operations.
2: FSDirectory
This one is simpler: it specifies where the index library is stored, in this case D:\Sample. Some may ask whether the index can be kept in memory instead. With Lucene, of course it can: RAMDirectory does exactly that. If memory is large enough, holding the index library in RAM improves search efficiency.
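As a sketch (based on the same Lucene.Net 2.x API used in the listing above), swapping the on-disk directory for an in-memory one only changes how the directory is created; everything else stays the same:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class RamIndexDemo
{
    static void BuildInMemoryIndex()
    {
        // The index lives entirely in RAM instead of on disk
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer());

        // ... add documents exactly as in CreateIndex() above ...

        writer.Optimize();
        writer.Close();
    }
}
```

The trade-off is that the index disappears when the process exits, so RAMDirectory suits caches and tests rather than durable storage.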
3: StandardAnalyzer
This is the most critical step in the indexing process, and the part that deserves the most thought when using Lucene. The reason searches return in a flash is precisely that the input content has been split into terms; different splitting strategies correspond to different Analyzers, and StandardAnalyzer is an analyzer that splits text into individual words.
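To see what an analyzer actually produces, here is a hedged sketch assuming the Lucene.Net 2.x TokenStream API (where Next() returns the next Token, or null at the end); it prints the terms StandardAnalyzer would extract from a sample string:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class AnalyzerDemo
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();

        // Tokenize a sample string the same way the indexer does
        TokenStream stream = analyzer.TokenStream("Content", new StringReader("This is Popular News!"));

        Token token;
        while ((token = stream.Next()) != null)
        {
            // StandardAnalyzer lower-cases terms and drops punctuation and stop words
            Console.WriteLine(token.TermText());
        }
    }
}
```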
4: Document
From the example above we can see that a Document is a collection of Fields which is then added to the IndexWriter; it is analogous to a row in a table.
5: Field
A Field holds one piece of content to be indexed, in key/value form.
①: Field.Store.YES, Field.Index.NOT_ANALYZED means the field value is stored as is and is not split by the StandardAnalyzer.
②: Field.Store.NO, Field.Index.ANALYZED means the value is not stored, but is split by the StandardAnalyzer before being indexed.
II. Search
Searching is relatively easy: given the terms we type in, Lucene can quickly locate them in the index library. In the example we can see IndexSearcher, QueryParser, and Hits.
1: IndexSearcher
This can be understood as opening, in read-only mode, the index library created by IndexWriter; its Search method is the bridge through which QueryParser queries are executed.
2: QueryParser
This provides a Parse method that converts the words we want to search for into a query expression Lucene can understand.
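For example, a sketch using the same QueryParser as in the listing above (the field names Title and Content are the ones defined in CreateIndex()); the standard Lucene query syntax supports a field:term form as well as boolean operators:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class QueryDemo
{
    static void Show()
    {
        var parser = new QueryParser(string.Empty, new StandardAnalyzer());

        // Single-field term query, as used in the example above
        Query q1 = parser.Parse("Content:popular");

        // Boolean combination across fields
        Query q2 = parser.Parse("Title:lucene AND Content:search");
    }
}
```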
3: Hits
Hits is a cursor over the matching results. Much like lazy loading in C#, it fetches results on demand rather than all at once, which improves performance.