Lucene.Net Series, Part 1: Learning Lucene and Creating Indexes


Some time ago I did some research on Lucene.Net, admittedly only a brief study. I feel I learned something from it, so I would like to share it with you, and I welcome any advice and corrections.

The following introduction to Lucene.Net is adapted from http://www.cnblogs.com/anan/archive/2008/04/20/1162283.html

1. Lucene Introduction

Lucene can index text data. Therefore, as long as you can convert the data you want to index into text, Lucene can index and search your documents. For example, to index HTML or PDF documents, you must first convert them into text format and then hand the converted content to Lucene for indexing. The created index files are saved to disk or memory, and queries entered by the user are then run against those index files.

Figure 1 shows the relationship between a search application and Lucene, and also reflects the process of building a search application with Lucene:


Indexing and Searching

Indexing is at the core of modern search engines. The indexing process turns the source data into an index file that is very efficient to query.

Why is indexing so important? Imagine searching for a keyword in a large collection of documents: you would have to read each document into memory in turn and check whether it contains the keyword, which takes a long time. Yet search engines return results within milliseconds, precisely because an index has been built. You can think of an index as a data structure that allows fast random access to the keywords stored in it, from which you then find the documents associated with each keyword.

Lucene uses an inverted index. An inverted index maintains a table of words/phrases; for each entry in this table, a list records which documents contain that word/phrase. This way, search results can be obtained quickly when a query is entered.
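As an illustrative sketch only (not Lucene's actual implementation), an inverted index can be modeled as a dictionary from each term to the sorted set of document IDs containing it:

```csharp
using System;
using System.Collections.Generic;

class InvertedIndexDemo
{
    static void Main()
    {
        // Three tiny "documents", keyed by document ID.
        var docs = new Dictionary<int, string>
        {
            { 1, "lucene is a search library" },
            { 2, "lucene builds an inverted index" },
            { 3, "databases are not search engines" }
        };

        // Build the inverted index: term -> set of document IDs containing it.
        var index = new Dictionary<string, SortedSet<int>>();
        foreach (var kv in docs)
        {
            foreach (var term in kv.Value.Split(' '))
            {
                if (!index.TryGetValue(term, out var postings))
                {
                    postings = new SortedSet<int>();
                    index[term] = postings;
                }
                postings.Add(kv.Key);
            }
        }

        // Querying is now a single dictionary lookup, not a scan of every document.
        Console.WriteLine(string.Join(",", index["lucene"])); // documents 1 and 2
        Console.WriteLine(string.Join(",", index["search"])); // documents 1 and 3
    }
}
```

Real inverted indexes add much more (term positions, scoring statistics, compressed on-disk posting lists), but the lookup shape is the same: from term directly to matching documents.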

Because Lucene provides a simple and easy-to-use API, you can easily use Lucene to index your documents even if you are not familiar with full-text indexing mechanisms at first.

After you have created an index for your documents, you can search it. The search engine first parses the search keywords, then searches the created index, and finally returns the documents associated with the keywords the user entered.

Overview of the Lucene.Net Namespaces

Namespace: Lucene.Net.Documents

This namespace provides the classes needed to encapsulate documents to be indexed, such as Document and Field.

In this way, each source document is ultimately encapsulated in a Document object.

Namespace: Lucene.Net.Analysis

The main function of this namespace is to tokenize (segment) documents. Because documents must be tokenized before an index can be built, this package can be seen as preparation for index creation.
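To illustrate what tokenization produces, here is a small sketch assuming the Lucene.Net 2.x TokenStream API (StandardAnalyzer stands in for whichever analyzer you actually use, and the field name "Body" is arbitrary):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class TokenizeDemo
{
    public static void Main()
    {
        Analyzer analyzer = new StandardAnalyzer();

        // Feed a piece of text through the analyzer and print each token it emits.
        TokenStream stream = analyzer.TokenStream("Body", new StringReader("Lucene indexes text"));
        Token token;
        while ((token = stream.Next()) != null)
        {
            Console.WriteLine(token.TermText()); // lowercased terms such as "lucene"
        }
    }
}
```

The tokens printed here are exactly what gets written into the index; this is why a Chinese-aware analyzer such as PanGuAnalyzer (used later in this article) matters, since splitting on whitespace alone does not work for Chinese text.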

Namespace: Lucene.Net.Index

This namespace provides classes that help you create indexes and update existing ones. The two basic classes are IndexWriter and IndexReader: IndexWriter creates indexes and adds documents to them, while IndexReader is used, among other things, to delete documents from an index.

Namespace: Lucene.Net.Search

This namespace provides the classes required to search the created indexes, such as IndexSearcher and Hits. IndexSearcher defines search methods over a specified index, and Hits holds the search results.
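As a sketch of the search side (assuming Lucene.Net 2.x, where the Hits class still exists; the field name "Title" and the path/analyzer parameters are illustrative, not from the original project):

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class SearchDemo
{
    public static void Search(string indexPath, Analyzer analyzer, string keywords)
    {
        // Open the index that was built earlier.
        IndexSearcher searcher = new IndexSearcher(indexPath);

        // Parse the user's keywords against a field that was tokenized at index
        // time, using the SAME analyzer that was used to build the index.
        QueryParser parser = new QueryParser("Title", analyzer);
        Query query = parser.Parse(keywords);

        // Hits holds the matching documents, ranked by relevance.
        Hits hits = searcher.Search(query);
        for (int i = 0; i < hits.Length(); i++)
        {
            Document doc = hits.Doc(i);
            Console.WriteLine(doc.Get("Title"));
        }

        searcher.Close();
    }
}
```

Note that the query must be analyzed with the same analyzer used at indexing time; otherwise the query terms will not match the terms stored in the index.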

 

 

2. Create an index

Lucene searches against the data in its index files. Some may ask: why not retrieve directly from the database? First, implementing tokenized full-text retrieval inside a database is quite difficult. If the data volume is small, perhaps a few hundred or a few thousand records, you can search in the database. But when the volume reaches millions or even hundreds of millions of records, the efficiency of direct database retrieval becomes the biggest problem: a simple search may take several minutes or longer, and in a web application the response would time out.

To create an index you of course need a data source. I have used SQL Server, Access, SQLite, and so on. Reading data from a database is straightforward and will not be repeated here. It is worth noting, however, that the data is read from the database and written into the index file one record at a time; that is, read a record, then write it, rather than reading everything and then writing. If the database is large, how would we hold all the retrieved data? In an IList<T>? A DataTable? How much memory would it take to hold that much data? Reading it all is a problem, let alone writing.

In my opinion the best solution is a DataReader. Its mechanism is that it only pulls data from the database as it is requested, and it reads one record at a time, similar to a database cursor. At least, that is the solution I came up with in my project; I do not know how Baidu handles it.

The following code reads the data:

/// <summary>
/// Get all data, read as a stream
/// </summary>
/// <returns></returns>
public SqlDataReader GetDataReader()
{
    SqlConnection conn = new SqlConnection(connStr);
    conn.Open();
    string sql = "SQL or stored procedure";
    SqlCommand comm = new SqlCommand(sql, conn);
    SqlDataReader dataReader = comm.ExecuteReader();
    return dataReader;
}

To create an index, you must specify the location where the index files will be created.

/// <summary>
/// Index path
/// </summary>
private static string indexPath = ConfigurationManager.AppSettings["IndexPath"].Trim();
private string m_indexPath = AppDomain.CurrentDomain.BaseDirectory + indexPath;

Next, specify an analyzer. You can use one provided by Lucene or one written by others; I used Pan Gu word segmentation (PanGuAnalyzer), which works better for Chinese search.

/// <summary>
/// Analyzer
/// </summary>
private Analyzer m_analyzer = new PanGuAnalyzer();

Then read data from the database and write the index file.

#region Create an index
public void CreateIndexer()
{
    // Instantiate an index writer. The first parameter is the location of the
    // index files; the second determines which analyzer (word segmentation
    // method) is used when writing the index.
    IndexWriter writer = new IndexWriter(m_indexPath, m_analyzer, true);
    SqlDataReader reader = bll.GetDataReader(); // read the data
    while (reader.Read())
    {
        CreateIndex(reader, writer); // write a single record to the index file
    }
    writer.Close(); // the index is locked while being written; close the writer when done
}

 

Write a single record to the index file:

/// <summary>
/// Index creation method (overload)
/// </summary>
public void CreateIndex(SqlDataReader reader, IndexWriter writer)
{
    Document doc = new Document(); // equivalent to one record in the index file
    // Fields to be written to the index file. For efficiency, write only the
    // fields needed during retrieval, and read only the fields you need. It is
    // advisable to fetch record details from the database after retrieving the
    // search results. (The actual field names are omitted here for project
    // confidentiality.)
    doc.Add(new Field("FieldName", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.AddDocument(doc);
}
#endregion

When writing to the index, doc.Add takes a Field whose constructor has two enumeration parameters, Field.Store and Field.Index. These two enumerations are described below.

Field.Store.YES: store the field value (the original value, before tokenization)
Field.Store.NO: do not store the value. Storage is independent of indexing.
Field.Store.COMPRESS: compressed storage, for long text or binary data, at some performance cost

Field.Index.ANALYZED: tokenize and build the index
Field.Index.ANALYZED_NO_NORMS: tokenize and build the index, without storing norms
Field.Index.NOT_ANALYZED: build the index without tokenizing
Field.Index.NOT_ANALYZED_NO_NORMS: build the index without tokenizing and without storing norms (norms occupy one byte per field per document)

A term vector records the terms of a field (located by document and field) and how many times each term appears in that document.
Field.TermVector.YES: store this field's term vector for each document
Field.TermVector.NO: do not store term vectors
Field.TermVector.WITH_POSITIONS: also store term positions
Field.TermVector.WITH_OFFSETS: also store term offsets
Field.TermVector.WITH_POSITIONS_OFFSETS: store both positions and offsets

To explain: Field.Store controls whether, and how, the field value is stored. YES writes the field value into the index file, NO does not store it at all (storage is unrelated to whether the field is indexed), and COMPRESS stores it compressed.

Field.Index controls how the field is indexed, i.e., whether it is tokenized first. The list above describes each option, but it may not be clear on first reading, so here is an example.

For example, suppose a record has a title field. It certainly must be stored, but should it be tokenized? A title search is usually a fuzzy search: as long as the input words appear in the title, the record should be found. Therefore the title must be tokenized when the index is created. That is what tokenized indexing means: if you do not tokenize, the input must match the entire field value exactly before the record can be retrieved.
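To make the trade-offs concrete, here is a hypothetical set of fields (the field names Id, Title, and Body are illustrative only; the ANALYZED/NOT_ANALYZED constants follow the newer naming in the list above, while older Lucene.Net versions call them TOKENIZED/UN_TOKENIZED):

```csharp
using Lucene.Net.Documents;

public static class FieldChoices
{
    public static Document BuildDocument(string id, string title, string body)
    {
        Document doc = new Document();

        // Id: stored but NOT tokenized - it is only ever matched exactly, and
        // is used after the search to fetch the full record from the database.
        doc.Add(new Field("Id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Title: stored so it can be shown in the result list, and tokenized
        // so a query matches whenever its words appear anywhere in the title.
        doc.Add(new Field("Title", title, Field.Store.YES, Field.Index.ANALYZED));

        // Body: tokenized for searching but not stored - this keeps the index
        // small; the full text is read from the database only when needed.
        doc.Add(new Field("Body", body, Field.Store.NO, Field.Index.ANALYZED));

        return doc;
    }
}
```

This matches the advice earlier in the article: index what must be searchable, store only what must be displayed, and go back to the database for the details.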

 

Index creation runs as a separate console application because the data volume is large. Although a web application might have enough memory to build the index, the web has session and request time limits. Moreover, a console application can make full use of the machine's resources, using as much memory as it needs, which a web application certainly cannot.

The above is a brief introduction to Lucene and an explanation of how to create an index. There are many shortcomings, as I am still at an early stage of applying Lucene; I welcome your advice.

 
