Site search with Lucene.Net

Source: Internet
Author: User

One. Full-text search:

A LIKE query does a full table scan, which is a performance killer.
Lucene.Net is an open-source search engine, whereas the SQL search engine is a paid product.
Lucene.Net is only a full-text search development library: it stores and retrieves data for us but has no user interface. It can be regarded as a database that only retrieves text.
How Lucene.Net works: it stores the words of the text, then finds articles through the word list, the way a glossary points to pages.
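
The "word list that points to articles" idea can be sketched as a toy inverted index. This is only an illustration of the principle, not how Lucene.Net stores its index; the class name ToyInvertedIndex and the whitespace split (standing in for a real word segmenter) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Toy inverted index: maps each word to the ids of the articles that contain it.
public static class ToyInvertedIndex
{
    public static Dictionary<string, SortedSet<int>> Build(IDictionary<int, string> articles)
    {
        var index = new Dictionary<string, SortedSet<int>>(StringComparer.OrdinalIgnoreCase);
        foreach (var article in articles)
        {
            // Naive whitespace split; a real engine would use an analyzer here.
            var words = article.Value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                if (!index.TryGetValue(word, out var ids))
                {
                    ids = new SortedSet<int>();
                    index[word] = ids;
                }
                ids.Add(article.Key);
            }
        }
        return index;
    }

    // Looks the word up in the word list instead of scanning every article.
    public static IEnumerable<int> Search(Dictionary<string, SortedSet<int>> index, string word)
    {
        return index.TryGetValue(word, out var ids) ? ids : new SortedSet<int>();
    }
}
```

A search is then a dictionary lookup rather than a scan of every article, which is the difference from a LIKE query.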

Two. Word segmentation algorithms:

Unigram (single-character) segmentation algorithm (reference Lucene.Net.dll)

Bigram (two-character) segmentation algorithm (CJK: Chinese, Japanese, Korean; additionally requires CJKAnalyzer.cs and CJKTokenizer.cs)
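
The difference between the two algorithms can be sketched with plain string handling. This is a simplified illustration, not the code of Lucene.Net's analyzers: the class NGramSketch is hypothetical, and it ignores punctuation, non-CJK text and surrogate pairs, which real analyzers handle.

```csharp
using System;
using System.Collections.Generic;

public static class NGramSketch
{
    // Unigram: every character becomes its own token.
    public static List<string> Unigrams(string text)
    {
        var tokens = new List<string>();
        foreach (char c in text)
            tokens.Add(c.ToString());
        return tokens;
    }

    // Bigram: every pair of adjacent characters becomes a token (pairs overlap).
    public static List<string> Bigrams(string text)
    {
        var tokens = new List<string>();
        if (text.Length == 1) tokens.Add(text); // a lone character is kept as-is
        for (int i = 0; i + 1 < text.Length; i++)
            tokens.Add(text.Substring(i, 2));
        return tokens;
    }
}
```

For the input 北京欢迎你, Unigrams yields 北, 京, 欢, 迎, 你 and Bigrams yields 北京, 京欢, 欢迎, 迎你. Neither knows which pairs are real words, which is what the dictionary-based approach addresses.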

Dictionary-based segmentation algorithm (the PanGu segmentation algorithm)

Open PanGu4Lucene\WebDemo\bin and add the Dictionaries folder to the project root (renamed Dict). For each file in it, set the Copy to Output Directory property to Copy if newer.
Add a reference to PanGu.dll (when referencing PanGu.dll directly, there must be no PanGu.xml alongside it).
Add a reference to PanGu.Lucene.Analyzer.dll from PanGu4Lucene\Release.

PanGu_Release_V2.3.1.0\Release\DictManage.exe opens the binary dictionary Dict.dct so that you can browse its vocabulary.

Three. Writing the index:

Introduction to the Lucene.Net index-writing classes

Open the folder: specify the folder the index is written to.
Lock the files to prevent two writers from writing at the same time (concurrency).
Check whether the folder already contains index data; if it does, update it instead of creating a new index.
Read the text to be indexed and write it into a Document.
Close after writing. Closing releases the lock so that others can write. (If a bug makes the program crash while it holds the lock, the lock stays behind and has to be released by force.)
The role of each class:
Directory stores the data: FSDirectory (in files), RAMDirectory (in memory).
IndexReader is the class that reads the index; IndexWriter is the class that writes to it.
IndexReader's bool IndexExists(Directory directory) determines whether the directory is an index directory.
IndexWriter's bool IsLocked(Directory directory) determines whether the directory is locked.
IndexWriter locks the index automatically while writing and unlocks it on Close. The IndexWriter.Unlock method releases the lock manually (for example, when the program crashed before closing the IndexWriter, leaving the lock in place).
IndexWriter(Directory dir, Analyzer a, bool create, MaxFieldLength mfl): which folder to write to, which segmentation algorithm to use, whether to create a new index, and the maximum field length.
void AddDocument(Document doc) adds a document to the index.
Document's Add method adds a field to a document.
DeleteAll() deletes all documents; DeleteDocuments deletes documents that match a condition.
Field class constructor: Field(string name, string value, Field.Store store, Field.Index index, Field.TermVector termVector)
The parameters are: the field name; the field value; whether to store the original text in the index; how to index the field (Field.Index.ANALYZED for fields that need full-text search, NOT_ANALYZED for fields that do not); and termVector, which records the distance between indexed words (words that are far apart indicate low relevance).
Handling concurrency (single writer): use a message queue to ensure that only one program (thread) operates on the index. Other programs do not write to the index directly. Instead they put the data to be written into the message queue, and one dedicated program takes the data off the queue and writes it to the index.
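
The single-writer pattern described above can be sketched with an in-process queue. This is a minimal sketch under stated assumptions: BlockingCollection stands in for the external message queue the article uses, the class SingleWriterSketch and its members are hypothetical, and the list Written stands in for the index.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class SingleWriterSketch
{
    // Stand-in for the message queue (assumption: the article uses an external queue).
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    // Stand-in for the index: only the writer thread ever touches this list.
    public readonly List<string> Written = new List<string>();

    // Safe to call from any number of producer threads (e.g. web requests).
    public void Enqueue(string articleId) { _queue.Add(articleId); }

    // Runs on exactly one thread: the only code that writes to the "index".
    private void WriterLoop()
    {
        foreach (var articleId in _queue.GetConsumingEnumerable())
            Written.Add(articleId); // a real writer would update the index here
    }

    public void Run(int producerCount)
    {
        var writer = new Thread(WriterLoop);
        writer.Start();

        var producers = new List<Thread>();
        for (int p = 0; p < producerCount; p++)
        {
            int id = p; // copy the loop variable for the closure
            var t = new Thread(() => Enqueue("article-" + id));
            producers.Add(t);
            t.Start();
        }
        foreach (var t in producers) t.Join();

        _queue.CompleteAdding(); // no more items: lets the writer loop end
        writer.Join();
    }
}
```

Because producers only enqueue, they never compete for the index lock; all contention is absorbed by the queue.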

Writing articles to the index when they are added or edited:

Reference the four ServiceStack DLLs, used for the queue

Reference Quartz.dll and Common.Logging.dll for scheduled tasks

Reference Lucene.Net.dll, PanGu.dll and PanGu.Lucene.Analyzer.dll for writing the index

Add the dictionary folder, renamed Dict, with its files set to Copy if newer

Enqueue the article when it is added or edited

Dequeue the article and write it to the index

If writing the index fails with this error:

Could not load file or assembly "PanGu, Version=2.3.0.0, Culture=neutral, PublicKeyToken=null"
or one of its dependencies. The system cannot find the file specified.

Reason:

The program that executes the index write must also reference PanGu.dll, because the program that starts the write depends on the class that actually writes the index, and therefore on that class's dependencies.

Four. Article search:

query.Add(new Term("field name", "keyword"))
query.Add(new Term("field name 2", "keyword 2"))
Similar to: WHERE field name CONTAINS keyword AND field name 2 CONTAINS keyword 2
PhraseQuery is used to search for multiple keywords
PhraseQuery.SetSlop(int slop) sets the maximum distance between the words
BooleanQuery can implement: field name CONTAINS keyword OR field name 2 CONTAINS keyword 2

The word segmentation algorithm used when searching must be the same one used to build the index, here the PanGu segmentation algorithm.

Total number of hits: totalSize = collector.GetTotalHits()

The results for a page should run from (pageNum-1)*5 to pageNum*5, but in collector.TopDocs(m, n), m is the starting position, counted from 0, and n is the number of results to return.
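
The page arithmetic can be sketched on its own, without any Lucene.Net calls. The helper PagingSketch and its method names are hypothetical; it only computes the (m, n) pair for a call shaped like TopDocs(m, n), with the page size of 5 taken from the text.

```csharp
using System;

public static class PagingSketch
{
    public const int PageSize = 5; // the article shows 5 results per page

    // Converts a 1-based page number into the 0-based start and the count.
    public static (int Start, int Count) ToRange(int pageNum, int totalHits)
    {
        if (pageNum < 1) throw new ArgumentOutOfRangeException(nameof(pageNum));
        int start = (pageNum - 1) * PageSize; // 0-based offset of the first hit
        // The last page may hold fewer than PageSize hits; never go negative.
        int count = Math.Max(0, Math.Min(PageSize, totalHits - start));
        return (start, count);
    }

    // Number of pages needed for totalHits results (rounds up).
    public static int PageCount(int totalHits)
    {
        return (totalHits + PageSize - 1) / PageSize;
    }
}
```

With 12 hits in total, ToRange(3, 12) gives a start of 10 and a count of 2, and PageCount(12) is 3. Passing pageNum*5 as the second argument instead of the count is the off-by-a-page mistake the text warns about.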

Newscontroller.ashx?action=search

Five. Index-writing optimizations:

Use multithreading to keep the interface from freezing:

Time-consuming operations block the main thread, so they need to be moved into a child thread.

The child thread should end when the main thread ends, so it is set as a background thread: background threads are terminated automatically once the main thread has exited, whereas a foreground child thread would keep the process alive.

Example:

TestThread.Form1.cs

Instance:

A scheduled task dequeues articles on a child thread and writes their index. When the window closes, both the child (dequeuing) thread and the Quartz.NET scheduler are terminated.
First, start the form and run the scheduled task. The scheduled task dequeues the news.
Then, because dequeuing the news is time-consuming, delegate it to a child thread, set that thread as a background thread, and start it. The dequeue loop is controlled by while (IsRunning), so set IsRunning = true beforehand:
IsRunning = true;
Thread thread = new Thread(RunScan); // delegate RunScan to a child thread
thread.IsBackground = true; // make the child thread a background thread
thread.Start(); // start the background thread, which runs RunScan
Then the background thread runs the dequeue loop:
public static bool IsRunning { get; set; } // whether the thread should keep running
public void RunScan()
{
    while (IsRunning) // once the form is closed, IsRunning = false and the loop ends
    {...
The child thread keeps running until the form is closed. At that point IsRunning is set to false to end the RunScan() loop on the background thread, and the Quartz.NET scheduler is shut down as well, so that nothing keeps running after the form has closed:
private void Form1_FormClosed(object sender, FormClosedEventArgs e)
{
    NewsIndexer.IsRunning = false; // end the RunScan loop on the background thread
    SendNewRegisterUM.schedWNI.Shutdown(); // also shut down the Quartz.NET scheduler so it does not outlive the form
}
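
The fragments above can be put together as one small class that runs outside a form. This is a sketch under assumptions rather than the article's NewsIndexer: the class ScanWorker, the volatile flag, the Iterations counter and the Stop method are all hypothetical, and the loop body is a placeholder for dequeuing and indexing.

```csharp
using System;
using System.Threading;

public class ScanWorker
{
    private volatile bool _isRunning; // volatile so the loop sees the change made by Stop
    private Thread _thread;
    public int Iterations;            // counts loop passes, for demonstration only

    public void Start()
    {
        _isRunning = true;
        _thread = new Thread(RunScan);
        _thread.IsBackground = true;  // does not keep the process alive on its own
        _thread.Start();
    }

    private void RunScan()
    {
        while (_isRunning)
        {
            Iterations++;             // placeholder for "dequeue one item and index it"
            Thread.Sleep(10);
        }
    }

    // Intended to be called from the form's closed handler.
    public void Stop()
    {
        _isRunning = false;
        _thread.Join();               // wait for the current pass to finish
    }
}
```

Setting the flag and then joining lets the current pass finish cleanly instead of being cut off mid-write, which matters when the pass holds the index lock.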

Timer.NewsIndex.cs TimerForm.FormMain.cs

Getting the InnerText of the HTML

Search covers not just the title but also the body, and the results show a preview of part of the body.
The HTML tags need to be filtered out before the body is put into the Lucene.Net index.
This solves the problem of the indexed body being full of HTML tags, which hurts search, adds a lot of junk information, and is awkward to display.
Use HtmlAgilityPack to extract the InnerText.
To handle article edits, re-indexing and similar cases, delete the old document first and then add the new one (equivalent to an update).
HTML parser: takes an HTML document as input and provides an interface for operating on it.
Development package HtmlAgilityPack.1.4.0.zip, used to put the InnerText of the HTML into the index.

Reference HtmlAgilityPack.dll
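
To show what "index the text, not the tags" means, here is a rough stand-in that uses only the .NET base library. It is not HtmlAgilityPack and not a real HTML parser: the regular expressions are a crude approximation that breaks on malformed markup, and the class HtmlTextSketch is hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Drop script and style blocks entirely, including their content.
        string noScripts = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Replace every remaining tag with a space so words do not run together.
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", " ");
        // Decode entities such as &amp; and collapse runs of whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, ToPlainText("<p>Hello <b>Lucene</b> &amp; PanGu</p>") returns "Hello Lucene & PanGu". A parser-based approach such as the one the article uses is the more robust choice for real pages.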

Rebuilding the news index with one click:

Enqueue:

If there is a lot of news, enqueue it for indexing in batches

newscontroller.ashx NewsBLL.cs

Dequeue:

Opening, reading and closing the index directory for every dequeued item is inefficient

Open the index directory once, process everything in the queue, then close the index directory and wait for new data to be enqueued

Each time a dequeued item is added to the index, any existing document with the same ID must be deleted first, because both editing news and the one-click full-text index rebuild add a document whose ID is already in the index
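
The two points above (open once per batch, delete before add) can be sketched with in-memory stand-ins. The class BatchIndexSketch is hypothetical: a Dictionary plays the role of the index, and OpenCount only counts how often the "index" would have been opened.

```csharp
using System;
using System.Collections.Generic;

public class BatchIndexSketch
{
    // Stand-in for the index: document id -> indexed text.
    public readonly Dictionary<int, string> Index = new Dictionary<int, string>();
    public int OpenCount; // how many times the "index" was opened

    // Drains everything currently queued with a single open and close.
    public void DrainQueue(Queue<KeyValuePair<int, string>> queue)
    {
        if (queue.Count == 0) return;        // nothing queued: do not open the index
        OpenCount++;                         // open once per batch, not once per item
        while (queue.Count > 0)
        {
            var item = queue.Dequeue();
            Index.Remove(item.Key);          // delete any old document with this id
            Index.Add(item.Key, item.Value); // then add the new version
        }
        // a real implementation would close the index writer here
    }
}
```

Enqueuing ids 1, 2 and then 1 again leaves two documents, with id 1 holding the latest text, after a single open.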

Timer.NewsIndex.cs

Six. Search optimization:

Highlighting search results:

Take the best-matching snippet from each search result and highlight the keywords in it
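
The idea of "snippet plus highlighted keyword" can be sketched with plain string operations. This is not PanGu.HighLight: the class HighlightSketch is hypothetical, it centres the window on the first occurrence instead of scoring snippets, it matches case-sensitively, and it does not HTML-encode the text.

```csharp
using System;

public static class HighlightSketch
{
    // Returns at most maxLength characters around the first occurrence of keyword,
    // with every occurrence inside that window wrapped in <b> tags.
    public static string Snippet(string text, string keyword, int maxLength)
    {
        int hit = keyword.Length == 0 ? -1 : text.IndexOf(keyword, StringComparison.Ordinal);
        if (hit < 0)
            return text.Length <= maxLength ? text : text.Substring(0, maxLength);

        // Centre the window on the hit, clamped to the start of the text.
        int padding = Math.Max(0, maxLength - keyword.Length) / 2;
        int start = Math.Max(0, hit - padding);
        int length = Math.Min(maxLength, text.Length - start);
        string window = text.Substring(start, length);

        return window.Replace(keyword, "<b>" + keyword + "</b>");
    }
}
```

For example, Snippet("Lucene.Net makes site search fast", "search", 16) returns "site <b>search</b> fast".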

Reference PanGu.HighLight.dll

Front.FrontHelper.cs Front.News.NewsController.ashx

Seven. Extended tasks:

Project task: complete news search, video note search, and combined search
Logic: search (paging / highlighting) <- dequeue <- enqueue t_segment (Id, Name, Note, ChapterId) / t_news (Id, Title, NewsContent, CategoryId)

Combined search:

Enqueue: Admin/course/cateogorycontroller.ashx?action=allsegmentindex Bll/coursebll.cs

Dequeue and write to the index: Timer/newsandsegmentindex.cs

Combined search: Front/news/newscontroller.ashx?action=search

Category: NET, Rupeng-didao
