The principle and application of Lucene

Source: Internet
Author: User

With the rapid spread and development of the Internet, online public opinion has a growing influence on social life, and research into online word-of-mouth has gradually become an industry of its own. Effective word-of-mouth research requires listening to what netizens are saying, and information retrieval technology greatly improves the efficiency of that work.
As the best-known open source information retrieval library, Lucene is widely used in projects that involve full-text search. This article briefly introduces the basic principles and applications of Lucene, in the hope of exchanging ideas with more peers.

What is Lucene

Lucene is a mature, open source full-text indexing and information retrieval (IR) library implemented in Java. Its role in a system is comparable to that of a database used mainly for full-text retrieval.

[Figure: the relationship between Lucene and the other modules of the system]


Lucene vs. database analogy

Lucene and a database have a lot in common, so here is a simple comparison:

                   Database            Lucene
Basic concepts     Column / Field      Field
                   Row / Record        Document
Basic operations   Query (SELECT)      Searcher
                   Add (INSERT)        IndexWriter.addDocument
                   Delete (DELETE)     IndexReader.delete
                   Modify (UPDATE)     Not supported (delete, then re-add)

Lucene's inverted index

Many people who work with databases have run into a situation like this: to find records containing the phrase 'Olympic Games', they write a SQL statement with a condition such as LIKE '%Olympic Games%'. This approach has serious performance problems when the amount of data is large, because an ordinary database index cannot help with a query whose pattern begins with a wildcard. Lucene, as a library aimed at full-text search, uses a technique called the inverted index instead.

Related concepts
● Term: term = field.name + token.text
● Token: the smallest unit produced by word segmentation, such as "Olympic Games", "China", "Beijing"
● Document: each Document has a unique internal ID (int type); the ID may change when the index is rebuilt
● Inverted index file format:
Term1 DocID1 DocID2 DocID3 ...
Term2 DocID1 DocID2 DocID3 ...
...

From the format above it is easy to see that this index file lets you quickly locate every article that contains the term "Olympic Games".
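The term-to-document-ID layout can be illustrated with a small in-memory sketch. This is not Lucene code: the class InvertedIndexSketch and its methods are hypothetical names used only for illustration, and the text is split on whitespace as a stand-in for a real word segmenter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Indexes one single-field document and returns its internal ID.
    int addDocument(String fieldName, String text) {
        int docId = nextDocId++;
        // Assumption: whitespace splitting stands in for word segmentation.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = fieldName + ":" + token; // term = field.name + token.text
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document only once per term, even if the token repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Looks up the posting list directly; no document text is scanned.
    List<Integer> search(String fieldName, String token) {
        String term = fieldName + ":" + token.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("content", "Beijing hosts the Olympic Games");     // ID 0
        index.addDocument("content", "China wins gold");                     // ID 1
        index.addDocument("content", "Olympic tickets on sale in Beijing");  // ID 2
        System.out.println(index.search("content", "olympic")); // prints [0, 2]
    }
}
```

The lookup is a single map access followed by reading a list, which is why it avoids the full scan that a LIKE '%...%' condition forces.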

Chinese word segmentation and information retrieval models

As the index format above shows, a sentence must be broken into words before it can be indexed, and this is where Chinese word segmentation comes in. Common Chinese word segmentation algorithms include forward maximum matching, reverse maximum matching, and statistics-based segmentation. Note that Lucene provides only the word segmentation interface (no Chinese word segmentation implementation), so a third-party Chinese word segmentation library is generally needed.
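As a rough illustration of forward maximum matching, the sketch below scans the text from left to right and, at each position, takes the longest dictionary word it can find, falling back to a single character. The class name, the tiny dictionary, and the unspaced Latin-letter input (used here as a stand-in for Chinese text) are all assumptions made for the example; real segmenters use much larger dictionaries and extra rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class ForwardMaxMatchSketch {
    // Splits text into tokens by always preferring the longest dictionary
    // word that starts at the current position.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxWordLength, text.length());
            // Shrink the candidate from the right until it is a dictionary
            // word or only a single character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; "olympicgames" plays the role of a
        // multi-character word such as "Olympic Games".
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("beijing", "olympic", "olympicgames", "games"));
        System.out.println(segment("beijingolympicgames", dictionary, 12));
        // prints [beijing, olympicgames]
    }
}
```

Because the longest match wins, "olympicgames" is emitted as one token and the shorter "olympic" never appears in the output, which has consequences for what queries can match.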
When a search for 'Olympic Games' matches thousands of articles, which ones should be ranked first? This depends on Lucene's scoring mechanism; by default, Lucene's scoring is based on the vector space model from information retrieval theory.
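To give a feel for the vector space model, the sketch below weights terms with a simple TF-IDF scheme and ranks documents by cosine similarity to the query. It is a textbook simplification under assumed formulas, not Lucene's actual scoring function; the class name, the smoothing in the IDF term, and the toy corpus are all choices made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not Lucene's scoring implementation.
class VectorSpaceSketch {
    // Builds a TF-IDF weight vector for one tokenized text.
    static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String token : tokens) {
            weights.merge(token, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Document frequency: how many corpus documents contain the term.
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            // Assumed smoothed IDF; rarer terms receive a larger weight.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
            e.setValue(e.getValue() * idf);
        }
        return weights;
    }

    // Cosine of the angle between two sparse weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("beijing", "olympic", "games", "olympic"),
                Arrays.asList("china", "economy", "news"),
                Arrays.asList("olympic", "tickets"));
        Map<String, Double> query = tfIdf(Arrays.asList("olympic"), corpus);
        for (int i = 0; i < corpus.size(); i++) {
            double score = cosine(query, tfIdf(corpus.get(i), corpus));
            System.out.printf("doc %d score %.3f%n", i, score);
        }
    }
}
```

With this toy corpus the first document, which mentions "olympic" twice, scores higher than the third, and the second document scores zero because it shares no term with the query.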
Chinese word segmentation and information retrieval models are both large research topics. Interested readers can search online for related articles to gain a deeper understanding.

Frequently asked questions and suggestions for using Lucene

● Chinese word segmentation libraries: free Chinese word segmenters available online include IKAnalyzer (free but not open source) and Stanford (open source, but you need to wrap it in the Lucene interface yourself)
● Combined query conditions: the QueryParser class supports AND, OR, and many other combined conditions
● Result sorting: Lucene sorts by score by default; by combining the Sort and SortField classes you can specify multiple sort fields and the sort direction. The index type of a sort field must be UN_TOKENIZED
● Distributed queries: the RemoteSearchable class provided by Lucene makes distributed queries possible
● Parallel queries: when the index is distributed across multiple nodes, ParallelMultiSearcher can query them in parallel to improve retrieval performance
● Segmentation and queries: when "Olympic Games" is indexed as a single word, a search for "Olympic" alone cannot retrieve the corresponding results. This can be handled by modifying the search condition or by using a word segmenter with a finer granularity
● Numbers and dates: because Lucene handles indexed values as String, numbers and dates should be left-padded with 0 so that string comparison sorts them correctly
● Field index type: fields that do not need word segmentation, such as email addresses and dates, should use the index type UN_TOKENIZED
● Thread safety: make sure that only one thread writes to the Lucene index at a time; multiple threads can read from it concurrently
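The advice on numbers and dates can be sketched as follows. The helper names padNumber and dateKey are hypothetical, the fixed width is an assumption the caller must choose to fit the largest expected value, and negative numbers are rejected because simple zero-padding does not order them correctly.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class; not part of Lucene.
class SortableFieldSketch {
    // Left-pads a non-negative number with zeros to a fixed width so that
    // string order matches numeric order.
    static String padNumber(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Formats a date as yyyyMMdd, which sorts chronologically as a string.
    static String dateKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Unpadded strings sort incorrectly: "9" comes after "10".
        System.out.println("9".compareTo("10") > 0);                           // prints true
        // Padded strings sort in numeric order.
        System.out.println(padNumber(9, 8).compareTo(padNumber(10, 8)) < 0);   // prints true
        System.out.println(padNumber(42, 8));                                  // prints 00000042
        System.out.println(dateKey(LocalDate.of(2008, 8, 8)));                 // prints 20080808
    }
}
```

Values encoded this way would be stored in a field that is not segmented, so that the whole string is kept as a single term for sorting and range comparison.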
