The principle and application of Lucene

Last Update:2014-11-11 Source: Internet

Author: User

With the rapid spread and development of the Internet, online public opinion has a growing influence on social life, and research into online word-of-mouth has gradually become an industry of its own. Effective word-of-mouth research requires listening to what Internet users are saying, and information retrieval technology makes that work considerably more efficient.
As the best-known open source information retrieval library, Lucene is widely used in projects that involve full-text search. This article briefly introduces the basic principles and application of Lucene, in the hope of exchanging ideas with more peers.

What is Lucene

Lucene is a mature, open source library for full-text indexing and information retrieval (IR), implemented in Java. Within a system, its role is comparable to that of a database used mainly for full-text retrieval. Its relationship with the other modules of the system is as follows:

[Figure: Lucene's position relative to the other modules of the system]

Lucene compared with a database

Lucene and a database have a lot in common, so here is a simple comparison:

	Database	Lucene
Basic concepts	Column / Field	Field
Basic concepts	Row / Record	Document
Basic operations	Query (SELECT)	Searcher
Basic operations	Add (INSERT)	IndexWriter.addDocument
Basic operations	Delete (DELETE)	IndexReader.delete
Basic operations	Modify (UPDATE)	Not supported (delete the document and add it again)

Lucene's inverted index

Many people who work with databases have run into a situation like this: to find records containing the phrase 'Olympic Games', one typically writes an SQL statement with a condition such as LIKE '%Olympic Games%'. This approach has serious performance problems when the data volume is large, because an ordinary database index cannot help with a query whose pattern begins with a wildcard. Lucene, as a library used mainly for full-text search, relies instead on a technique called the inverted index.

Related concepts
● Term: term = field.name + token.text
● Token: the smallest unit produced by word segmentation, for example: Olympic Games, China, Beijing
● Document: each Document has a unique internal ID (of type int); the ID may change when the index is rebuilt
● Inverted index file format:
Term1 DocID1 DocID2 DocID3 ...
Term2 DocID1 DocID2 DocID3 ...
...

From this format it is easy to see that, with such an index file, all articles containing the term 'Olympic Games' can be located quickly.
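The term-to-document-ID mapping described above can be sketched in a few lines of plain Java. This is only a toy in-memory illustration, not Lucene's actual on-disk format or API: the class name InvertedIndexSketch, its methods, and the "field:token" term encoding are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy index: illustrates the idea only, not Lucene's file format.
class InvertedIndexSketch {
    // term ("field:token") -> ascending list of IDs of documents containing it
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document whose text is already split into tokens; returns its ID.
    int addDocument(String field, List<String> tokens) {
        int docId = nextDocId++;
        for (String token : tokens) {
            String term = field + ":" + token;
            List<Integer> docIds = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document once, even if the token repeats inside it.
            if (docIds.isEmpty() || docIds.get(docIds.size() - 1) != docId) {
                docIds.add(docId);
            }
        }
        return docId;
    }

    // Looks up one term directly; no document text is scanned.
    List<Integer> search(String field, String token) {
        return postings.getOrDefault(field + ":" + token, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("content", Arrays.asList("Olympic Games", "Beijing")); // doc 0
        index.addDocument("content", Arrays.asList("China", "Beijing"));         // doc 1
        index.addDocument("content", Arrays.asList("Olympic Games", "China"));   // doc 2
        System.out.println(index.search("content", "Olympic Games")); // prints [0, 2]
    }
}
```

A lookup costs one map access regardless of how many documents exist, which is the contrast with a LIKE '%...%' condition that has to scan every row.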

Chinese word segmentation and information retrieval models

As the index format above shows, a sentence must be broken into words before it can be indexed, and this is where Chinese word segmentation comes in. Common Chinese word segmentation algorithms include forward maximum matching, reverse maximum matching, and statistics-based segmentation. Note that Lucene provides only the word segmentation interface, with no Chinese word segmentation implementation, so a third-party Chinese word segmentation library is generally needed.
When searching for articles that contain 'Olympic Games', there may be thousands of matches. Which of them should be ranked first? That depends on Lucene's scoring mechanism. By default, Lucene's scoring is based on the vector space model from information retrieval theory.
Chinese word segmentation and information retrieval models are both large research topics; interested readers can search online for related articles to learn more.
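As a rough illustration of forward maximum matching, the sketch below scans the text from left to right and, at each position, takes the longest dictionary word that starts there. The class name, the tiny dictionary, and the use of unspaced Latin text as a stand-in for Chinese are all assumptions for this example; a real segmenter needs a large dictionary and ambiguity handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of dictionary-based forward maximum matching.
class ForwardMaxMatchSketch {
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Try the longest candidate first, then shrink it one character at a time.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // If no dictionary word matched, a single character is emitted as a token.
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("beijing", "olympic", "games", "olympicgames"));
        // "olympicgames" wins over the shorter "olympic" because it is tried first.
        System.out.println(segment("beijingolympicgames", dictionary, 12));
        // prints [beijing, olympicgames]
    }
}
```

Reverse maximum matching follows the same idea but scans from the end of the sentence toward the beginning.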
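To give a feel for how a vector space model ranks results, here is a minimal sketch that weights each token by TF-IDF and compares query and document vectors by cosine similarity. It is a textbook simplification, not Lucene's actual scoring formula; the class name, method names, and the particular IDF variant are assumptions for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of vector space ranking; Lucene's real scoring differs in detail.
class VectorSpaceSketch {
    // Builds a TF-IDF weight vector for one list of tokens.
    static Map<String, Double> weights(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(token, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> entry : vector.entrySet()) {
            int docFreq = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(entry.getKey())) docFreq++;
            }
            // Rarer terms get a larger weight; the +1 terms keep the value positive.
            double idf = Math.log((double) corpus.size() / (1 + docFreq)) + 1.0;
            entry.setValue(entry.getValue() * idf);
        }
        return vector;
    }

    // Cosine of the angle between two sparse vectors; 0 when either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
            normA += entry.getValue() * entry.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("olympic", "games", "beijing"),
                Arrays.asList("olympic", "olympic", "games", "olympic"),
                Arrays.asList("china", "economy"));
        Map<String, Double> query = weights(Arrays.asList("olympic"), corpus);
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + cosine(query, weights(doc, corpus)));
        }
    }
}
```

In this sample the second document, which is dominated by the query term, scores highest, and the third, which does not contain it, scores zero.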

Frequently asked questions and suggestions for using Lucene

● Chinese word segmentation libraries: free Chinese word segmenters available online include IKAnalyzer (free but not open source) and Stanford (open source, but you need to wrap it in the Lucene interface yourself)
● Combined query conditions: the QueryParser class supports AND, OR, and many other combinations of conditions
● Result sorting: Lucene sorts by score by default; by combining the Sort and SortField classes you can specify multiple sort fields and ascending or descending order. The index type of a sort field must be UN_TOKENIZED
● Distributed queries: the RemoteSearchable class provided by Lucene can be used to implement distributed queries
● Parallel queries: when the index is distributed across multiple nodes, ParallelMultiSearcher can query them in parallel to improve retrieval performance
● Segmentation and queries: when 'Olympic Games' is indexed as a single word, a search for 'Olympics' cannot retrieve the corresponding results. This can be addressed by modifying the search condition or by having the word segmenter split text at a finer granularity
● Numbers and dates: because the Lucene index handles values as String, numbers and dates should be left-padded with 0 so that string comparison sorts them correctly
● Field index type: fields such as email addresses and dates, which do not need to be segmented, should use the UN_TOKENIZED index type
● Thread safety: make sure that only one thread performs write operations on the Lucene index at a time; multiple threads can perform read operations on it concurrently
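The point about numbers and dates can be seen with plain Java strings. The sketch below is independent of Lucene; the class name, the pad and dateKey helpers, and the chosen width are assumptions for this example. It shows why unpadded numbers sort incorrectly as strings and how fixed-width values fix that.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helpers for storing sortable numbers and dates as strings.
class PaddedValueSketch {
    // Left-pads a non-negative number with zeros to at least the given width.
    // Values with more digits than the width are not truncated, so pick a
    // width large enough for the biggest expected value.
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Formats a date as yyyyMMdd, which sorts chronologically as a string.
    static String dateKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(Arrays.asList("9", "10", "100"));
        Collections.sort(raw);
        System.out.println(raw); // prints [10, 100, 9]: wrong numeric order

        List<String> padded = new ArrayList<>(Arrays.asList(pad(9, 3), pad(10, 3), pad(100, 3)));
        Collections.sort(padded);
        System.out.println(padded); // prints [009, 010, 100]: matches numeric order

        System.out.println(dateKey(LocalDate.of(2008, 8, 8))); // prints 20080808
    }
}
```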

This article is an English version of an article originally written in Chinese and published on aliyun.com.
