Lucene in Action 01: A first look at Lucene

Last Update:2015-01-13 Source: Internet

Author: User

Tags: Solr, Elasticsearch


Based on Lucene 3.0.1

1. A brief definition of Lucene

Lucene is a high-performance, extensible information retrieval (IR) library. It gives users an easy-to-use indexing and search API and hides the complex, advanced information retrieval techniques behind it.
Lucene is only a class library that provides search functionality. You have to build the other modules of your search application yourself (web crawling, document processing, server operation, user interface, administration, and so on) according to your actual needs.
Lucene's core JAR is small, only about 1 MB, and many extension modules exist outside the core. In addition, if you are looking for some form of web search, document processing, or search engine, you can find one on the technical support page of the Lucene wiki.

Digression: Nutch is an open source web search engine created by the founder of Lucene. Hadoop grew out of the Nutch project and provides distributed storage and computation tools based on the MapReduce architecture.

2. Components of a search application

2.1 Index Components
An index is the output of the indexing process.
Acquire content

Consider where the content to be searched comes from: a file system, a content management system, a website, a database, local documents, and so on.
Consider whether access control on search results is required.
Consider whether an incremental mechanism is required, so that each run indexes only the documents that have changed since the last run.
Consider whether the content acquisition module should run as a continuous background service that picks up new or modified documents in real time.
Many open source content acquisition applications are available, for example:
Solr: supports relational databases and XML documents, and integrates Tika to handle complex documents
Nutch: crawler for large-scale websites
Grub: web crawler tool
Heritrix: the Internet Archive's web crawler
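The incremental mechanism mentioned above can be illustrated with a minimal sketch. It assumes that the acquisition module knows a last-modified timestamp for every document and remembers when it last ran. The class `IncrementalSelector` and its method are hypothetical names for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: picks the documents that need reindexing.
final class IncrementalSelector {

    // Returns the IDs of documents modified strictly after lastRunMillis, sorted by ID.
    static List<String> changedSince(Map<String, Long> lastModifiedById, long lastRunMillis) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastModifiedById.entrySet()) {
            if (entry.getValue() > lastRunMillis) {
                changed.add(entry.getKey());
            }
        }
        Collections.sort(changed);
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Long> docs = new LinkedHashMap<>();
        docs.put("a.txt", 100L);
        docs.put("b.txt", 250L);
        docs.put("c.txt", 300L);
        // Only documents touched after the previous run (t = 200) are selected.
        System.out.println(changedSince(docs, 200L)); // prints [b.txt, c.txt]
    }
}
```

A real acquisition module would also have to detect deleted documents, which this sketch leaves out.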

Create documents

A document mainly consists of several fields: title, body, abstract, author, URL, and so on.
Consider how to split the raw content into appropriate documents.
Consider whether a document filter is needed to strip formatting and other non-text information from the raw documents.
Consider whether additional fields need to be created.
Consider whether documents and fields need to be weighted (boosted). Weighting can be applied when a document is created, or dynamically at search time.
Lucene provides an API for building fields and documents, and Tika handles document filtering well.
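To make the idea of a document made of named fields with optional weights concrete, here is a small sketch that uses only the JDK. `SimpleDocument` is a hypothetical stand-in written for this illustration; Lucene's own document and field classes have a different API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a search document; Lucene's own classes differ.
final class SimpleDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final Map<String, Float> boosts = new LinkedHashMap<>();

    // Adds a field with the neutral weight 1.0.
    SimpleDocument add(String name, String value) {
        return add(name, value, 1.0f);
    }

    // Adds a field with an explicit weight chosen at document-creation time.
    SimpleDocument add(String name, String value, float boost) {
        fields.put(name, value);
        boosts.put(name, boost);
        return this;
    }

    // Returns the field value, or null if the field was never added.
    String get(String name) {
        return fields.get(name);
    }

    // Returns the field's weight, or 1.0 for unknown fields.
    float boostOf(String name) {
        Float boost = boosts.get(name);
        return boost == null ? 1.0f : boost;
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument()
                .add("title", "A first look at Lucene", 2.0f) // title counts double
                .add("author", "User")
                .add("body", "Lucene is an information retrieval library.");
        System.out.println(doc.get("title") + " (boost " + doc.boostOf("title") + ")");
    }
}
```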

Analyze documents (tokenization)

Analysis splits the text of each field into individual tokens (word segmentation) before indexing.
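Word segmentation can be illustrated with a deliberately naive tokenizer. This is a sketch under simple assumptions: it lowercases the text and splits on anything that is not a letter or digit, which is adequate for space-separated languages but not for Chinese text. `SimpleTokenizer` is a hypothetical name and does not reproduce any Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical tokenizer for illustration; real analyzers do considerably more.
final class SimpleTokenizer {

    // Lowercases the text and splits it on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) { // split can yield an empty leading element
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene is a high-performance IR library."));
        // prints [lucene, is, a, high, performance, ir, library]
    }
}
```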

Build the index

This step adds the documents to the index. Lucene provides a remarkably simple API for the indexing operation.
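The following sketch shows, in a much simplified form, what adding documents to an index can mean: an inverted index that maps each term to the IDs of the documents containing it. `TinyInvertedIndex` is a hypothetical class built on JDK collections; it is not Lucene's API and says nothing about Lucene's on-disk format.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical in-memory inverted index: term -> sorted IDs of documents containing it.
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text naively and records docId under every term.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns a copy of the document IDs for a term (empty if the term is unknown).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new TreeSet<>();
        }
        return new TreeSet<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "Solr builds on Lucene");
        System.out.println(index.lookup("lucene")); // prints [1, 2]
    }
}
```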

2.2 Search Components
Search user interface

Lucene does not provide a default user interface. You have to build one yourself and convert the user's search request into a Query object that the search engine can use.

Build the query

To create Query objects, Lucene provides a powerful package called QueryParser.
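To give a feel for what turning a user's search string into a structured query involves, here is a tiny hand-written parser. It is only a sketch: it understands whitespace-separated terms and a `field:term` prefix with a default field, and nothing else. `TinyQueryParser` and `TermClause` are hypothetical names and do not reproduce Lucene's QueryParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical mini parser; it only mimics the field:term idea.
final class TinyQueryParser {

    // One parsed clause: search for `text` in `field`.
    static final class TermClause {
        final String field;
        final String text;

        TermClause(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // Splits on whitespace; "field:term" targets a field, a bare term uses defaultField.
    static List<TermClause> parse(String defaultField, String input) {
        List<TermClause> clauses = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                clauses.add(new TermClause(token.substring(0, colon),
                        token.substring(colon + 1).toLowerCase(Locale.ROOT)));
            } else {
                clauses.add(new TermClause(defaultField, token.toLowerCase(Locale.ROOT)));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("body", "title:Lucene search"));
        // prints [title:lucene, body:search]
    }
}
```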

Run the query

Query the index and return the matching documents. The returned results may be filtered and sorted.
Digression: the common theoretical models of search are the pure Boolean model, the vector space model, and the probabilistic model. Lucene lets you decide for each search whether to use the pure Boolean model or the vector space model.
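The difference between the pure Boolean model and the vector space model can be sketched in a few lines. The Boolean check only answers whether a document matches; the second method produces a score that allows ranking. The score used here is a simplified TF-IDF sum chosen for illustration, with idf = ln(1 + N/df); it is an assumption of this sketch and not Lucene's actual scoring formula. `RetrievalModels` is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration of two retrieval models over pre-tokenized documents.
final class RetrievalModels {

    // Pure Boolean model: a document matches only if it contains every query term.
    static boolean booleanMatch(List<String> docTerms, List<String> queryTerms) {
        return new HashSet<>(docTerms).containsAll(queryTerms);
    }

    // Simplified vector-space-style score: sum over query terms of tf * ln(1 + N / df).
    static double tfIdfScore(List<String> docTerms, List<String> queryTerms,
                             List<List<String>> corpus) {
        double score = 0.0;
        for (String term : queryTerms) {
            int tf = Collections.frequency(docTerms, term); // occurrences in this document
            int df = 0;                                     // documents containing the term
            for (List<String> doc : corpus) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            if (tf > 0 && df > 0) {
                score += tf * Math.log(1.0 + (double) corpus.size() / df);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "search", "lucene");
        List<String> d2 = Arrays.asList("lucene", "index");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        List<String> query = Arrays.asList("lucene", "search");
        System.out.println(booleanMatch(d1, query) + " " + booleanMatch(d2, query)); // prints true false
        System.out.println(tfIdfScore(d1, query, corpus) > tfIdfScore(d2, query, corpus)); // prints true
    }
}
```

Under the Boolean model the second document is simply excluded, while the scored model still gives it a (lower) rank because it shares one term with the query.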

Show results

This step needs little explanation: present the matching documents to the user.

2.3 Other modules of the search program
Administration interface

For administration, Lucene provides a number of configuration options to developers.

Analytics interface

Lucene does not provide analytics tools.

Distributed search

There are two basic approaches:
If the amount of data is large, split it into smaller parts (index splitting), let several servers each search their own part, and then merge the partial results into the overall result set.
If you want higher search concurrency, copy the same index to several servers (replication) and distribute requests among them with load balancing.
Lucene itself does not provide distributed support, but both Solr and Nutch, which are part of the Apache Lucene project, support index splitting and replication. Other projects include Katta and Elasticsearch. Before building your own search application, it is worth learning about these existing solutions.
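The merge step of the index-splitting approach can be sketched as follows. The sketch assumes that every server returns its own hits with scores that are comparable across servers, which is a simplification; real systems have to take care that this holds. `ShardMerger` and `Hit` are hypothetical names and do not correspond to classes in Lucene, Solr, or Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical merge of per-server result lists into one overall top-k list.
final class ShardMerger {

    // One search hit: a document ID and its score.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Combines all partial results, sorts by score (highest first), and keeps the top k.
    static List<Hit> mergeTopK(List<List<Hit>> shardResults, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shard : shardResults) {
            all.addAll(shard);
        }
        // Ties on score are broken by document ID so the order is deterministic.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparing(h -> h.docId));
        int limit = Math.max(0, Math.min(k, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Hit> server1 = Arrays.asList(new Hit("a", 0.9), new Hit("b", 0.4));
        List<Hit> server2 = Arrays.asList(new Hit("c", 0.7), new Hit("d", 0.1));
        for (Hit hit : mergeTopK(Arrays.asList(server1, server2), 3)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```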



This article is an English version of an article originally published in Chinese on aliyun.com.
