Lucene full-text retrieval framework

Source: Internet
Author: User

1. What is Lucene?
Lucene is a full-text search framework, not an application product. It is a tool for building search into your own products; unlike www.baidu.com, it is not something you can use out of the box.
It is an open-source Apache project that implements a full-text search engine in Java.

2. How does Lucene work?
Lucene provides two services:
Write (input): writes the data source you provide into the index.
Read (output): provides full-text search over the index, so that users can locate the source by keyword.

3. Write Process
The data source is first processed by an analyzer, which performs word segmentation (splitting the text into individual terms).
The required information from the data source is added to the Fields of a Document. Fields marked for indexing are indexed after word segmentation, and fields marked for storage are stored.
The index is written to storage, which can be memory or disk.
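
The three write steps above can be sketched in plain Java. This is a toy model and not the Lucene API: the class ToyIndexWriter, its analyze method, and the field names "title" and "body" are all assumptions made for illustration. It analyzes the text, stores one field, indexes another, and keeps the index in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the write path; hypothetical class, not part of Lucene.
class ToyIndexWriter {
    // term -> IDs of the documents whose indexed field contains the term
    final Map<String, List<Integer>> postings = new TreeMap<>();
    // document ID -> stored fields, returned verbatim at search time
    final List<Map<String, String>> storedFields = new ArrayList<>();

    // Stand-in for an analyzer: lowercase, then split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Adds one document: "title" is stored but not indexed, "body" is indexed but not stored.
    int addDocument(String title, String body) {
        int docId = storedFields.size();
        Map<String, String> stored = new HashMap<>();
        stored.put("title", title);
        storedFields.add(stored);
        for (String term : analyze(body)) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }
}
```

Calling addDocument twice on a new ToyIndexWriter assigns the IDs 0 and 1, and postings then maps each term of the two bodies to the IDs of the documents that contain it. A real index also records positions and frequencies, which this sketch leaves out.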

4. Read Process

The user provides search keywords, which are processed by an analyzer.
The index is searched with the processed keywords to find the matching Documents.
The user extracts the required Fields from each Document as needed.
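
The read steps above can be sketched the same way. Again this is a toy model, not the Lucene API: ToyIndexSearcher and its methods are hypothetical names, and requiring every keyword to match (AND semantics) is an assumption of this example. The point it shows is that the query goes through the same analysis as the documents before the index lookup, and that only the stored field comes back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model of the read path; hypothetical class, not part of Lucene.
class ToyIndexSearcher {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final List<String> storedTitles = new ArrayList<>();

    // Minimal write side so that the example is self-contained.
    void add(String title, String body) {
        int docId = storedTitles.size();
        storedTitles.add(title);
        for (String term : analyze(body)) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (!docs.contains(docId)) {
                docs.add(docId);
            }
        }
    }

    // The same analysis is applied to documents and to queries.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Returns the stored titles of the documents that contain every query term.
    List<String> search(String keywords) {
        TreeSet<Integer> matches = null;
        for (String term : analyze(keywords)) {
            List<Integer> docs = postings.getOrDefault(term, new ArrayList<>());
            if (matches == null) {
                matches = new TreeSet<>(docs);
            } else {
                matches.retainAll(docs);
            }
        }
        List<String> titles = new ArrayList<>();
        if (matches != null) {
            for (int docId : matches) {
                titles.add(storedTitles.get(docId));
            }
        }
        return titles;
    }
}
```

Because the body is indexed but not stored, a search can return the title of a matching document but not its original body text, which mirrors the distinction between indexed and stored fields.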

5. Lucene terminology
Lucene is actually quite simple. It mainly does two things: creating indexes and searching.
Let's take a look at some of the terms used in Lucene. I will not introduce them in detail here, only touch on them briefly, because there is a good thing in the world called search.
IndexWriter: one of the most important classes in Lucene. It is mainly used to add documents to the index and to control some of the parameters used during indexing.
Analyzer: analyzes the text that the search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.
Directory: the location where the index is stored. Lucene provides two types of index storage, disk and memory, and correspondingly two classes: FSDirectory and RAMDirectory. Indexes are generally stored on disk.
Document: the unit of indexing. Any file to be indexed must first be converted to a Document object.
Field: a named piece of data within a Document. A field can be indexed, stored, or both.
IndexSearcher: the most basic search tool in Lucene. All searches go through IndexSearcher.
Query: Lucene supports fuzzy, semantic, phrase, and combined queries, for example TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.
QueryParser: a tool that parses user input. It scans the input string and generates a Query object.
Hits: after a search completes, the results must be returned and displayed to the user. In Lucene, the set of search results is represented by an instance of the Hits class. (Later Lucene versions removed Hits in favor of TopDocs.)
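
To make the Query entry more concrete, here is a minimal sketch of how query objects compose. The types below (ToyQuery, ToyTermQuery, ToyWildcardQuery, ToyBooleanQuery) are hypothetical stand-ins that only mirror the roles of TermQuery, WildcardQuery, and BooleanQuery; they are not Lucene classes, and matching against a plain set of terms is an assumption of this example.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical query model; illustrates composition only, not the Lucene API.
interface ToyQuery {
    boolean matches(Set<String> documentTerms);
}

// Matches a document that contains exactly this term.
class ToyTermQuery implements ToyQuery {
    private final String term;

    ToyTermQuery(String term) {
        this.term = term;
    }

    public boolean matches(Set<String> documentTerms) {
        return documentTerms.contains(term);
    }
}

// Matches a document with at least one term fitting the pattern:
// '*' stands for any run of characters, '?' for exactly one character.
class ToyWildcardQuery implements ToyQuery {
    private final Pattern pattern;

    ToyWildcardQuery(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        pattern = Pattern.compile(regex.toString());
    }

    public boolean matches(Set<String> documentTerms) {
        for (String term : documentTerms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }
}

// Combines sub-queries: every "must" clause has to match and no "mustNot" clause may match.
class ToyBooleanQuery implements ToyQuery {
    private final List<ToyQuery> must;
    private final List<ToyQuery> mustNot;

    ToyBooleanQuery(List<ToyQuery> must, List<ToyQuery> mustNot) {
        this.must = must;
        this.mustNot = mustNot;
    }

    public boolean matches(Set<String> documentTerms) {
        for (ToyQuery q : must) {
            if (!q.matches(documentTerms)) {
                return false;
            }
        }
        for (ToyQuery q : mustNot) {
            if (q.matches(documentTerms)) {
                return false;
            }
        }
        return true;
    }
}
```

In this model a QueryParser would be the component that turns a user string into such a tree of query objects, so that application code does not have to build the tree by hand.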

6. Lucene precautions
The index should not be too large, or efficiency will be very low. When the index is larger than 1 GB, you must consider distributing the index.
We do not recommend using multiple threads to create an index. Locking problems are very troublesome: the index often ends up locked and cannot be rebuilt.
Chinese word segmentation is a big problem. The free word segmenters currently available are not very effective. If you are able to, implement a word segmentation module yourself using the shortest-path segmentation method; you can refer to online materials and demo source code.
CPU usage is high while documents are being added to the index. When the access volume is large, CPU idle drops to 0.
The default scoring mechanism is not always reasonable and needs to be customized according to your own business.

Overall, Lucene is difficult to use. It must be extended in the above respects before it can serve as a commercial search engine.
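
As an illustration of the scoring point, here is a minimal sketch of what a business-specific score could look like, assuming a plain TF-IDF relevance value multiplied by a per-document boost. ToyScorer, the smoothing in the IDF formula, and the businessBoost parameter are assumptions of this example; this is not Lucene's scoring formula or its API.

```java
import java.util.List;

// Hypothetical scorer: TF-IDF relevance multiplied by a business-specific boost.
class ToyScorer {
    // Term frequency: how many times the term occurs in the document.
    static int termFrequency(String term, List<String> docTerms) {
        int count = 0;
        for (String t : docTerms) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Inverse document frequency with add-one smoothing: ln((N + 1) / (df + 1)) + 1.
    static double inverseDocumentFrequency(String term, List<List<String>> corpus) {
        int documentFrequency = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                documentFrequency++;
            }
        }
        return Math.log((corpus.size() + 1.0) / (documentFrequency + 1.0)) + 1.0;
    }

    // businessBoost is supplied by the application, for example from freshness or popularity.
    static double score(String term, List<String> doc, List<List<String>> corpus, double businessBoost) {
        return termFrequency(term, doc) * inverseDocumentFrequency(term, corpus) * businessBoost;
    }
}
```

With equal term frequency, a document with a boost of 2.0 scores twice as high as one with a boost of 1.0, which is the kind of business rule a purely text-based score cannot express.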

ElasticSearch is a distributed search framework built on Lucene, with the following features:
Distributed indexing and search
Automatic index sharding and load balancing
Automatic machine discovery and cluster formation
Support for RESTful APIs
Easy configuration
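
To illustrate the idea behind index sharding, here is a minimal sketch of hash-based routing, in which each document ID maps to one of a fixed number of shards. ToyShardRouter is a hypothetical class, and using String.hashCode as the hash is an assumption made for brevity; ElasticSearch's own routing uses a different hash function and differs in its details.

```java
// Hypothetical router: maps a document ID to a shard number by hashing.
class ToyShardRouter {
    private final int shardCount;

    ToyShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // The same ID always lands on the same shard, so reads know where to look.
    int shardFor(String documentId) {
        // floorMod keeps the result in [0, shardCount) even for negative hash codes.
        return Math.floorMod(documentId.hashCode(), shardCount);
    }
}
```

Because the shard number depends on the shard count, changing the count moves most documents to different shards, which is one reason why a scheme like this fixes the number of shards when the index is created.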

1. ElasticSearch Use Cases Abroad
GitHub
GitHub uses ElasticSearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code.
GitHub moved its code search from Solr to ElasticSearch in January 2013. The cluster consists of 26 index storage nodes and 8 client nodes (responsible for processing search requests).

Mozilla
Mozilla, known for Firefox, currently uses the WarOnOrange project for unit and functional testing. The test results are indexed into ElasticSearch in JSON format so that developers can easily find bugs.

Sony
Sony uses ElasticSearch as an information search engine.
