Lucene Full-Text Search Framework

Source: Internet
Author: User

1. What is Lucene?
Lucene is a full-text search framework, not a finished application. It is a tool for building search products; you cannot use it directly the way you use www.baidu.com.
It is an open-source full-text search engine library from the Apache Software Foundation, implemented in Java.

2. How does Lucene work?
The services Lucene provides consist of two parts: one in, one out.
In: writing, that is, building an index from the data source you provide.
Out: reading, that is, providing users with full-text search so they can locate the source content by keyword.

3. Write process
The data source is first processed by an analyzer, which performs tokenization (splitting the text into words).
The required information from the data source is added to the fields of a Document. Fields that need to be indexed are indexed according to the tokens the analyzer produced, and fields that need to be stored are stored.
The index is written to storage, which can be memory or disk.
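
The write steps above can be sketched with a toy in-memory inverted index. This is only an illustration of the idea, not Lucene's API: the names analyze, ToyIndex, and add_document are hypothetical, and the analyzer simply lowercases the text and splits it on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """In-memory inverted index; a stand-in for the concept, not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of document ids
        self.stored = {}                  # doc id -> {field name: original value}

    def add_document(self, doc_id, fields, indexed=(), stored=()):
        # Index the fields marked as indexed, one token at a time.
        for name in indexed:
            for term in analyze(fields[name]):
                self.postings[(name, term)].add(doc_id)
        # Keep the original value only for fields marked as stored.
        self.stored[doc_id] = {name: fields[name] for name in stored}


index = ToyIndex()
index.add_document(
    1,
    {"title": "Lucene in Action", "body": "Lucene is a full-text search library"},
    indexed=("title", "body"),
    stored=("title",),
)
```

Here the body field is searchable but not retrievable, while the title is both, which mirrors the distinction between indexed and stored fields described above.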

4. Read process
The user provides search keywords, which are processed by the analyzer.
The processed keywords are looked up in the index to find the matching Documents.
The user extracts the required fields from the Documents found, as needed.
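
The read steps can be sketched the same way. The example below is a minimal, self-contained illustration rather than Lucene code: docs, postings, and search are hypothetical names, and a keyword query is treated as "every keyword must appear", which is an assumption made for the sketch.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


docs = {
    1: {"title": "Lucene in Action", "body": "Lucene is a full-text search library"},
    2: {"title": "Cooking Basics", "body": "How to search for fresh ingredients"},
}

# Build the inverted index over the body field: term -> set of document ids.
postings = defaultdict(set)
for doc_id, fields in docs.items():
    for term in analyze(fields["body"]):
        postings[term].add(doc_id)


def search(keywords):
    """Return ids of documents whose body contains every analyzed keyword."""
    terms = analyze(keywords)
    if not terms:
        return []
    result = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= postings.get(term, set())
    return sorted(result)


# The keywords go through the same analyzer that was used at index time.
hits = search("Full-Text SEARCH")
# Extract only the field the caller needs from each matching document.
titles = [docs[doc_id]["title"] for doc_id in hits]
```

Running the query through the same analyzer used for indexing is what lets "Full-Text SEARCH" match the lowercase tokens in the index.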

5. Lucene terminology
Lucene is really quite simple; it mainly does two things: indexing and searching.
Here are some of the terms used in Lucene, described only briefly. After all, there is a wonderful thing in this world called search.
IndexWriter: one of the most important classes in Lucene. It is mainly used to add Documents to the index and to control some of the parameters of the indexing process.
Analyzer: analyzes the various texts the search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.
Directory: where the index is stored. Lucene provides two kinds of index location, disk and memory, through the FSDirectory and RAMDirectory classes respectively. The index is generally placed on disk.
Document: the unit of indexing. Any file to be indexed must first be converted into a Document object.
Field: a field of a Document; each field can be indexed, stored, or both.
IndexSearcher: the most basic search tool in Lucene; all retrieval goes through an IndexSearcher.
Query: a query. Lucene supports fuzzy queries, semantic queries, phrase queries, combined queries, and so on, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.
QueryParser: a tool that parses user input; it generates a Query object by scanning the string the user entered.
Hits: after a search completes, the results must be returned and displayed to the user before the search can be considered finished. In Lucene, the collection of search results is represented by an instance of the Hits class.
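
To make the Query and QueryParser terms more concrete, here is a small sketch of how term, wildcard, and combined queries could be evaluated against an inverted index. The classes below only borrow the names of their Lucene counterparts; they are hypothetical Python stand-ins, and the tiny parse function (space-separated words, with a leading "-" meaning exclusion) is an assumed syntax, not Lucene's query syntax.

```python
from fnmatch import fnmatchcase

# Hand-built toy index: term -> ids of the documents containing that term.
POSTINGS = {
    "lucene": {1, 2},
    "search": {1, 3},
    "solr": {3},
    "index": {2},
}


class TermQuery:
    """Matches documents containing one exact term."""

    def __init__(self, term):
        self.term = term

    def run(self, postings):
        return set(postings.get(self.term, set()))


class WildcardQuery:
    """Matches documents containing any term that fits a * / ? pattern."""

    def __init__(self, pattern):
        self.pattern = pattern

    def run(self, postings):
        matched = set()
        for term, doc_ids in postings.items():
            if fnmatchcase(term, self.pattern):
                matched |= doc_ids
        return matched


class BooleanQuery:
    """Combines sub-queries: all `must` clauses are required, `must_not` excludes."""

    def __init__(self, must=(), must_not=()):
        self.must = list(must)
        self.must_not = list(must_not)

    def run(self, postings):
        if not self.must:
            return set()
        result = self.must[0].run(postings)
        for query in self.must[1:]:
            result &= query.run(postings)
        for query in self.must_not:
            result -= query.run(postings)
        return result


def parse(text):
    """Toy query parser: 'a b -c' means a AND b AND NOT c."""
    must, must_not = [], []
    for token in text.lower().split():
        negated = token.startswith("-")
        word = token[1:] if negated else token
        if not word:
            continue
        is_wildcard = "*" in word or "?" in word
        query = WildcardQuery(word) if is_wildcard else TermQuery(word)
        (must_not if negated else must).append(query)
    return BooleanQuery(must, must_not)


hits = parse("lucene -index").run(POSTINGS)
```

The point of the design is that the parser only builds a tree of query objects; evaluation against the index is left to the query classes themselves.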

6. Lucene precautions
The index should not be too large, or efficiency will be very low. Once it exceeds 1 GB, you must consider distributing the index.
Building the index with multiple threads is not recommended. The resulting locking problems are very troublesome: the index is often found to be locked and cannot be rebuilt.
Chinese word segmentation is a big problem; the free segmenters currently available perform very poorly. If you are able to, implement your own segmentation module using the shortest-path segmentation method. Tutorials and demo source code are available online for reference.
Building an incremental index is CPU-intensive; under heavy traffic it can drive CPU idle down to 0%.
The default scoring mechanism is not very reasonable and needs to be tailored to your business.
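
As one illustration of tailoring scoring to a business need, the sketch below ranks documents with a simple TF-IDF style formula plus per-field boosts. The formula, the sample documents, and the rule that title matches count double are all assumptions made for the example; this is not Lucene's actual scoring implementation.

```python
import math
from collections import Counter

docs = {
    1: {"title": "lucene search", "body": "a java library"},
    2: {"title": "cooking", "body": "search for a recipe then search again"},
    3: {"title": "gardening", "body": "plant care basics"},
}

DEFAULT_BOOSTS = {"title": 1.0, "body": 1.0}
# Hypothetical business rule: a match in the title counts double.
TAILORED_BOOSTS = {"title": 2.0, "body": 1.0}


def idf(term):
    """Rarer terms get a higher weight (smoothed inverse document frequency)."""
    containing = sum(
        1 for fields in docs.values()
        if any(term in text.split() for text in fields.values())
    )
    return math.log((1 + len(docs)) / (1 + containing)) + 1.0


def score(doc_id, term, boosts):
    """Sum boost * sqrt(term frequency) * idf over the document's fields."""
    total = 0.0
    for field, text in docs[doc_id].items():
        tf = Counter(text.split())[term]
        total += boosts[field] * math.sqrt(tf) * idf(term)
    return total


def rank(term, boosts):
    return sorted(docs, key=lambda doc_id: score(doc_id, term, boosts), reverse=True)


default_ranking = rank("search", DEFAULT_BOOSTS)
tailored_ranking = rank("search", TAILORED_BOOSTS)
```

With equal boosts, the document that repeats the term in its body ranks first; once title matches are boosted, the document with the term in its title moves ahead.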

Overall, Lucene needs to be extended in the areas above before it can serve as a commercial search engine.
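
The shortest-path segmentation method mentioned in the precautions can be sketched as a small dynamic program: treat every dictionary word found in the sentence as an edge, then pick the path that covers the sentence with the fewest words. The dictionary and sentence below are made up for illustration, and a real segmenter would need a far larger dictionary and a way to break ties between equally short paths.

```python
def segment(text, dictionary):
    """Split text into the fewest tokens; unknown characters become 1-char tokens."""
    n = len(text)
    max_len = max((len(word) for word in dictionary), default=1)
    # best[i] = minimum number of tokens needed to cover text[:i]
    best = [0] + [n + 1] * n
    # back[i] = start index of the last token on the best path ending at i
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # A single character is always allowed so that every text is segmentable.
            if piece in dictionary or end - start == 1:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical mini-dictionary for the example sentence.
DICTIONARY = {"中文", "分词", "中文分词", "难题", "是"}
words = segment("中文分词是难题", DICTIONARY)
```

In this example the three-word path through 中文分词 beats the four-word path through 中文 and 分词, so the longer dictionary word wins.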

Elasticsearch is a distributed search framework based on Lucene. Its features include:
Distributed indexing and search
Automatic index sharding and load balancing
Automatic node discovery and cluster formation
RESTful-style interfaces
Simple configuration, and so on.
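
As a rough illustration of the RESTful interface, the sketch below builds (but does not send) a full-text search request against the _search endpoint using a match query. The host http://localhost:9200, the index name articles, and the field name body are assumptions for the example, and build_search_request is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text, size=10):
    """Build (but do not send) a full-text search request for the _search endpoint."""
    body = {"query": {"match": {field: text}}, "size": size}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node, index name, and field name; adjust to your own cluster.
request = build_search_request("http://localhost:9200", "articles", "body", "lucene")
```

Passing the request to urllib.request.urlopen would send it, which requires a running Elasticsearch node at the assumed address.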

Notable international Elasticsearch use cases
GitHub
GitHub uses Elasticsearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code.
GitHub migrated its code search from Solr to Elasticsearch in January 2013. The cluster currently consists of 26 index storage nodes plus 8 client nodes (which handle search requests).

Mozilla
Mozilla is famous for Firefox. It currently uses the WarOnOrange project for unit or functional testing; the test results are indexed into Elasticsearch as JSON, so developers can easily find bugs.

Sony
Sony uses Elasticsearch as its information search engine
