Based on Lucene 3.0.1
1. A brief definition of Lucene
Lucene is a high-performance, extensible information retrieval (IR) library. It gives users an easy-to-use indexing and search API and hides the complex, advanced information retrieval techniques behind it.
Lucene is only a class library that provides search functionality. You have to build the other modules of your search application yourself (web crawling, document processing, server operation, user interface, administration, and so on) according to your actual needs.
Lucene's core jar is small, only about 1 MB, and there are many extension modules outside the core. In addition, if you are looking for some kind of ready-made web crawler, document processor, or search engine, you can find one on the technical support page of the Lucene wiki.
Digression: Nutch is an open source web search engine created by Lucene's founder. Hadoop grew out of the Nutch project and provides distributed storage and computation tools through the MapReduce architecture.
2. Components of a search application
2.1 Indexing components
An index is the output of the indexing process.
Acquire content
Consider where the content to be searched comes from: the file system, a content management system, websites, databases, local documents, and so on. Consider whether access control is required for search. Consider whether an incremental mechanism is required, so that each run indexes only the documents that have changed since the last run. Consider whether the content acquisition module should run as a continuous background service that picks up new or modified documents in real time.
There are many open source content acquisition applications, for example:
Solr: supports relational databases and XML documents, and with Tika integrated can handle complex documents
Nutch: crawler for large-scale websites
Grub: web crawler tool
Heritrix: crawler for Internet documents
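The incremental run mentioned above can be sketched roughly as follows. This is only a toy illustration in plain Java, not a Lucene API: the class IncrementalAcquirer, the method changedSince, and the idea of keeping a map from document id to last-modified time are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of an incremental content-acquisition pass (hypothetical, not Lucene).
final class IncrementalAcquirer {

    // Returns the ids of documents modified strictly after lastRunMillis.
    // lastModified maps a hypothetical document id to its modification time.
    static List<String> changedSince(Map<String, Long> lastModified, long lastRunMillis) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastModified.entrySet()) {
            if (e.getValue() > lastRunMillis) {
                changed.add(e.getKey());
            }
        }
        Collections.sort(changed); // predictable order for the indexing step
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Long> source = new LinkedHashMap<>();
        source.put("report.pdf", 100L);
        source.put("notes.txt", 250L);
        source.put("index.html", 300L);
        // Only documents touched after the previous run (time 200) are re-indexed.
        System.out.println(changedSince(source, 200L));
    }
}
```

A real acquisition module would also have to notice deleted documents, which a timestamp comparison alone cannot detect.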
Build documents
A document usually consists of several fields, such as title, body, abstract, author, and URL. Consider how to split the raw content into appropriate documents. Consider whether a document filter is needed to strip the original file format and other information from the raw document. Consider whether additional fields need to be created. Consider whether documents and fields need to be weighted; weighting can be done when the document is created or dynamically at search time. Lucene provides an API for building fields and documents, and Tika handles document filtering well.
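To make the idea of a document as a set of named fields with weights more concrete, here is a minimal sketch. ToyDocument and its methods are hypothetical names invented for this note; they are not Lucene's document API, and multiplying the document weight by the field weight is simply the rule this sketch assumes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a search document (hypothetical, not Lucene's API).
final class ToyDocument {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final Map<String, Float> fieldBoosts = new LinkedHashMap<>();
    private float docBoost = 1.0f;

    // Adds a field with the neutral weight 1.0.
    ToyDocument add(String name, String value) {
        return add(name, value, 1.0f);
    }

    // Adds a field with an explicit weight.
    ToyDocument add(String name, String value, float fieldBoost) {
        values.put(name, value);
        fieldBoosts.put(name, fieldBoost);
        return this;
    }

    // Sets the weight of the whole document.
    ToyDocument setBoost(float boost) {
        this.docBoost = boost;
        return this;
    }

    String get(String name) {
        return values.get(name);
    }

    // Assumption of this sketch: document weight times field weight.
    float effectiveBoost(String name) {
        Float fieldBoost = fieldBoosts.get(name);
        if (fieldBoost == null) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        return docBoost * fieldBoost;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument()
                .add("title", "Meet Lucene", 2.0f)
                .add("author", "unknown")
                .add("url", "http://example.com/meet-lucene")
                .setBoost(1.5f);
        System.out.println(doc.get("title") + " boost=" + doc.effectiveBoost("title"));
    }
}
```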
Analyze documents (tokenization)
The text is split into individual tokens (word segmentation).
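A very rough picture of what the analysis step does, assuming English-like text: lowercase the input, split it on anything that is not a letter or digit, and drop a few stop words. ToyAnalyzer and its stop-word list are invented for this sketch and do not correspond to any Lucene analyzer; languages without word separators, such as Chinese, need a real segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy tokenizer (hypothetical, not a Lucene analyzer).
final class ToyAnalyzer {
    // Small illustrative stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Lowercases, splits on non-letter/non-digit characters, removes stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a high-performance IR library."));
    }
}
```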
Build the index
Add the documents to the index. Lucene provides a remarkably simple API for the indexing operation.
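Conceptually, adding a document to an index means recording, for each term, which documents contain it (an inverted index). The sketch below shows only that idea in plain Java. ToyInvertedIndex is a hypothetical class, and Lucene's real index format and API are far more elaborate.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory inverted index (hypothetical, not Lucene's implementation).
final class ToyInvertedIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text naively and records each term; returns the new doc id.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Returns a read-only view of the doc ids containing the term (may be empty).
    SortedSet<Integer> postingsFor(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            ids = new TreeSet<>();
        }
        return Collections.unmodifiableSortedSet(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene in Action");
        index.addDocument("Lucene search library");
        System.out.println(index.postingsFor("lucene"));
    }
}
```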
2.2 Search Components
Search user interface
Lucene does not provide a default user interface. You have to provide one yourself and convert the user's search request into a query object that the search engine can use.
Build the query
For creating query objects, Lucene provides a powerful package called QueryParser.
Run the query
Query the index and return the matching documents. The returned results may be filtered and sorted.
Digression: the common theoretical models of search are the pure Boolean model, the vector space model, and the probabilistic model. Lucene lets you decide, for each specific search, whether to use the pure Boolean model or the vector space model.
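The difference between the pure Boolean model and the vector space model can be illustrated with a small sketch. In the Boolean model a document either matches or it does not; in the vector space model query and document are treated as term vectors and compared, here with cosine similarity over raw term counts. ToySearchModels is a hypothetical class written for this note; it is a simplification and does not reproduce Lucene's actual scoring formula.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy comparison of two search models (hypothetical, not Lucene's scoring).
final class ToySearchModels {

    // Counts how often each term occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Integer old = tf.get(term);
            tf.put(term, old == null ? 1 : old + 1);
        }
        return tf;
    }

    // Pure Boolean model: the document matches only if it contains every term.
    static boolean booleanAnd(String doc, String... terms) {
        Map<String, Integer> tf = termFrequencies(doc);
        for (String term : terms) {
            if (!tf.containsKey(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Vector space model: cosine of the angle between the two term-count vectors.
    static double cosine(String query, String doc) {
        Map<String, Integer> q = termFrequencies(query);
        Map<String, Integer> d = termFrequencies(doc);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            Integer w = d.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double norm = Math.sqrt(sumOfSquares(q)) * Math.sqrt(sumOfSquares(d));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    private static double sumOfSquares(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(booleanAnd("Lucene search library", "lucene", "search"));
        System.out.println(cosine("lucene search", "lucene lucene search"));
    }
}
```

The Boolean check gives a yes/no answer with no ordering, while the cosine value gives a score that can be used to rank the matches.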
Render results
This part is self-explanatory.
2.3 Other modules of a search application
Administration interface
For administration, Lucene provides a number of configuration options for developers.
Analytics interface
Lucene does not provide analytics tools.
Distributed search
There are two basic approaches:
If the amount of data to process is large, you can divide the data into smaller pieces, let multiple servers each search their own part, and then merge the partial results into the overall result set.
If you want to increase search concurrency, you can copy the same index to multiple servers and distribute the requests with load balancing.
Lucene itself does not provide distributed support, but both Solr and Nutch under the Apache Lucene project support index splitting and replication. There are also projects such as Katta and Elasticsearch. Before building your own search solution, it is worth learning about these existing ones.
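The first approach, splitting the data and merging the partial results, comes down to a merge step like the one sketched here. ShardMerger and Hit are hypothetical names for this note, not part of Lucene, Solr, or Nutch. The sketch also assumes that scores coming from different servers are directly comparable, which a real system has to ensure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy merge of per-server search results (hypothetical, not a Lucene API).
final class ShardMerger {

    // One search hit: a document id and its score.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Merges the hit lists of all shards and keeps the k best-scoring hits.
    static List<Hit> mergeTopK(List<List<Hit>> shardResults, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shard : shardResults) {
            all.addAll(shard);
        }
        Collections.sort(all, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                int byScore = Double.compare(b.score, a.score); // higher score first
                return byScore != 0 ? byScore : a.docId.compareTo(b.docId);
            }
        });
        int limit = Math.max(0, Math.min(k, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Hit> shard1 = Arrays.asList(new Hit("a", 0.9), new Hit("b", 0.4));
        List<Hit> shard2 = Arrays.asList(new Hit("c", 0.7));
        for (Hit hit : mergeTopK(Arrays.asList(shard1, shard2), 2)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```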
Lucene in Action 01: A First Look at Lucene