1.3 Search Program Components
Lucene provides the core modules of a search application: class libraries for the indexing module and the search module.
Solr is built on Lucene and adds a richer UI and APIs, so it can be deployed and used directly.
The figure shows the basic framework of a search application. The dark middle section is the functionality that Lucene provides, and it is also the core of the search engine.
Search engine evaluation:
Basic functionality: search results are displayed correctly
Search response time
Extended features: spelling and syntax correction, keyword highlighting, etc.
1.3.1 Index Component
Search engine principle:
Naive approach: sequential search
Problem: too slow
Solution: index the text content and answer queries from the index
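To make the cost of the naive approach concrete, here is a minimal sketch of sequential search in plain Java. The class name SequentialSearch and the sample sentences are made up for illustration and are not part of Lucene. Every query re-reads the full text of every document, which is why this approach becomes too slow as the collection grows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not a Lucene class): scans every document on every query. */
final class SequentialSearch {

    /** Returns the 0-based positions of the documents whose text contains the keyword. */
    static List<Integer> search(List<String> docs, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            // The whole text of each document is scanned again for each query.
            if (docs.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Lucene is a search library",
                "Solr is built on Lucene",
                "Nutch is a web crawler");
        System.out.println(search(docs, "lucene")); // prints [0, 1]
    }
}
```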
1. Acquire content:
Web content: crawler tools
A specific file system directory or database content: easy to access
Scattered content (file systems, LAN content, etc.): difficult to acquire
Systems with access control: more complex; you need root privileges to obtain the permission list and enforce permission checks at search time
Content acquisition should run incrementally so that the index can be updated in real time
Lucene does not provide content acquisition; it relies entirely on your own programs or third-party tools:
Solr: supports databases and XML, and integrates Tika
Nutch: web crawler
Lily: a distributed search system built on Solr and Hadoop
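The incremental requirement above can be sketched as follows. This is only an illustration under stated assumptions: the class IncrementalAcquisition, the method changedSince, and the idea of tracking a last-modified timestamp per source are hypothetical choices, not something Lucene, Solr, or Nutch prescribes. The point is that each run picks up only what changed since the previous run instead of fetching everything again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not a Lucene class): selects only the sources that changed. */
final class IncrementalAcquisition {

    /**
     * Returns the names of the sources modified after lastRunMillis, sorted by name.
     * The map holds source name -> last-modified time in epoch milliseconds.
     */
    static List<String> changedSince(Map<String, Long> lastModified, long lastRunMillis) {
        List<String> changed = new ArrayList<>();
        // TreeMap gives a stable, alphabetical iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(lastModified).entrySet()) {
            if (e.getValue() > lastRunMillis) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Long> lastModified = Map.of("a.txt", 100L, "b.txt", 250L, "c.txt", 300L);
        System.out.println(changedSince(lastModified, 200L)); // prints [b.txt, c.txt]
    }
}
```

This sketch ignores deleted sources; a real acquisition job would also need to detect removals so that stale entries can be dropped from the index.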
2. Create a document:
Convert content of every format (a file, a database record, etc.) into the document class that Lucene recognizes: Document. A Document mainly consists of fields with values, such as title, body text, and author. You can define custom fields, and you can also use a semantic parser to extract the body text and write it to a separate new field.
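The idea of a document as a set of named fields can be sketched in plain Java. SimpleDocument below is a hypothetical stand-in written for illustration; it is not Lucene's Document class, and the field names and values are sample data. It only shows the shape of the conversion step: one record goes in, one document with named fields comes out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a search document: an ordered set of named fields. */
final class SimpleDocument {
    private final Map<String, String> fields = new LinkedHashMap<>();

    /** Adds or replaces a field and returns this document so calls can be chained. */
    SimpleDocument add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    /** Returns the value of the field, or null if the field is absent. */
    String get(String name) {
        return fields.get(name);
    }

    /** Read-only view of all fields in insertion order. */
    Map<String, String> fields() {
        return Collections.unmodifiableMap(fields);
    }

    /** Converts one database-style record (column name -> value) into a document. */
    static SimpleDocument fromRecord(Map<String, String> record) {
        SimpleDocument doc = new SimpleDocument();
        record.forEach(doc::add);
        return doc;
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument()
                .add("title", "Sample title")
                .add("content", "Sample body text")
                .add("author", "Sample author");
        System.out.println(doc.fields().keySet()); // prints [title, content, author]
    }
}
```

The sketch keeps a single value per field name for simplicity; Lucene's own Document allows several fields with the same name.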
3. Analyze the document:
The field values of the document are analyzed so that they can be indexed.
This is mainly done by tokenizers and filters, with operations such as case normalization, stemming, and word segmentation.
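A tokenizer followed by a chain of filters can be sketched like this. SimpleAnalyzer is a hypothetical class, not Lucene's Analyzer, and the stemmer is a deliberately naive suffix stripper rather than a real stemming algorithm such as Porter's; it is only meant to show where each operation sits in the pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical analysis pipeline: tokenizer -> lowercase filter -> naive stem filter. */
final class SimpleAnalyzer {

    /** Turns raw field text into the list of terms that would be indexed. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Tokenizer: split on any run of characters that are not letters or digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces one empty token
            }
            String lower = token.toLowerCase(Locale.ROOT); // filter 1: case normalization
            terms.add(stem(lower));                        // filter 2: stemming
        }
        return terms;
    }

    /** Naive stemmer (assumption: English only): strips a trailing "ing" or plural "s". */
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene: Indexing Documents!")); // prints [lucene, index, document]
    }
}
```

Splitting on punctuation and whitespace does not work for languages written without spaces between words, which is why word segmentation is listed as a separate operation.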
4. Index the document: index the results of the analysis and add them to the index
Lucene stores the index as an inverted index, which maps each term to the documents that contain it.
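A minimal in-memory sketch of an inverted index follows. InvertedIndexSketch is a hypothetical class for illustration and says nothing about how Lucene lays out its index on disk; the analysis step is reduced to lowercasing and splitting on non-word characters. What it does show is the key property: a query becomes a single map lookup on the term instead of a scan over all document text.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index: term -> sorted set of document IDs. */
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Adds every term of the text to the postings list of the given document ID. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the IDs of the documents containing the term (empty if there are none). */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search library");
        index.add(1, "Solr is built on Lucene");
        index.add(2, "Nutch is a web crawler");
        System.out.println(index.search("lucene")); // prints [0, 1]
    }
}
```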
Lucene in Action: A First Look at Lucene