1. Basic Content
(1) Related Concepts
Analysis is the process of converting field text into its most basic indexed representation unit, the term. During a search, these terms are used to determine which documents match the query.
An Analyzer encapsulates the analysis operations. It converts text into tokens by performing several operations. This process is called tokenization, and the chunks of text extracted from the text stream are called tokens. A token combined with its field name forms a term.
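The relationship between a token, its field name, and the resulting term can be sketched in plain Java. This is only an illustration and does not use Lucene: the class name TermSketch, the method toTerms, and the "field:token" string form are assumptions made for this example, and real analyzers do much more than split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class TermSketch {

    // Splits the text on whitespace (a stand-in for tokenization) and pairs
    // each token with the field name, written here as "field:token".
    static List<String> toTerms(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(field + ":" + token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [contents:The, contents:quick, contents:brown, contents:fox]
        System.out.println(toTerms("contents", "The quick brown fox"));
    }
}
```

The same token text in two different fields yields two different terms, which is why a search against "contents" does not match text indexed under another field.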
(2) When an Analyzer is used
- When an index is built with IndexWriter
Directory returnIndexDir = FSDirectory.open(indexDir);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(returnIndexDir, iwc);
- When a QueryParser object is used for searching
QueryParser parser = new QueryParser(Version.LUCENE_48, "contents", new SimpleAnalyzer(Version.LUCENE_48));
- When the search results are highlighted
(3) Four commonly used analyzers:
- WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
- SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
- StopAnalyzer is the same as SimpleAnalyzer, except it removes common words (called stop words, described more in section XXX). By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
- StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names.
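The differences between the first three analyzers can be approximated in plain Java. The sketch below is not Lucene code: the class AnalyzerSketch, its methods, and the small stop-word set are all assumptions made for illustration, and the real analyzers are built from Lucene tokenizers and filters. StandardAnalyzer is left out because its logic is too involved for a short sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical approximation of analyzer behavior; not Lucene's implementation.
class AnalyzerSketch {

    // Illustrative stop words only (an assumption, not Lucene's default set).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Like WhitespaceAnalyzer: split on whitespace, no normalization.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split at non-letter characters, then lowercase.
    // Digits are non-letters, so numbers are silently dropped.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple() output with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox 42";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox, 42]
        System.out.println(simple(text));     // [the, quick, brown, fox]
        System.out.println(stop(text));       // [quick, brown, fox]
    }
}
```

Note how "42" survives only the whitespace split, which matches the warning above about SimpleAnalyzer discarding numeric characters.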
2. Other Content
When creating an IndexWriter, you must specify the analyzer, for example:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
writer = new IndexWriter(returnIndexDir, iwc);
You can also specify an analyzer for an individual document each time you add it to the writer, as shown below:
writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));