Using the Java API to Call Lucene in Spring Boot
Lucene is a sub-project of the Apache Software Foundation's Jakarta project team. It is an open-source full-text retrieval engine toolkit, but it is not a complete full-text retrieval engine; rather, it is a full-text search engine architecture that provides a complete query engine and index engine, along with some text analysis engines (for English and German, among others). Lucene aims to provide software developers with a simple, easy-to-use toolkit for conveniently implementing full-text retrieval in a target system, or for building a complete full-text retrieval engine on top of it.
Overview
For example, a folder or disk may contain many files: Notepad text files, Word documents, Excel spreadsheets, PDFs, and so on. We want to find the files that contain a given keyword; for example, if we enter "Lucene", all files containing "Lucene" should be returned. This is what is called full-text search.
It is therefore natural to think that we should establish a mapping between keywords and files. The diagram below (borrowed from a slide deck) clearly explains how this mapping is implemented.
Inverted index
With this mapping relationship in place, let's look at Lucene's architecture design.
The following is a diagram of Lucene's data flow; it is also a summary of Lucene's essence.
As we can see, using Lucene involves two main steps:
1. Create an index: use IndexWriter to build indexes for the various files and store them in the index directory.
2. Search: retrieve the documents related to a keyword through the index.
In Lucene, this mapping is implemented with the "inverted index" technique.
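To make the mapping concrete, here is a minimal sketch of an inverted index written as a plain Java map from each term to the set of documents that contain it. This is an illustration of the idea only, not Lucene's actual implementation; the class and method names are made up for the example.

import java.util.*;

public class InvertedIndexSketch {
    // The inverted mapping: term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // "Create an index": split each document into terms and record where each term occurs
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // "Search": a keyword lookup is a single map access, no matter how many documents exist
    public Set<Integer> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "Lucene in Action");
        idx.addDocument(2, "Managing Gigabytes");
        System.out.println(idx.search("lucene")); // prints [1]
    }
}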
Lucene Mathematical Model
Documents, fields, and terms
The document is the atomic unit of Lucene search and indexing. A document is a container that holds one or more fields, and the fields in turn contain the "real" content being searched; field values are obtained through word segmentation.
For example, the information about a novel can be treated as a document. The novel's information contains multiple fields, such as the title, the author, the introduction, and the last update time. Applying word segmentation to the title field yields one or more terms (for instance, a Chinese title such as "Doupo Cangqiong" may be segmented into the terms "Dou", "Po", "Cang", and "Qiong").
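As a sketch of this model in Lucene's own API (Lucene 7.x; the field names and values here are illustrative, not taken from the demo project):

// A novel as a Lucene Document: one container holding several fields
Document novel = new Document();
novel.add(new TextField("title", "Doupo Cangqiong", Field.Store.YES));   // tokenized into terms and searchable
novel.add(new StringField("author", "Author Name", Field.Store.YES));    // indexed as a single, untokenized term
novel.add(new TextField("introduction", "A short synopsis...", Field.Store.NO)); // searchable but not stored
novel.add(new StoredField("lastUpdate", "2017-11-01"));                  // stored only, not searchable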
Lucene file structure
Hierarchy
Index
An index is stored in a directory.
Segment
An index can contain multiple segments, which are independent of each other. Adding new documents may produce new segments, and different segments can be merged into a new segment.
Document
A document is the basic unit of index creation. Different documents are stored in different segments; one segment can contain multiple documents.
Field
Field. A document contains different types of information; these can be split into separate fields and indexed individually.
Term
Term, the smallest unit of the index; terms are the data produced after lexical analysis and language processing.
Forward information
Forward information is stored layer by layer along the hierarchy: index --> segment --> document --> field --> term.
Reverse Information
Reverse information stores the dictionary and the inverted-table mapping: term --> document.
IndexWriter
IndexWriter is one of the most important classes in Lucene. It is mainly used to add documents to the index and to control various parameters during the indexing process.
Analyzer
Analyzer is used to analyze the various kinds of text a search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.
Directory
The index storage location. Lucene provides two kinds of index storage locations, disk and memory; indexes are usually stored on disk. Correspondingly, Lucene provides two classes, FSDirectory and RAMDirectory.
Document
A Document is the unit of indexing: any file to be indexed must first be converted into a Document object.
Field
Field. A field is a component of a Document, consisting of a name and a value; options on the field control whether its value is stored, indexed, and tokenized.
IndexSearcher
IndexSearcher is the most basic retrieval tool in Lucene; all searches are performed through an IndexSearcher.
Query
Query. Lucene supports fuzzy queries, semantic queries, phrase queries, and combined queries through Query subclasses such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.
QueryParser
QueryParser is a tool for parsing user input: it scans a user-entered string and generates a Query object from it.
Hits
After the search completes, the results must be returned and displayed to the user; only then is the search finished. In Lucene, the set of search results was traditionally represented by instances of the Hits class; in current versions it is represented by TopDocs and ScoreDoc, which the examples below use.
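Before moving on to the test cases, here is a minimal end-to-end sketch that ties these classes together. It assumes Lucene 7.x with a StandardAnalyzer and an in-memory RAMDirectory; the class name and sample text are illustrative only.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        Directory directory = new RAMDirectory();               // Directory: where the index lives (in memory here)

        // Analyzer + IndexWriter: create the index
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();                          // Document: the unit of indexing
        doc.add(new TextField("content", "hello lucene world", Field.Store.YES)); // Field: name/value pair
        writer.addDocument(doc);
        writer.close();

        // IndexSearcher + QueryParser: search the index
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("content", new StandardAnalyzer()).parse("lucene");
        TopDocs topDocs = searcher.search(query, 10);           // TopDocs plays the role the old Hits class did
        System.out.println("A total of " + topDocs.totalHits + " documents are found");
        reader.close();
    }
}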
Test Cases
Github code
I have put the code on GitHub, in the spring-boot-lucene-demo project.
Github spring-boot-lucene-demo
Add dependency
<!-- Parsing, word segmentation, indexing, and query -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

<!-- Highlighting -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>7.1.0</version>
</dependency>

<!-- Smartcn Chinese analyzer (SmartChineseAnalyzer); the smartcn module depends on lucene and must match the lucene version -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>7.1.0</version>
</dependency>

<!-- ik-analyzer Chinese analyzer -->
<dependency>
    <groupId>cn.bestwu</groupId>
    <artifactId>ik-analyzers</artifactId>
    <version>5.1.0</version>
</dependency>

<!-- MMSeg4j analyzer -->
<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-solr</artifactId>
    <version>2.4.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Configure lucene
// Imports used by the test class (JUnit 4 and Lucene 7.x)
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;

private Directory directory;
private IndexReader indexReader;
private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
    // Index storage location: the indexDir/ directory under the current directory
    directory = FSDirectory.open(Paths.get("indexDir/"));
    // Reader over the created index
    indexReader = DirectoryReader.open(directory);
    // Create an index searcher to query the index library
    indexSearcher = new IndexSearcher(indexReader);
}

@After
public void tearDown() throws Exception {
    indexReader.close();
}

/**
 * Execute the query and print the number of matched records
 *
 * @param query
 * @throws IOException
 */
public void executeQuery(Query query) throws IOException {
    TopDocs topDocs = indexSearcher.search(query, 100);
    // Print the number of matched records
    System.out.println("A total of " + topDocs.totalHits + " documents are found");
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Get the corresponding Document object
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id: " + document.get("id"));
        System.out.println("title: " + document.get("title"));
        System.out.println("content: " + document.get("content"));
    }
}

/**
 * Print the tokens an analyzer produces for a piece of text
 *
 * @param analyzer
 * @param text
 * @throws IOException
 */
public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    try {
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
    } finally {
        tokenStream.close();
        analyzer.close();
    }
}
Create an index
@Test
public void indexWriterTest() throws IOException {
    long start = System.currentTimeMillis();

    // Index storage location: the indexDir/ directory under the current directory
    Directory directory = FSDirectory.open(Paths.get("indexDir/"));

    // In versions after 6.6, Version is no longer required; analyzers have no-argument
    // constructors, so the default StandardAnalyzer can be used directly.
    // Version version = Version.LUCENE_7_1_0;

    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    // Create the index-writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

    // Create the index-writer object
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    // Create a Document object to store in the index
    Document doc = new Document();
    int id = 1;

    // Add fields to the document
    doc.add(new IntPoint("id", id));
    doc.add(new StringField("title", "Spark", Field.Store.YES));
    doc.add(new TextField("content", "Apache Spark is a fast and universal computing engine designed for large-scale data processing", Field.Store.YES));
    doc.add(new StoredField("id", id));

    // Save the document to the index library
    indexWriter.addDocument(doc);
    indexWriter.commit();

    // Close the writer
    indexWriter.close();

    long end = System.currentTimeMillis();
    System.out.println("Indexing took " + (end - start) + " milliseconds");
}
Response
17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - Loading extended dictionary: ext.dic
17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - Loading extended stopword dictionary: stopword.dic
Indexing took 879 milliseconds
Delete document
@Test
public void deleteDocumentsTest() throws IOException {
    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    // Create the index-writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

    // Create the index-writer object
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    // Delete the documents whose title contains the keyword "Spark"
    long count = indexWriter.deleteDocuments(new Term("title", "Spark"));

    // IndexWriter also provides the following delete methods:
    // deleteDocuments(Query query):     delete one or more documents matching the query
    // deleteDocuments(Query[] queries): delete documents matching any of the queries
    // deleteDocuments(Term term):       delete one or more documents matching the term
    // deleteDocuments(Term[] terms):    delete documents matching any of the terms
    // deleteAll():                      delete all documents
    //
    // When you delete a document through IndexWriter, it is not removed immediately;
    // the deletion is buffered and actually executed when IndexWriter.commit() or
    // IndexWriter.close() is called.
    indexWriter.commit();
    indexWriter.close();

    System.out.println("Deleted successfully: " + count);
}
Response
Deleted successfully: 1
Update document
/**
 * Test update.
 *
 * An update is in fact a delete followed by an add.
 *
 * @throws IOException
 */
@Test
public void updateDocumentTest() throws IOException {
    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    // Create the index-writer configuration
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

    // Create the index-writer object
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

    Document doc = new Document();
    int id = 1;
    doc.add(new IntPoint("id", id));
    doc.add(new StringField("title", "Spark", Field.Store.YES));
    doc.add(new TextField("content", "Apache Spark is a fast and universal computing engine designed for large-scale data processing", Field.Store.YES));
    doc.add(new StoredField("id", id));

    long count = indexWriter.updateDocument(new Term("id", "1"), doc);
    System.out.println("Update document: " + count);
    indexWriter.close();
}
Response
Update document: 1
Search by term
/**
 * Search by term.
 *
 * TermQuery is the simplest and most commonly used Query. TermQuery can be understood
 * as "term search": the most basic search in a search engine is to look up a certain
 * term in the index, and TermQuery is what accomplishes this task. In Lucene, a term
 * is the most basic unit of search. In essence, a term is a name/value pair, where
 * the "name" is the field name and the "value" is a keyword contained in that field.
 *
 * @throws IOException
 */
@Test
public void termQueryTest() throws IOException {
    String searchField = "title";

    // This is the API for a conditional query, used to add the query condition
    TermQuery query = new TermQuery(new Term(searchField, "Spark"));

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Multi-condition Query
/**
 * Multi-condition query.
 *
 * BooleanQuery is also a Query that is frequently used during development. It is
 * actually a combined query: you can add various Query objects to it and mark the
 * logical relationships between them. BooleanQuery is a container of Boolean clauses;
 * it provides a dedicated API for adding clauses and expressing their relationships.
 *
 * @throws IOException
 */
@Test
public void booleanQueryTest() throws IOException {
    String searchField1 = "title";
    String searchField2 = "content";
    Query query1 = new TermQuery(new Term(searchField1, "Spark"));
    Query query2 = new TermQuery(new Term(searchField2, "Apache"));
    BooleanQuery.Builder builder = new BooleanQuery.Builder();

    // BooleanClause is the class that represents the relationship of a Boolean query
    // clause. The possible values are:
    //   BooleanClause.Occur.MUST      (must contain)
    //   BooleanClause.Occur.MUST_NOT  (must not contain)
    //   BooleanClause.Occur.SHOULD    (may contain)
    // These combine in six ways:
    //   1. MUST and MUST:         take the intersection of the clauses' results.
    //   2. MUST and MUST_NOT:     the results must not contain matches of the MUST_NOT clause.
    //   3. SHOULD and MUST_NOT:   behaves the same as MUST and MUST_NOT.
    //   4. SHOULD and MUST:       the result is that of the MUST clause, but SHOULD affects the ranking.
    //   5. SHOULD and SHOULD:     an "or" relationship; the final result is the union of all clauses.
    //   6. MUST_NOT and MUST_NOT: meaningless; no results are returned.
    builder.add(query1, BooleanClause.Occur.SHOULD);
    builder.add(query2, BooleanClause.Occur.SHOULD);
    BooleanQuery query = builder.build();

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Match prefix
/**
 * Prefix matching.
 *
 * PrefixQuery matches documents whose indexed terms start with the specified string;
 * it is the analogue of SQL's LIKE 'xxx%'.
 *
 * @throws IOException
 */
@Test
public void prefixQueryTest() throws IOException {
    String searchField = "title";
    Term term = new Term(searchField, "Spar");
    Query query = new PrefixQuery(term);

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Phrase search
/**
 * Phrase search.
 *
 * PhraseQuery searches by phrase. For example, if I query the phrase "big car" and
 * the specified field of a document contains the phrase "big car", that document
 * matches. If the text to be matched is "big black car", the match fails. To make
 * such a match succeed, you need to set the slop. The concept of slop: the maximum
 * distance allowed between the positions of the two terms.
 *
 * @throws IOException
 */
@Test
public void phraseQueryTest() throws IOException {
    String searchField = "content";
    String query1 = "apache";
    String query2 = "spark";
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term(searchField, query1));
    builder.add(new Term(searchField, query2));
    builder.setSlop(0);
    PhraseQuery phraseQuery = builder.build();

    // Execute the query and print the number of matched records
    executeQuery(phraseQuery);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
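For intuition about slop, the sketch below loosens the phrase query so the two terms no longer have to be adjacent. The term "engine" and the slop value are chosen for illustration only; whether it actually matches depends on the token positions the analyzer produced at index time.

@Test
public void phraseQueryWithSlopTest() throws IOException {
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term("content", "apache"));
    builder.add(new Term("content", "engine")); // several positions away from "apache" in the indexed sentence
    builder.setSlop(10);                        // allow up to 10 position moves between the two terms
    executeQuery(builder.build());
}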
Search for similar words
/**
 * Search for similar terms.
 *
 * FuzzyQuery is a fuzzy query that can easily match terms similar to the given one.
 *
 * @throws IOException
 */
@Test
public void fuzzyQueryTest() throws IOException {
    String searchField = "content";
    Term t = new Term(searchField, "Large Scale");
    Query query = new FuzzyQuery(t);

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
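FuzzyQuery's tolerance is measured in Levenshtein edit distance (at most 2 edits by default) and can be passed explicitly. A small sketch; the lowercase term "sprak" is hypothetical and assumes the analyzer lowercased English tokens at index time.

@Test
public void fuzzyQueryMaxEditsTest() throws IOException {
    // Restrict the fuzzy match to at most 1 edit (the default is 2)
    Query query = new FuzzyQuery(new Term("content", "sprak"), 1); // one transposition away from "spark"
    executeQuery(query);
}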
Wildcard search
/**
 * Wildcard search.
 *
 * Lucene also provides wildcard queries through WildcardQuery. The wildcard "?"
 * stands for exactly one character, while "*" stands for zero or more characters.
 *
 * @throws IOException
 */
@Test
public void wildcardQueryTest() throws IOException {
    String searchField = "content";
    Term term = new Term(searchField, "Large*scale");
    Query query = new WildcardQuery(term);

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Word Segmentation Query
/**
 * Word segmentation query.
 *
 * @throws IOException
 * @throws ParseException
 */
@Test
public void queryParserTest() throws IOException, ParseException {
    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    String searchField = "content";

    // Specify the search field and the analyzer
    QueryParser parser = new QueryParser(searchField, analyzer);

    // Parse the user's input into a Query object
    Query query = parser.parse("computing engine");

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Multi-field word segmentation query
/**
 * Word segmentation query over multiple fields.
 *
 * @throws IOException
 * @throws ParseException
 */
@Test
public void multiFieldQueryParserTest() throws IOException, ParseException {
    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    String[] fields = new String[]{"title", "content"};

    // Specify the search fields and the analyzer
    QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer);

    // Parse the user's input into a Query object
    Query query = queryParser.parse("Spark");

    // Execute the query and print the number of matched records
    executeQuery(query);
}
Response
A total of 1 documents are found.
Id: 1
Title: Spark
Content: Apache Spark is a fast and universal computing engine designed for large-scale data processing!
Chinese analyzers
/**
 * IKAnalyzer: an ik-analyzer Chinese analyzer.
 * SmartChineseAnalyzer: the smartcn analyzer; it depends on lucene and must match the lucene version.
 *
 * @throws IOException
 */
@Test
public void analyzerTest() throws IOException {
    Analyzer analyzer = null;
    String text = "Apache Spark is a fast and universal computing engine designed for large-scale data processing";

    analyzer = new IKAnalyzer();            // IKAnalyzer Chinese word segmentation
    printAnalyzerDoc(analyzer, text);
    System.out.println();

    analyzer = new ComplexAnalyzer();       // MMSeg4j Chinese word segmentation
    printAnalyzerDoc(analyzer, text);
    System.out.println();

    analyzer = new SmartChineseAnalyzer();  // smartcn Chinese word segmentation
    printAnalyzerDoc(analyzer, text);
}
Responses from the three analyzers
Apachespark is designed for processing large-scale modular data.
Apachespark is a fast and universal computing engine designed for large-scale data processing.
Apachspark is a fast and universal computing engine designed for large-scale data processing.
Highlight
/**
 * Highlighting.
 *
 * @throws IOException
 */
@Test
public void highlighterTest() throws IOException, ParseException, InvalidTokenOffsetsException {
    // Analyzer analyzer = new StandardAnalyzer();     // standard analyzer, suitable for English
    // Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese word segmentation
    // Analyzer analyzer = new ComplexAnalyzer();      // Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();              // Chinese word segmentation

    String searchField = "content";
    String text = "Apache Spark large-scale data processing";

    // Specify the search field and the analyzer
    QueryParser parser = new QueryParser(searchField, analyzer);

    // Parse the user's input into a Query object
    Query query = parser.parse(text);
    TopDocs topDocs = indexSearcher.search(query, 100);

    // HTML tags used to highlight the keywords; requires lucene-highlighter-xxx.jar
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Get the corresponding Document object
        Document document = indexSearcher.doc(scoreDoc.doc);

        // Tokenize the stored content and extract the best highlighted fragment
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(document.get("content")));
        String content = highlighter.getBestFragment(tokenStream, document.get("content"));
        System.out.println(content);
    }
}
Response
<span style='color:red'>Apache</span> <span style='color:red'>Spark</span> is a fast and universal computing engine designed for <span style='color:red'>large-scale data processing</span>!
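If the stored content were longer, the size of the fragments the highlighter returns could also be controlled. A small sketch, to be placed right after the Highlighter is constructed in the test above; SimpleFragmenter ships with lucene-highlighter.

// Return highlighted fragments of roughly 150 characters instead of the default 100
highlighter.setTextFragmenter(new SimpleFragmenter(150));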
I have put the code on GitHub, in the spring-boot-lucene-demo project.
Github spring-boot-lucene-demo
That is all the content of this article. I hope it is helpful for your learning.