Lucene's nature

To answer this question, we must first understand the nature of lucene. lucene's function is actually very simple: in the end, you give it a number of strings, and it gives you back a full-text search service that tells you where the keywords you search for appear. Once you grasp this essence, you can use your imagination to do anything that fits this condition: you can index all the news on a site and make a searchable archive; you can index a few fields of a database table and never again worry about locking the table with "like '%keyword%'" queries; you can even write your own search engine.

1.3 Should you choose lucene?

Here is some test data; if you find it acceptable, you can choose lucene.

Test 1: 2.5 million records, about 800 MB of text, about 300 MB of generated index, average processing time of 300 ms under 800 threads.

Test 2: 37,000 records, two varchar fields of a database table indexed, an index file of 2.6 MB, average processing time of 1.5 ms under 800 threads.

2. How lucene works

The service lucene provides consists of two parts: one "in" and one "out". "In" refers to indexing: writing the sources you provide (essentially strings) into the index, or removing them from it. "Out" refers to search: providing a full-text search service that lets users locate sources by keyword.

2.1 The write process

The source string is first processed by the analyzer, which splits it into words and optionally removes stopwords. The information in the source is then added to the individual Fields of a Document; the Fields marked for indexing are indexed, and the Fields marked for storage are stored. Finally the index is written out to storage, which can be memory or disk.

2.2 The read process

The user provides search keywords, which are processed by the analyzer. The index is then searched with the processed keywords, yielding the matching Documents, and the user extracts the needed Fields from each Document.

3. Some concepts you need to know

lucene uses a handful of concepts; understanding their meanings will help with the explanations that follow.

3.1 analyzer

An analyzer splits a string into words according to certain rules and removes invalid words. Invalid words (stopwords) are words such as "of" and "the" in English, or "的" and "地" in Chinese: they appear frequently in text but carry no key information, and removing them shrinks the index file, improves efficiency, and increases the hit rate. Tokenization rules are ever-changing, but there is only one goal: to split by semantics. This is easy in English, where words are naturally separated by spaces, while Chinese sentences must be divided into words by some other means; specific approaches are described in detail below. For now you only need the concept of an analyzer. (A short sketch of an analyzer at work follows this section.)

3.2 document

A source provided by the user is a record: it can be a text file, a string, or a row of a database table. Once indexed, each record is stored in the index file as a Document, and search results are returned in the form of a list of Documents.
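To make the analyzer concrete, here is a minimal sketch that runs a sentence through StandardAnalyzer and prints each token with its offsets, which also illustrates the token concept of section 3.5. It assumes the legacy Lucene 2.x API used by the rest of this article (TokenStream.next() returning a Token was replaced by an attribute-based API in later versions); the class name AnalyzerDemo is illustrative only.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // Feed a sentence through the analyzer; a stopword like "The" is
            // dropped and the remaining words are lowercased.
            TokenStream stream = analyzer.tokenStream("content", new StringReader("The quick brown fox"));
            Token token;
            while ((token = stream.next()) != null) {
                // Each token carries its text plus start/end character offsets.
                System.out.println(token.termText() + " [" + token.startOffset() + "," + token.endOffset() + ")");
            }
            stream.close();
        }
    }

The expected output is roughly "quick [4,9)", "brown [10,15)", "fox [16,19)": the stopword never reaches the index.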
3.3 field

A Document can contain multiple information fields. For example, an article can contain fields such as "title", "body", and "last modified time", and each of these is held in the Document as a Field. A Field has two attributes: stored and indexed. The stored attribute controls whether the Field is stored; the indexed attribute controls whether it is indexed. This sounds like a tautology, but in fact the correct combination of the two attributes is very important. Using the article example: we want full-text search over the title and the body, so the indexed attribute of both is set to true; we want to show the article title directly in the search results, so the stored attribute of the title field is set to true; but the body field is too large, so to reduce the index file size its stored attribute is set to false, and we read the original file directly when the text is needed; the last modified time only needs to be shown with the search results, not searched, so its stored attribute is set to true and its indexed attribute to false. These three fields cover three of the four combinations of the two attributes; the remaining one, both false, is unused. In fact Field does not allow you to set it, because a field that is neither stored nor indexed is meaningless. (A code sketch of these three combinations appears after section 4.4 below.)

3.4 term

A term is the smallest unit of search. It represents a word in a document and consists of two parts: the word it represents and the field it appears in.

3.5 token

A token is an occurrence of a term. It contains the term's text, the corresponding start and end offsets, and a type string. The same word can appear multiple times in one sentence; all occurrences are represented by the same term, but by different tokens, each marking the place where the word appears.

3.6 segment

When documents are added, they are not each immediately appended to the same index file. They are first written to different small files and then merged into a large index file; each of these small files is a segment.

4. lucene's structure

lucene consists of core and sandbox. core is the stable heart of lucene; sandbox contains additional functionality such as highlighter and various analyzers. lucene core has seven packages: analysis, document, index, queryParser, search, store, and util.

4.1 analysis

analysis contains the built-in analyzers, such as WhitespaceAnalyzer, which splits on whitespace; StopAnalyzer, which adds stopword filtering; and the most commonly used one, StandardAnalyzer.

4.2 document

document contains the data structures of a document. For example, the Document class defines the structure of a stored document, and the Field class defines one field of a Document.

4.3 index

index contains the index read/write classes, such as the IndexWriter class, which writes, merges, and optimizes the segments of the index file, and the IndexReader class, which reads and deletes from the index. Do not be misled by the name IndexReader into thinking it is only a reading class: deleting from an index is also done by it. IndexWriter only cares about how indexes are written into segments and how segments are merged and optimized; IndexReader is concerned with the organization of each document in the index file.

4.4 queryParser

queryParser contains the classes that parse query statements. lucene's query statements are similar to SQL statements: there are various reserved words that, according to a certain syntax, can be composed into all kinds of queries. lucene has many Query classes that inherit from Query and execute various special queries; QueryParser's job is to parse a query statement and call the various Query classes in order to produce the result.
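Returning to the three field configurations of section 3.3, the sketch below shows what they look like in code, using the same legacy Field.Store / Field.Index constants as the snippets in section 5. The class name FieldDemo and the field names are illustrative only.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldDemo {
        public static Document makeArticle(String bodyText) {
            Document doc = new Document();
            // title: indexed for full-text search and stored for display in results.
            doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
            // body: indexed but not stored, to keep the index small; read the
            // original file when the full text is needed.
            doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED));
            // last modified time: stored so it can be shown with results, but not indexed.
            doc.add(new Field("modified", "20060101", Field.Store.YES, Field.Index.NO));
            return doc;
        }
    }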
4.5 search

search contains the classes that search results out of the index, such as the various Query classes just mentioned, including TermQuery and BooleanQuery.

4.6 store

store contains the index storage classes. For example, Directory defines the storage structure of an index file, FSDirectory is an index stored in files, RAMDirectory is an index held in memory, and MMapDirectory is an index accessed through memory mapping.

4.7 util

util contains some common utility classes, such as converters between time and string.

5. How to create an index

5.1 The simplest code snippet that builds a complete index:

    IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

Let's analyze this code. First we create a writer, specifying "/data/index" as the directory where the index is stored and StandardAnalyzer as the analyzer to use; the third parameter means that if an index already exists in the index directory, we overwrite it. Then we create a new document. We add to the document a field named "title" with the content "lucene introduction", which is stored and indexed, and another field named "content" with the content "lucene works well", also stored and indexed. Then we add this document to the index; if there are multiple documents, we repeat the steps above to create and add each of them. After adding all documents we optimize the index; optimization mainly merges multiple segments into one, which helps search speed. Finally, it is important to close the writer. Yes, creating an index is that simple! Of course, you can modify the code above to get more personalized behavior.

5.2 Writing the index directly into memory

You need to first create a RAMDirectory and pass it to the writer. The code is as follows:

    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

5.3 Indexing text files

If you want to index plain text files without first reading them into a string to create a field, you can create the field with the following code:

    Field field = new Field("content", new FileReader(file));

Here file is the text file. This constructor actually reads the file's content and indexes it, but does not store it. (A fuller sketch of indexing a whole directory of files follows.)
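Building on section 5.3, here is a minimal sketch that indexes every file in a directory, storing each file's path so that search results can point back to the original file. The paths /data/txt and /data/index/ and the class name TextFileIndexer are placeholders, and the code assumes the same legacy IndexWriter API as section 5.1.

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class TextFileIndexer {
        public static void main(String[] args) throws Exception {
            File dataDir = new File("/data/txt");   // directory of plain text files
            IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
            File[] files = dataDir.listFiles();
            for (int i = 0; i < files.length; i++) {
                Document doc = new Document();
                // Store the path, un-tokenized, so results can identify the file.
                doc.add(new Field("path", files[i].getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                // Reader-based field: tokenized and indexed, but not stored (section 5.3).
                doc.add(new Field("content", new FileReader(files[i])));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
        }
    }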
6. How to maintain an index

Index maintenance is provided by the IndexReader class.

6.1 How to delete from an index

lucene provides two methods to delete a document from an index. One is void deleteDocument(int docNum), which deletes by the document's number within the index; every document gets a unique number when it is added, so deletion by number is precise, but the number is part of the index's internal structure and we generally do not know the number of a given document, so this method is of little use. The other is void deleteDocuments(Term term), which in effect first performs a search on the given term and then deletes all the results in one batch. We can give this method a strictly matching query condition in order to delete a specified document. An example:

    Directory dir = FSDirectory.getDirectory(PATH, false);
    IndexReader reader = IndexReader.open(dir);
    Term term = new Term(field, key);
    reader.deleteDocuments(term);
    reader.close();

6.2 How to update an index

lucene does not provide a dedicated method for updating an index. We need to delete the corresponding old document first and then add the new document to the index. Example:

    Directory dir = FSDirectory.getDirectory(PATH, false);
    IndexReader reader = IndexReader.open(dir);
    Term term = new Term("title", "lucene introduction");
    reader.deleteDocuments(term);
    reader.close();

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("content", "lucene is funny", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

Note that the writer is opened with the third parameter set to false, so the existing index is appended to rather than overwritten.

7. How to search

lucene's search capability is quite powerful. It provides many auxiliary Query classes; each inherits from Query and completes one special kind of query, and you can combine them like building blocks to perform complex operations. lucene also provides the Sort class for sorting results and the Filter class for restricting the query conditions. You may find yourself unconsciously comparing it with SQL: "can lucene do and, or, order by, where, like '%xx%'?" The answer is: "of course, no problem!" (A short sketch of Sort and Filter appears after section 7.1.5 below.)

7.1 Various Query operations

7.1.1 TermQuery

First, the most basic Query. If you want to execute a query like "documents containing 'lucene' in the content field", you can use TermQuery:

    Term t = new Term("content", "lucene");
    Query query = new TermQuery(t);

7.1.2 BooleanQuery

If you want a query like "documents containing java or perl in the content field", you can build two TermQuerys and connect them with a BooleanQuery:

    TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
    TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
    BooleanQuery booleanQuery = new BooleanQuery();
    booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
    booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);

7.1.3 WildcardQuery

If you want to query a word using wildcards, you can use WildcardQuery. The wildcards are '?', which matches one arbitrary character, and '*', which matches zero or more arbitrary characters. For example, searching for 'use*' may find 'useful' or 'useless':

    Query query = new WildcardQuery(new Term("content", "use*"));

7.1.4 PhraseQuery

You may be interested in Sino-Japanese relations and want to find articles in which "中" (China) and "日" (Japan) appear close together, within five characters of each other. You can:

    PhraseQuery query = new PhraseQuery();
    query.setSlop(5);
    query.add(new Term("content", "中"));
    query.add(new Term("content", "日"));

This may find "中日合作……" ("Sino-Japanese cooperation ...") and "中方和日方……" ("the Chinese side and the Japanese side ..."), but not a sentence in which the two characters are more than five characters apart.

7.1.5 PrefixQuery

If you want to search for words starting with "中", you can use PrefixQuery:

    PrefixQuery query = new PrefixQuery(new Term("content", "中"));
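Section 7 mentioned the Sort and Filter classes without showing them, so here is a minimal sketch, assuming an index with a "time" field as in section 7.1.7 and the legacy Hits-based search API used elsewhere in this article. RangeFilter plays the role of a "where" clause and Sort the role of "order by"; note that a field must be indexed (yielding a single term per document) to be sortable. The class name SortFilterDemo is illustrative only.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeFilter;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TermQuery;

    public class SortFilterDemo {
        public static void main(String[] args) throws Exception {
            IndexSearcher is = new IndexSearcher("/data/index/");
            Query query = new TermQuery(new Term("content", "lucene"));
            // Keep only documents whose time field lies in the closed interval.
            RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
            // Sort the surviving hits by the time field.
            Hits hits = is.search(query, filter, new Sort("time"));
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("title"));
            }
            is.close();
        }
    }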
7.1.6 FuzzyQuery

FuzzyQuery is used to search for similar terms, using the Levenshtein algorithm. Suppose you want to search for words similar to 'wuzza'; you can:

    Query query = new FuzzyQuery(new Term("content", "wuzza"));

You may get results such as 'fuzzy' and 'wuzzy'.

7.1.7 RangeQuery

Another commonly used Query is RangeQuery. You may want to search for documents in the time range 20060101 to 20060130; you can use RangeQuery:

    RangeQuery query = new RangeQuery(new Term("time", "20060101"), new Term("time", "20060130"), true);

The final true means a closed interval is used.

7.2 QueryParser

After reading about so many Queries, you may ask: "you are not going to make me combine the various Queries myself, are you? That would be far too much trouble!" Of course not. lucene provides a query statement similar to an SQL statement; let's call it the lucene statement. Through it you can express all kinds of queries in a single string, and lucene automatically splits it into pieces and hands them to the corresponding Query classes for execution. Here is how each kind of Query is written:

TermQuery uses the "field:key" form, for example "content:lucene".
In BooleanQuery, 'and' uses '+' and 'or' uses a space, for example "content:java content:perl".
WildcardQuery still uses '?' and '*', for example "content:use*".
PhraseQuery uses '~', for example content:"中日"~5.
PrefixQuery uses '*', for example "中*".
FuzzyQuery uses '~', for example "content:wuzza~".
RangeQuery uses '[]' or '{}': the former denotes a closed interval, the latter an open one, for example "time:[20060101 TO 20060130]". Note that TO is case sensitive.

You can combine these query strings to perform complex operations. For example, "the title or the body must include lucene, and the time must lie between 20060101 and 20060130" can be expressed as "+(title:lucene content:lucene) +time:[20060101 TO 20060130]". The code is as follows:

    Directory dir = FSDirectory.getDirectory(PATH, false);
    IndexSearcher is = new IndexSearcher(dir);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query query = parser.parse("+(title:lucene content:lucene) +time:[20060101 TO 20060130]");
    Hits hits = is.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("title"));
    }
    is.close();
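To tie the indexing of section 5 and the searching of section 7 together, here is a minimal end-to-end sketch that indexes two documents in a RAMDirectory and searches them through QueryParser. It assumes the legacy Lucene 2.x API used throughout this article; the class name EndToEndDemo and the sample titles are illustrative.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class EndToEndDemo {
        public static void main(String[] args) throws Exception {
            // Build a small in-memory index (section 5.2).
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            String[] titles = { "lucene introduction", "lucene works well" };
            for (int i = 0; i < titles.length; i++) {
                Document doc = new Document();
                doc.add(new Field("title", titles[i], Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();

            // Parse a lucene statement and run the search (section 7.2).
            IndexSearcher is = new IndexSearcher(dir);
            QueryParser parser = new QueryParser("title", new StandardAnalyzer());
            Query query = parser.parse("title:lucene");
            Hits hits = is.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("title"));
            }
            is.close();
        }
    }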

  
