Contents
- 1. Use Lucene to write indexes to memory
This is an original work and may be reprinted. When reprinting, please indicate the source of the article, the author information, and this statement in hyperlink form; otherwise legal liability will be pursued.
Author: Permanent reference_☆ Address: http://blog.csdn.net/chenghui0317/article/details/10052103
I. Lucene Introduction
Lucene is a framework for full-text search: an open-source full-text search engine library from Apache, implemented in Java. It is powerful and its API is simple. With full-text search you can search the content of an entire application system by keyword, which greatly improves the user experience. Creating, searching, and maintaining a Lucene index feels a bit like working with a database, so Lucene is quite convenient to use.
So why use Lucene? Without it, searching database table records by keyword means using LIKE to match the text character by character. Every such query condition has to be spelled out by the programmer, and the performance cost of querying the database this way is easy to imagine.
II. Lucene Execution Process
Lucene's operations are similar to those of a database. To use Lucene, you first create the "database" (the index) and then insert rows of data into its "data table". Once the data has been inserted, you can add, delete, modify, and query records in this "data table".
In general, it can be understood as follows:
1. Create an index directory, wrap the information to be searched in a Document object with matching fields, and put that object into the index directory. The index can be stored on disk or in memory. An in-memory index is lost when the program exits, so indexes are generally stored on disk;
2. If a piece of information turns out to be wrong and has to be deleted, its index entry must be deleted as well; otherwise queries will still return it. In this case, delete the entry based on its index ID;
3. If a piece of information is updated, its index entry must be updated as well. In this case, first delete the old entry and then add a new one;
4. Full-text search is the final and most important concern. As with querying a database, you first create an index reading object, wrap the keyword in a query object, and call the search() method to obtain the search result.
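The four steps above can be pictured with a toy inverted index written in plain Java. This is only a sketch under simplifying assumptions: the class `ToyIndex` and its methods are made up for illustration and are not Lucene's API, and text is split on whitespace only.

```java
import java.util.*;

/** Toy in-memory inverted index; a teaching model, not Lucene's API. */
class ToyIndex {
    // term -> IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document ID -> original text, so that results can be displayed
    private final Map<Integer, String> stored = new HashMap<>();

    /** Step 1: split the text into lowercase terms and record them. */
    public void add(int id, String text) {
        stored.put(id, text);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Step 2: remove a document by ID so that it can no longer be found. */
    public void delete(int id) {
        stored.remove(id);
        for (Set<Integer> ids : postings.values()) {
            ids.remove(id);
        }
    }

    /** Step 3: an update is a delete followed by an add. */
    public void update(int id, String text) {
        delete(id);
        add(id, text);
    }

    /** Step 4: look the keyword up directly instead of scanning every record. */
    public Set<Integer> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "I am learning Lucene");
        index.add(2, "Lucene is a search library");
        System.out.println(index.search("lucene")); // [1, 2]
        index.update(2, "Solr is a search server");
        System.out.println(index.search("lucene")); // [1]
        index.delete(1);
        System.out.println(index.search("lucene")); // []
    }
}
```

The point of the model is that a search is a single map lookup on the keyword, which is what makes an index cheaper than matching every record with LIKE.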
III. Prerequisites for Using Lucene
lucene-core-3.6.0.jar
lucene-highlighter-3.6.0.jar
lucene-memory-3.6.0.jar
Download: http://download.csdn.net/detail/ch656409110/5971413
IV. Practical Use of Lucene
1. Use Lucene to write indexes to memory
The implementation idea is as follows:
<1> create the in-memory directory object RAMDirectory and the index writer IndexWriter;
<2> use the index writer to store the specified data in the in-memory directory object;
<3> create an IndexSearcher index query object and wrap the keyword in a query object;
<4> call the search() method, which returns the query result as a TopDocs object, then iterate over it and display every Document object;
<5> close the IndexWriter and the RAMDirectory object.
The code is as follows:
package com.lucene.test;

import java.io.IOException;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

/**
 * Example of using Lucene to search an in-memory index.
 *
 * @author Administrator
 */
public class RamDirectoryDemo {

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        System.out.println("**********************");

        // Create an in-memory directory object, so the index generated here
        // is stored in memory rather than on disk.
        RAMDirectory directory = new RAMDirectory();

        /*
         * Create the index writer, which can write the index to disk or to memory.
         * Parameters of
         *   public IndexWriter(Directory d, Analyzer a, boolean create, MaxFieldLength mfl)
         * d:      the directory object; an FSDirectory (disk directory) can be passed instead
         * a:      the analyzer, which splits text into terms. It is one of the major
         *         features of Lucene search. new SimpleAnalyzer() is the simplest
         *         analyzer that Lucene provides
         * create: whether to create a new index; it must be true here
         * mfl:    the maximum number of terms indexed per field. Analyzers split text at
         *         different granularities, so a limit must be given;
         *         IndexWriter.MaxFieldLength.UNLIMITED means no limit
         */
        IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Create a Document object. An index in Lucene can be regarded as a database
        // table whose rows have fields; once content is added, searches match on fields.
        // Three fields are added to the document below: name, sex, dosomething.
        Document doc = new Document();

        /*
         * Parameters of public Field(String name, String value, Store store, Index index)
         * name:  field name
         * value: field value
         * store:
         *   Field.Store.YES: store the field value (the value before tokenization)
         *   Field.Store.NO: do not store it; storing is independent of indexing
         *   Field.Store.COMPRESS: compressed storage for long text or binary data, with
         *       poor performance (only in Lucene versions before 3.0)
         * index: how the field is indexed, for example whether it is tokenized
         *   Field.Index.ANALYZED: tokenize, then index
         *   Field.Index.ANALYZED_NO_NORMS: tokenize and index, but do not store norms,
         *       which saves one byte per field per document
         *   Field.Index.NOT_ANALYZED: index the whole value as one term, without tokenizing
         *   Field.Index.NOT_ANALYZED_NO_NORMS: as NOT_ANALYZED, but without norms
         */
        doc.add(new Field("name", "Chenghui", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("sex", "male", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("dosomething", "I am learning Lucene", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        // The writer can be closed early: once the data has been written to memory,
        // the directory no longer depends on the IndexWriter.
        writer.close();

        // Test the index right after it is written to memory. Once the application
        // exits, the in-memory index is gone and nothing could be retrieved.

        // Create an IndexSearcher, passing in the in-memory directory written above.
        IndexSearcher searcher = new IndexSearcher(directory);

        // Wrap the search keyword in a Term and then in a Query object.
        // "dosomething" is the field defined above and "lucene" is the keyword.
        // A TermQuery is not analyzed, so the keyword must equal an indexed term
        // exactly; SimpleAnalyzer lowercases terms, hence the lowercase keyword.
        Query query = new TermQuery(new Term("dosomething", "lucene"));
        // query = new TermQuery(new Term("sex", "male"));
        // query = new TermQuery(new Term("name", "cheng"));

        // Search the index directory. The result is a TopDocs object that holds the hits.
        TopDocs rs = searcher.search(query, null, 10);
        long endTime = System.currentTimeMillis();
        System.out.println("Took " + (endTime - startTime) + " ms in total, retrieved "
                + rs.totalHits + " record(s).");

        for (int i = 0; i < rs.scoreDocs.length; i++) {
            // rs.scoreDocs[i].doc is the document's ID inside the index, counted from 0.
            Document firstHit = searcher.doc(rs.scoreDocs[i].doc);
            System.out.println("name: " + firstHit.getField("name").stringValue());
            System.out.println("sex: " + firstHit.getField("sex").stringValue());
            System.out.println("dosomething: " + firstHit.getField("dosomething").stringValue());
        }

        // The writer was already closed above; release the searcher and the directory.
        searcher.close();
        directory.close();
        System.out.println("**********************");
    }
}
The running result is as follows:
We can see from the above that the query based on the "lucene" keyword succeeds, and that each result is returned wrapped in a Document object.
In addition, if the create parameter is set to false when the IndexWriter is created, an error is raised. With create set to false the writer opens the directory in "existing index" mode, and since nothing has been written yet the index files cannot be found, as shown below:
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.RAMDirectory@156ee8e lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@47bfactory: files: []
The query on the dosomething field succeeds. Next, check whether queries on the sex field and the name field work in the same way.
<1> Uncomment query = new TermQuery(new Term("sex", "male")) and run the query:
The result shows no records at all. The sex field was indexed as Field.Index.NOT_ANALYZED, which means its value is not tokenized: the whole value is indexed as one single term, so a term query can match only if the keyword is identical to that value, character for character.
<2> Change the sex field to Field.Index.ANALYZED and query again:
No records are found. Why?
This is because SimpleAnalyzer is not very intelligent. It splits text only at non-letter characters such as spaces and does no real word segmentation. In short, if the search keyword is the Chinese word for "China", it matches only where that word stands alone as a term in the corresponding field of the index; text that merely contains one of its characters is not matched.
Remember that the word must be delimited by spaces. Likewise, if the indexed data is the sentence "I am Chinese" written in Chinese without spaces, the keyword "China" cannot be matched, because SimpleAnalyzer only produces terms that are separated by spaces or other non-letter characters. For the match to succeed, the indexed data must be written with spaces around the word; then it will be retrieved.
Therefore, for the keyword "male" to match, the value must be indexed so that "male" stands alone as a space-delimited term.
Since Chinese behaves this way, let's see whether English letters behave the same.
Sure enough:
<1> If the name field is indexed as "chenghui" and the keyword is "cheng", nothing is found;
<2> If the name field is indexed as "cheng hui" and the keyword is "cheng", it is found;
<3> If the name field is indexed as "Cheng hui" and the keyword is "Cheng", nothing is found;
<4> If the name field is indexed as "cheng hui" and the keyword is "Cheng", nothing is found;
<5> If the name field is indexed as "Cheng hui" and the keyword is "cheng", it is found.
It can be seen that text is converted to lowercase when it is indexed, but the keyword is not converted to lowercase for matching, so matching is case-sensitive on the keyword side: a keyword containing uppercase letters never matches.
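The case behaviour can be reproduced with a small stand-alone model. The sketch below assumes that SimpleAnalyzer splits text at every non-letter character and lowercases each token, and it models a term query as an exact lookup with no analysis of the keyword. The class `SimpleTokenizerModel` and its methods are made up for illustration; they are not Lucene classes.

```java
import java.util.*;

/** Stand-alone model of SimpleAnalyzer-style tokenization; not Lucene's API. */
class SimpleTokenizerModel {
    /** Split at every non-letter character and lowercase each token. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Exact term lookup: the keyword is compared as-is, without analysis. */
    public static boolean termQuery(String indexedText, String keyword) {
        return tokenize(indexedText).contains(keyword);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("chenghui"));            // [chenghui]
        System.out.println(tokenize("Cheng hui"));           // [cheng, hui]
        System.out.println(termQuery("chenghui", "cheng"));  // false: no such token
        System.out.println(termQuery("Cheng hui", "cheng")); // true
        System.out.println(termQuery("cheng hui", "Cheng")); // false: keyword keeps its case
        // Lowercasing the keyword first makes the lookup case-insensitive.
        System.out.println(termQuery("cheng hui", "Cheng".toLowerCase())); // true
    }
}
```

In this model the simplest remedy is to lowercase the keyword before looking it up, so that the keyword receives the same treatment as the indexed text.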
So English letters and numbers are handled no more conveniently than Chinese: letters are even case-sensitive at search time, and numbers have similar problems. Evidently Lucene's search is not perfect out of the box.
Is there a solution? Yes. The preceding example uses SimpleAnalyzer as the analyzer, which only splits text at delimiters. Now we introduce another analyzer, StandardAnalyzer. For Chinese, this standard analyzer achieves the same effect as ChineseAnalyzer.
Replace the analyzer object passed in the code above with new StandardAnalyzer(Version.LUCENE_36) and try again. The results show that StandardAnalyzer is an improvement on SimpleAnalyzer, but unfortunately only for Chinese, for example:
<1> If the sex field is indexed with Chinese text that contains the character for "male", searching for that character finds it.
However, this still falls far short of what full-text search requires, so a more advanced analyzer is needed to implement this function.
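One way to picture why this helps with Chinese and yet still falls short is the sketch below. It assumes that the analyzer emits every Chinese character as a separate term, and it models that assumption in plain Java without any Lucene classes. `HanUnigramModel` is a hypothetical name, and the handling of letters and digits is deliberately simplified.

```java
import java.util.*;

/** Model of per-character tokenization for Chinese text; not Lucene's API. */
class HanUnigramModel {
    /** Each Han character becomes its own token; other letters and digits form lowercase words. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // A Han character or a delimiter ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "I am Chinese" written as five Han characters, given as Unicode escapes.
        String sentence = "\u6211\u662f\u4e2d\u56fd\u4eba";
        List<String> tokens = tokenize(sentence);
        System.out.println(tokens.size());                   // 5
        // A single character is a token, so an exact term lookup finds it.
        System.out.println(tokens.contains("\u4e2d"));       // true
        // The two-character word for "China" is not a single token.
        System.out.println(tokens.contains("\u4e2d\u56fd")); // false
    }
}
```

In this model a single character can be found, but a multi-character word cannot be matched as one term, which is why a more capable analyzer is still needed.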