Lucene
I had heard of Lucene ever since I first learned about Hadoop, but I never took the time to study it properly. I recently had some free time and decided to make up for that. I am not trying to master every detail; I just want a working understanding of it.
I will not repeat the usual overview descriptions of Lucene here. In short, it lets you quickly build and query indexes, and it is a well-designed framework.
Lucene is easy to use: download the Lucene distribution online and import it into your project. I used the Lucene 4.2 package. After downloading it, unzip it and you will see the jars laid out by module in separate directories.
For a simple Lucene program, you only need lucene-core.jar (under core/) and lucene-analyzers-common.jar (under analysis/common/).
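If you use Maven rather than copying jars by hand, the same two libraries (plus Apache Commons IO, which the indexing code below uses for FileUtils) can be declared as dependencies — a sketch assuming version 4.2.0:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.2.0</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>
```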
Lucene is well designed: advanced users can customize it in great detail, yet it is also perfectly usable for beginners like me. Here is my example.
```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Index all txt files in a folder (excluding subfolders) and store the index
 * in that folder's "index" subdirectory.
 *
 * @param rootPath the folder containing the txt files
 */
public static void buildIndex(String rootPath) throws IOException {
    // Directory in which the index files will be stored
    Directory indexDir = FSDirectory.open(new File(rootPath + "/index"));
    File dataDir = new File(rootPath);
    File[] dataFiles = dataDir.listFiles();
    // Construct the analyzer
    Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_42);
    // Use the analyzer to construct the index writer
    IndexWriter indexWriter = new IndexWriter(indexDir,
            new IndexWriterConfig(Version.LUCENE_42, luceneAnalyzer));
    indexWriter.deleteAll();
    // Build the index
    for (int i = 0; i < dataFiles.length; i++) {
        if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
            Document document = new Document();
            // Use the file name as the content of the "path" field. Field.Store.YES
            // means it is stored in the index, so it can be read back directly
            // from a query result.
            document.add(new StringField("path", dataFiles[i].getCanonicalPath(),
                    Field.Store.YES));
            // Use the file content as the "contents" field. Field.Store.NO means
            // the text itself is not stored: the field can be searched by keyword,
            // but the original content cannot be read back from the index.
            // FileUtils comes from Apache Commons IO.
            document.add(new TextField("contents",
                    FileUtils.readFileToString(dataFiles[i], "GBK"), Field.Store.NO));
            indexWriter.addDocument(document);
        }
    }
    // close() commits and closes in a single step
    indexWriter.close();
}

/**
 * Search the index.
 *
 * @param indexDir        location of the index files
 * @param keyword         keyword to search for
 * @param numberOfRecords maximum number of records to return
 */
public static List<String> search(String indexDir, String keyword, int numberOfRecords)
        throws IOException {
    // Open the index
    Directory indexDirectory = FSDirectory.open(new File(indexDir));
    DirectoryReader ireader = DirectoryReader.open(indexDirectory);
    // Construct the searcher
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Construct the query
    Query query = new TermQuery(new Term("contents", keyword));
    // Run the query
    ScoreDoc[] hits = isearcher.search(query, null, numberOfRecords).scoreDocs;
    List<String> result = new ArrayList<String>();
    // Collect the stored "path" field of every hit
    for (ScoreDoc scoreDoc : hits) {
        result.add(isearcher.doc(scoreDoc.doc).get("path"));
    }
    ireader.close();
    return result;
}
```
After completing the code above, you can index and query all the files in a folder. In my tests, indexing English files works quite well.
Here, I will explain my understanding of Lucene's principles. The indexing process of Lucene is as follows:
1. Perform word segmentation on the content to be indexed to obtain keywords.
2. Process the keywords. This includes filtering out stop words such as "and", "or", and "to"; case normalization, so that a query for "computer" also matches "Computer" and "COMPUTER"; and stemming, so that a query for "want" also matches "wants" and "wanted".
3. Build the index from the keywords, that is, record which documents each keyword appears in.
That is the general process; I have not yet looked at the finer-grained optimizations. I plan to go over them again when I need to use Lucene in more depth.
After the above steps are completed, you can perform the query.
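The steps above can be sketched in plain Java. This toy has nothing to do with Lucene's real data structures (and it skips stemming for brevity); it only shows the idea of an inverted index: tokenize, normalize case, drop stop words, and record which documents each remaining term appears in.

```java
import java.util.*;

public class ToyInvertedIndex {
    // A tiny stop-word list for illustration only
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "or", "to", "the", "a"));

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void addDocument(int docId, String text) {
        // Step 1: split the text into candidate keywords
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue;
            // Step 2: lowercase so "Computer" and "COMPUTER" match, drop stop words
            String term = token.toLowerCase();
            if (STOP_WORDS.contains(term)) continue;
            // Step 3: record that this term appeared in this document
            Set<Integer> postings = index.get(term);
            if (postings == null) {
                postings = new TreeSet<Integer>();
                index.put(term, postings);
            }
            postings.add(docId);
        }
    }

    public Set<Integer> search(String keyword) {
        Set<Integer> postings = index.get(keyword.toLowerCase());
        return postings == null ? Collections.<Integer>emptySet() : postings;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.addDocument(1, "The computer wants to reboot");
        idx.addDocument(2, "A COMPUTER and a keyboard");
        System.out.println(idx.search("computer")); // prints [1, 2]
        System.out.println(idx.search("and"));      // prints [] (stop word)
    }
}
```

A query is then just a lookup in the map, which is why searching an index is so much faster than scanning every file.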
If you were reading carefully, you will have noticed the emphasis above on how well indexing English files works. The implication is that indexing Chinese files works rather badly. The reason is that English is easy to tokenize: text can simply be split on spaces and punctuation. Chinese word segmentation is much harder, because words are written without separators and the same character sequence can often be split in more than one valid way. So if you need to index Chinese files, you need a dedicated word segmentation package. Two good choices are SmartCN, which ships with Lucene, and the third-party IKAnalyzer. Both are quite simple to use:
To use SmartCN, add the SmartCN jar (under analysis/smartcn in the distribution) and change the line of the code above that constructs the StandardAnalyzer to:

```java
Analyzer luceneAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_42);
```

That is all it takes.
To use IKAnalyzer, add the IKAnalyzer jar (you will need to find it online yourself) and change the line that constructs the analyzer to:

```java
Analyzer luceneAnalyzer = new IKAnalyzer();
```

That is all it takes.
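The underlying difficulty is easy to demonstrate without any Lucene code at all: splitting on whitespace tokenizes an English sentence cleanly, but leaves a Chinese sentence (which has no spaces between words) as one single lump. The sentences below are my own examples.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // English: spaces mark word boundaries, so a regex split is enough
        String english = "Lucene builds inverted indexes";
        System.out.println(english.split("\\s+").length); // prints 4

        // Chinese: no spaces between words, so the same split finds one "token"
        String chinese = "我喜欢信息检索"; // "I like information retrieval"
        System.out.println(chinese.split("\\s+").length); // prints 1
    }
}
```

This is exactly the gap that SmartCN and IKAnalyzer fill: they use dictionaries and statistics to find the word boundaries that the text itself does not mark.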
Once again, thanks to object orientation, the great invention that makes swapping an analyzer a one-line change!
That is my current understanding of Lucene. If I end up using it in a real project, I will study it in more depth.
The code in this article has been shared on Oschina's Git: http://git.oschina.net/xdxn/Test
References:
Comparison of the segmentation quality of the open-source Chinese word segmenters smartcn and IKAnalyzer
How Lucene works