Before we start, let me share some resources.
First of all, when learning any open source technology, new or old, the simplest way to begin is to search Baidu a little and get a rough idea of what it does and the thinking behind it. Here I am sharing a very good introductory PPT. I have converted it to PDF so that it is easier to search.
Secondly, before programming with it for the first time, I recommend checking the official documentation. From what I found on Baidu, Lucene has currently been updated to version 4.9, and that version requires JDK 1.7 or later, so if you are still on 1.6 or even 1.5, please use a lower version. I am on 1.6, so I use Lucene 4.0 here.
This is the official documentation for Lucene 4.0: http://lucene.apache.org/core/4_0_0/core/overview-summary.html
I really admire Lucene's open source contributors. If you read Lucene in Action, you will see that the author originally wanted to write the software to make money and in the end contributed it to Apache. But that is off-topic.
Finally, a reminder for everyone learning Lucene: this open source software is updated fairly quickly, and the programming style differs between versions, so code from a post you found on Baidu may not work with 4.0 or 3.6.
For example, in earlier versions an IndexWriter was created like this:
IndexWriter IndexWriter =
In 4.0, however, we need to put the configuration into an IndexWriterConfig object first:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
So be sure to follow the programming style of the official documentation for your version when you write code.
Finally, I have uploaded the files downloaded from the official website to Baidu network disk; you are welcome to download them.
These are the five most commonly used files:
The first, and most important, is lucene-core-4.0.0.jar, which contains the core code for documents, indexing, searching, storage and so on.
The second, lucene-analyzers-common-4.0.0.jar, contains lexical analyzers for various languages, which are used to split the contents of a file into terms and extract them.
The third, lucene-highlighter-4.0.0.jar, is used mainly to highlight the matched content in search results.
The fourth and fifth, including lucene-queryparser-4.0.0.jar, provide the query-related code for the various kinds of search, such as fuzzy search, range search, and so on.
Enough chatter; let's briefly explain what full-text search is.
For example, suppose we have a folder, or a disk, with a lot of files in it (Notepad text files, Word documents, Excel, PDF), and we want to find files by the keywords they contain. If we enter "Lucene", all the files containing "Lucene" are returned. This is called full-text search.
So it is natural to think that we should build a mapping between keywords and files. I borrowed a picture from the PPT that shows very clearly how this mapping is implemented.
In Lucene, this mapping is implemented with the "inverted index" technique.
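To make the keyword-to-file mapping concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the idea, not how Lucene stores its index internally; the class name InvertedIndexSketch, its methods and the sample file names are all made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's actual data structure.
class InvertedIndexSketch {
    // keyword -> names of the files that contain it
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Split the text on anything that is not a letter or digit and record each word.
    void addFile(String fileName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> files = postings.get(word);
            if (files == null) {
                files = new TreeSet<String>();
                postings.put(word, files);
            }
            files.add(fileName);
        }
    }

    // Look up a keyword; returns an empty set when no file contains it.
    Set<String> search(String keyword) {
        Set<String> files = postings.get(keyword.toLowerCase(Locale.ROOT));
        return files == null ? Collections.<String>emptySet() : files;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addFile("a.txt", "Lucene is a full-text search library");
        index.addFile("b.txt", "Notes about Excel and PDF files");
        index.addFile("c.txt", "Searching files with Lucene");
        System.out.println(index.search("Lucene")); // prints [a.txt, c.txt]
        System.out.println(index.search("files"));  // prints [b.txt, c.txt]
    }
}
```

The point of the structure is that a search never has to open the files again: it is a single lookup in the map, and the cost of reading every file is paid once, when the index is built.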
With this mapping in mind, let's take a look at the design of Lucene's architecture.
The following is a picture from the Lucene material, and it sums up the essence of the design.
We can see that using Lucene comes down to two steps:
1. Create the index: IndexWriter builds an index over the different files and saves it in the location where the index files are stored.
2. Search: use the index to find the documents related to a keyword.
Below is an example from the official website, which we will analyze:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
// Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
Creating the index
First, we need to define a lexical analyzer.
Take a sentence such as "I love our China!": how should it be split, how are the stop words thrown away, and how are keywords such as "I", "we" and "China" extracted? This is done with the help of the lexical analyzer, Analyzer. Here we use the standard analyzer; for Chinese specifically, you can also use Paoding.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
The Version.LUCENE_CURRENT parameter means that the current Lucene version is used; in this context it can also be written as Version.LUCENE_40.
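To show roughly what "splitting a sentence and dropping stop words" means, here is a small sketch in plain Java. It is not StandardAnalyzer and makes no claim to reproduce its rules; the class name SimpleAnalyzerSketch and the short stop-word list are assumptions chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: this is NOT StandardAnalyzer, just the general idea.
class SimpleAnalyzerSketch {
    // Hypothetical stop-word list picked for this example.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "this", "is", "to", "be"));

    // Lower-case the sentence, split it into tokens, and drop the stop words.
    static List<String> analyze(String sentence) {
        List<String> terms = new ArrayList<String>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the text to be indexed."));
        // prints [text, indexed]
    }
}
```

The terms that survive this step are what ends up as keywords in the index, which is why the same analyzer is passed both to the IndexWriterConfig and to the QueryParser in the official example: the query has to be cut up the same way the documents were.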
The second step is to decide where the index files are stored. Lucene gives us two options:
1. Local file storage
Directory directory = FSDirectory.open(new File("/tmp/testindex"));
2. In-memory storage
Directory directory = new RAMDirectory();
Choose whichever fits your needs.
The third step is to create the IndexWriter, which writes the index files.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
According to the official documentation, IndexWriterConfig holds the configuration for the IndexWriter. Its constructor takes two parameters: the first is the version, and the second is the lexical analyzer.
The fourth step is to extract the content and store it in the index.
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
The first line creates a Document object, which is similar to a row in a database table.
The second line is the string we are about to index.
The third line stores the string (because TextField.TYPE_STORED is set; if you do not want to store it, you can use a different parameter, see the official documentation for details), and the "column name" it is stored under is "fieldname".
The fourth line adds the doc object to the index.
The fifth line closes the IndexWriter, which commits what was written.
That is the whole process of creating an index.
Keyword query:
The first step is to open the storage location:
DirectoryReader ireader = DirectoryReader.open(directory);
The second step is to create the searcher:
IndexSearcher isearcher = new IndexSearcher(ireader);
The third step is to run the keyword query, much as you would in SQL:
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
Here we create a query parser and set its lexical analyzer; the "column" it searches by default is "fieldname". The query returns a collection of hits, similar to an SQL ResultSet, from which we can fetch each document and read the content stored in it.
For the many other kinds of queries, refer to the official manual or the recommended PPT.
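As a rough picture of what a list of hits is, the sketch below ranks a few in-memory strings by how often a keyword occurs and returns at most a given number of results, best first. This is an assumption-laden toy: Lucene's real scoring is more elaborate than a raw count, and the names TopHitsSketch, Hit, count and search are hypothetical and belong only to this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only: ranks documents by a raw keyword count, not by Lucene's scoring.
class TopHitsSketch {
    // Hypothetical stand-in for a search hit: a document number plus a score.
    static final class Hit {
        final int doc;
        final int score;

        Hit(int doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Count whole-word occurrences of the keyword, ignoring case.
    static int count(String text, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        int n = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                n++;
            }
        }
        return n;
    }

    // Return at most maxHits matching documents, highest score first.
    static List<Hit> search(List<String> docs, String keyword, int maxHits) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int i = 0; i < docs.size(); i++) {
            int score = count(docs.get(i), keyword);
            if (score > 0) {
                hits.add(new Hit(i, score));
            }
        }
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return b.score - a.score;
            }
        });
        if (hits.size() > maxHits) {
            return new ArrayList<Hit>(hits.subList(0, maxHits));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is the text to be indexed.",
                "No match here.",
                "Text, text and more text.");
        for (Hit hit : search(docs, "text", 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
        // prints doc=2 score=3, then doc=0 score=1
    }
}
```

As in the loop over hits in the official example, a hit here carries only a document number; the content itself is fetched afterwards by that number.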
The fourth step is to close the reader and the directory.
ireader.close();
directory.close();
Finally, I wrote a simple example that builds an index over the contents of a folder, filters the files by keyword, and reads out their contents.
To create an index:
/**
 * Create an index for the given directory.
 * @param path the directory to index
 * @return whether the index was created
 */
public static boolean createIndex(String path) {
    Date date1 = new Date();
    List<File> fileList = getFileList(path);
    for (File file : fileList) {
        content = "";
        // Get the file suffix
        String type = file.getName().substring(file.getName().lastIndexOf(".") + 1);
        if ("txt".equalsIgnoreCase(type)) {
            content += txt2String(file);
        } else if ("doc".equalsIgnoreCase(type)) {
            content += doc2String(file);
        } else if ("xls".equalsIgnoreCase(type)) {
            content += xls2String(file);
        }
        System.out.println("name: " + file.getName());
        System.out.println("path: " + file.getPath());
        // System.out.println("content: " + content);
        System.out.println();
        try {
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));
            File indexFile = new File(INDEX_DIR);
            if (!indexFile.exists()) {
                indexFile.mkdirs();
            }
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);
            // One document per file, with three stored fields
            Document document = new Document();
            document.add(new TextField("filename", file.getName(), Store.YES));
            document.add(new TextField("content", content, Store.YES));
            document.add(new TextField("path", file.getPath(), Store.YES));
            indexWriter.addDocument(document);
            indexWriter.commit();
            closeWriter();
        } catch (Exception e) {
            e.printStackTrace();
        }
        content = "";
    }
    Date date2 = new Date();
    System.out.println("Create index ----- time: " + (date2.getTime() - date1.getTime()) + "ms\n");
    return true;
}
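One detail in createIndex is worth a closer look: the suffix is taken with substring(lastIndexOf(".") + 1), and for a file name that has no dot, lastIndexOf returns -1, so the whole name comes back as the "type". The following is a small sketch of a more defensive helper; the class FileTypeSketch and the method extensionOf are hypothetical names that are not part of the original example.

```java
import java.util.Locale;

// Illustrative helper, not part of the original example.
class FileTypeSketch {
    // Returns the lower-case extension, or "" when the name has none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, a leading dot only (".gitignore"), or a trailing dot: no extension.
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.TXT"));     // prints txt
        System.out.println(extensionOf("archive.tar.gz")); // prints gz
        System.out.println("[" + extensionOf("README") + "]"); // prints []
    }
}
```

With a helper like this, the three branches could compare against "txt", "doc" and "xls" directly, and files without a suffix would simply be indexed with empty content.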
To query:
/**
 * Search the index and print the files that match.
 * @param text the string to look up
 */
public static void searchIndex(String text) {
    Date date1 = new Date();
    try {
        directory = FSDirectory.open(new File(INDEX_DIR));
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);
        Query query = parser.parse(text);
        // Return at most 1000 hits, the same limit as in the official example
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println("____________________________");
            System.out.println(hitDoc.get("filename"));
            System.out.println(hitDoc.get("content"));
            System.out.println(hitDoc.get("path"));
            System.out.println("____________________________");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    Date date2 = new Date();
    System.out.println("View index ----- time: " + (date2.getTime() - date1.getTime()) + "ms\n");
}
Result of running it: all files that contain the keyword "man" are found.
java--Full-Text Search framework--lucene