[Reproduced] A Preliminary Study of Apache Lucene


Reprinted from http://www.cnblogs.com/xing901022/p/3933675.html

Before we start, let me share some background information.

  First of all, when learning any open source technology, new or old, a quick web search is the simplest way to get a rough idea of the concepts and design. Here I contribute a very good introductory PPT, which I have converted to PDF for easy searching.

  Secondly, before writing your first program, it is best to consult the official documentation. At the time of writing, Lucene has been updated to version 4.9, which requires JDK 1.7 or higher, so if you are still on JDK 1.6 or even 1.5, please use an older release. Since I use JDK 1.6, the examples here use Lucene 4.0.

This is the official documentation for Lucene 4.0: http://lucene.apache.org/core/4_0_0/core/overview-summary.html

I greatly admire Lucene's open source contributors. If you read Lucene in Action, you will learn that the author originally intended to sell the software commercially, but in the end contributed it to Apache. But that is off topic.

  Finally, a reminder for those learning Lucene: this project releases new versions quickly, and the API style differs between versions, so code found in older posts may well not work with 4.0 or even 3.6.

For example, in earlier versions an IndexWriter was constructed directly from the directory and analyzer, in the Lucene 3.x style:

IndexWriter indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

But in 4.0, we need to create a config object to hold the configuration:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);

Therefore, be sure to follow the programming style of the official documentation for your version when writing code.

  Finally, the files downloaded from the official website have been uploaded to a Baidu network disk; feel free to download them.

  

These are the five most commonly used jar files:

The first and most important, lucene-core-4.0.0.jar, contains the core code for documents, indexing, searching, storage, and so on.

The second, lucene-analyzers-common-4.0.0.jar, contains the lexical analyzers for various languages, used to tokenize and extract the contents of a file.

The third, lucene-highlighter-4.0.0.jar, is mainly used for highlighting matches in search results.

The fourth and fifth, lucene-queryparser-4.0.0.jar and lucene-queries-4.0.0.jar, provide code for the various kinds of searches, such as fuzzy search, range search, and so on.

  

Enough of the preamble. Here is a brief explanation of what full-text search is.

  

For example, suppose we have a folder, or a disk, containing many files: Notepad files, Word documents, Excel spreadsheets, PDFs. We want to find files by keyword: if we enter "Lucene", all files containing "Lucene" should be found. This is called full-text search.
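The most naive way to do this is to scan every file for the keyword each time a search is issued. A minimal sketch in plain Java (a toy illustration of the idea; the class and method names are mine, and it only handles plain-text files):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class NaiveSearch {
    // Return the paths of all regular files directly under `dir` whose
    // text contains `keyword`. Every file is re-read on every query,
    // which is exactly why real engines build an index instead.
    public static List<Path> search(Path dir, String keyword) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)
                        && new String(Files.readAllBytes(file)).contains(keyword)) {
                    matches.add(file);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("naive-search");
        Files.write(dir.resolve("a.txt"), "Lucene is a search library".getBytes());
        Files.write(dir.resolve("b.txt"), "nothing to see here".getBytes());
        System.out.println(search(dir, "Lucene").size()); // prints 1
    }
}
```

This works, but the cost grows with the total size of all files, not with the size of the result.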

It is therefore natural to think of building a mapping from keywords to files. A diagram borrowed from the PPT shows very clearly how this mapping is implemented.

In Lucene, this mapping is implemented with an "inverted index".
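The idea behind an inverted index can be sketched in a few lines of plain Java: map each term to the set of documents that contain it, so that a query becomes a single map lookup. This is a toy illustration only, not Lucene's actual data structures, and the class name is mine:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndex {
    // term -> the sorted set of document ids whose text contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a document very crudely and record each term's posting.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A query is a single map lookup: no document is re-scanned.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "Lucene in Action");
        idx.add(2, "Lucene for Dummies");
        idx.add(3, "Managing Gigabytes");
        System.out.println(idx.search("lucene")); // prints [1, 2]
    }
}
```

Lucene's real inverted index additionally stores term frequencies, positions, and so on, and keeps the postings on disk in a compressed form.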

  

With this mapping in mind, let's look at the design of Lucene's architecture.

The following diagram is from Lucene's documentation, and it summarizes the essence of the library well.

We can see that using Lucene mainly involves two steps:

  1. Create an index: an IndexWriter indexes the various files and saves the index to the index storage location.

2. Search the index for documents matching the given keywords.

Below is the example from the official website, which we will analyze step by step:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
// Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();

  

Index creation:

  First, we need to define a lexical analyzer.

Take a sentence such as "I love our China!". How should it be split? After removing the stop words, we extract the keywords "I", "our", "China", and so on. This is accomplished with the help of a lexical analyzer (Analyzer). Here the standard analyzer is used; for Chinese specifically, you could also use an analyzer such as Paoding.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

The parameter Version.LUCENE_CURRENT means "use the current Lucene version"; in this context it could also be written as Version.LUCENE_40.
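To get a feel for what the analyzer does to a sentence, here is a greatly simplified imitation in plain Java: lowercase the text, split it into terms, and drop stop words. The real StandardAnalyzer is far more sophisticated, and the stop-word list below is made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {
    // A tiny stop-word list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "is", "be", "this", "of"));

    // Lowercase, split on non-letters, and drop stop words.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Only the meaningful terms survive tokenization.
        System.out.println(tokenize("This is the text to be indexed.")); // prints [text, indexed]
    }
}
```

The terms produced by this step are exactly what ends up as keys in the inverted index.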

  

  The second step is to decide where the index files will be stored. Lucene gives us two options:

1. Local file storage:

Directory directory = FSDirectory.open(new File("/tmp/testindex"));

2. In-memory storage:

Directory directory = new RAMDirectory();

Choose whichever suits your needs.

   

 The third step is to create the IndexWriter and write the index file:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);

According to the official documentation, IndexWriterConfig holds the configuration for the IndexWriter. It takes two parameters: the first is the version, and the second is the lexical analyzer.

  

  The fourth step is to extract the content and store it in the index:

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

The first line creates a Document object, which is similar to a table in a database.

The second line is the string we are about to index.

The third line stores the string in the index under the field name "fieldname" (because TextField.TYPE_STORED is specified; if you do not want to store the value, other options are available, see the official documentation for details).

The fourth line adds the Document to the index.

The fifth line closes the IndexWriter and commits what was created.

  

This is the process of index creation.

Keyword query:

  The first step is to open the index storage location:

DirectoryReader ireader = DirectoryReader.open(directory);

The second step is to create the searcher:

IndexSearcher isearcher = new IndexSearcher(ireader);

The third step is to run the keyword query, somewhat like SQL:

QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}

Here we create a query and set its lexical analyzer; the query's "table name" is "fieldname". The query returns a collection similar to an SQL ResultSet, from which we can extract the stored content.

For the various other query types, refer to the official manual or the recommended PPT.

 The fourth step is to close the reader, the directory, and so on:

ireader.close();
directory.close();

Finally, I wrote a simple example that creates an index for the contents of a folder, filters the files by keyword, and reads their contents.

Creating the index:

  

/**
 * Create an index of the given file directory.
 * @param path the directory to index
 * @return whether creation succeeded
 */
public static boolean createIndex(String path) {
    Date date1 = new Date();
    List<File> fileList = getFileList(path);
    for (File file : fileList) {
        content = "";
        // Get the file suffix
        String type = file.getName().substring(file.getName().lastIndexOf(".") + 1);
        if ("txt".equalsIgnoreCase(type)) {
            content += txt2String(file);
        } else if ("doc".equalsIgnoreCase(type)) {
            content += doc2String(file);
        } else if ("xls".equalsIgnoreCase(type)) {
            content += xls2String(file);
        }
        System.out.println("name: " + file.getName());
        System.out.println("path: " + file.getPath());
        // System.out.println("content: " + content);
        System.out.println();
        try {
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));
            File indexFile = new File(INDEX_DIR);
            if (!indexFile.exists()) {
                indexFile.mkdirs();
            }
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);
            Document document = new Document();
            document.add(new TextField("filename", file.getName(), Store.YES));
            document.add(new TextField("content", content, Store.YES));
            document.add(new TextField("path", file.getPath(), Store.YES));
            indexWriter.addDocument(document);
            indexWriter.commit();
            closeWriter();
        } catch (Exception e) {
            e.printStackTrace();
        }
        content = "";
    }
    Date date2 = new Date();
    System.out.println("Index creation took: " + (date2.getTime() - date1.getTime()) + " ms\n");
    return true;
}
The query:
/**
 * Search the index and print the files matching the query string.
 * @param text the query string
 */
public static void searchIndex(String text) {
    Date date1 = new Date();
    try {
        directory = FSDirectory.open(new File(INDEX_DIR));
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);
        Query query = parser.parse(text);
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println("____________________________");
            System.out.println(hitDoc.get("filename"));
            System.out.println(hitDoc.get("content"));
            System.out.println(hitDoc.get("path"));
            System.out.println("____________________________");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    Date date2 = new Date();
    System.out.println("Search took: " + (date2.getTime() - date1.getTime()) + " ms\n");
}
Running results:

All files containing the keyword "man" are picked out.

  

  

References:

A comprehensive guide to reading text files in Java: http://blog.csdn.net/csh624366188/article/details/6785817

Lucene official documentation: http://lucene.apache.org/core/4_0_0/core/overview-summary.html

  

