Sharing some insights about Lucene

Source: Internet
Author: User

1 Lucene overview

Lucene is a full-text search framework, not an application product. It does not work out of the box like http://www.baidu.com/ or Google Desktop; it provides the tools with which you can build such products.

What Lucene can do

At its core Lucene does one simple thing: you give it a number of strings, and it provides a full-text search service that tells you where the keywords you search for appear. Once you understand this, you can apply it to anything that fits the pattern. You can index the news articles on your site and build an archive; you can index several fields of a database table so that you no longer have to worry about "%like%" queries locking the table; you can also write your own search engine ...

2 How Lucene works

The services Lucene provides consist of two parts: one in, one out. "In" means writing: the source you provide (essentially a string) is written to the index or deleted from it. "Out" means reading: Lucene provides a full-text search service so that users can locate sources by keyword.
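The "one in, one out" idea can be sketched with a toy inverted index in plain Java. This is only an illustration of the concept, not Lucene's implementation; the class name TinyInvertedIndex and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: maps each word to the IDs of the sources that contain it. */
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** "In": split the source into lower-cased words and record its ID under each word. */
    void add(int sourceId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(sourceId);
            }
        }
    }

    /** "In" also covers deletion: drop the source ID from every posting list. */
    void delete(int sourceId) {
        for (SortedSet<Integer> ids : postings.values()) {
            ids.remove(sourceId);
        }
    }

    /** "Out": return the IDs of all sources that contain the keyword. */
    SortedSet<Integer> search(String keyword) {
        SortedSet<Integer> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
        return ids == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a full-text search framework");
        index.add(2, "A database LIKE query scans every row");
        index.add(3, "Search engines build on an index");
        System.out.println(index.search("search")); // IDs of sources 1 and 3
        index.delete(1);
        System.out.println(index.search("search")); // only source 3 remains
    }
}
```

Note that the search returns source IDs rather than the sources themselves, which is also how a Lucene search behaves.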

Write process

1. The source string is first processed by the analyzer: it is tokenized (split into words) and, optionally, stop words are removed.

2. The required information from the source is added to the fields of a Document; fields that need to be searchable are indexed, and fields that need to be returned are stored.

3. The index is written to storage, which can be either memory or disk.

The main classes and their functions

a) Analyzer: tokenizer

Lucene ships with several built-in analyzers, such as WhitespaceAnalyzer, which splits on whitespace; StopAnalyzer, which adds stop word filtering; and StandardAnalyzer, the most commonly used.
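To make the difference between whitespace splitting and stop word filtering concrete, here is a rough plain-Java sketch. It only approximates what such analyzers do and does not use Lucene's classes; the class name AnalyzerSketch and the tiny stop word list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerSketch {
    // Hypothetical, deliberately tiny stop word list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    /** Whitespace-style analysis: split on whitespace, keep case and punctuation. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stop-style analysis: split on non-letters, lower-case, drop stop words. */
    static List<String> stopTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The index is fast.";
        System.out.println(whitespaceTokens(text)); // [The, index, is, fast.]
        System.out.println(stopTokens(text));       // [index, fast]
    }
}
```

The same text must be analyzed the same way at index time and at query time, otherwise the query terms will not line up with the indexed terms.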

b) Document: document

This is the structure that wraps our source data. We split the source data into different fields and put them into the Document; when searching, we can specify which fields to search.

c) Directory: an abstraction of where the index is stored. It can be a directory on the file system (FSDirectory) or a piece of memory (RAMDirectory); MMapDirectory accesses an on-disk index through memory mapping.

Keeping the index in memory avoids time-consuming disk I/O; choose according to your needs.

d) IndexWriter: the index writer and maintainer, the class that adds documents to the index and deletes them from it.

e) IndexReader: the index reader, used to read the index in a specified directory.

f) IndexSearcher: the index searcher, the class that searches the index for the user's input.

Note that a search returns (in TopDocs) the IDs of the matching documents, not the documents themselves.

g) Query: the query. The query string must be wrapped in a Query object before it can be handed to the searcher. The smallest unit of a query is the Term. Lucene provides many kinds of Query; choose one according to your needs.

i. TermQuery:

If you want to run a query such as "documents whose content field contains 'lucene'", you can use TermQuery:

    Term t = new Term("content", "lucene");
    Query query = new TermQuery(t);

ii. BooleanQuery: combines several queries with AND/OR relationships

If you want to run a query such as "documents whose content field contains java or perl", you can create two TermQuery objects and combine them with a BooleanQuery:

    TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
    TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
    BooleanQuery booleanQuery = new BooleanQuery();
    booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
    booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);
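Conceptually, two SHOULD clauses behave like a union of the two result sets, while two MUST clauses (BooleanClause.Occur.MUST) behave like an intersection. The plain-Java sketch below shows only this set logic on document IDs; it ignores scoring, and the class name BooleanSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class BooleanSketch {
    /** SHOULD + SHOULD: a document matches if it is in either result set (union). */
    static Set<Integer> should(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    /** MUST + MUST: a document matches only if it is in both result sets (intersection). */
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Made-up document IDs for the two terms.
        Set<Integer> javaDocs = new TreeSet<>(Arrays.asList(1, 2, 4));
        Set<Integer> perlDocs = new TreeSet<>(Arrays.asList(2, 3));
        System.out.println(should(javaDocs, perlDocs)); // [1, 2, 3, 4]
        System.out.println(must(javaDocs, perlDocs));   // [2]
    }
}
```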

iii. WildcardQuery: wildcard query

If you want to run a wildcard query on a word, you can use WildcardQuery. The wildcard '?' matches any single character and '*' matches zero or more characters. For example, searching for 'use*' may find 'useful' or 'useless':

    Query query = new WildcardQuery(new Term("content", "use*"));
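The matching rule for '?' and '*' can be sketched in plain Java by translating the wildcard pattern into a regular expression. This illustrates the semantics only; it is not how Lucene evaluates wildcard queries, and the class name WildcardSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    /** Translates a wildcard pattern into a regex: '?' is any one character, '*' is any run of characters. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("use*", "useful"));  // true
        System.out.println(matches("use*", "useless")); // true
        System.out.println(matches("te?t", "tet"));     // false: '?' needs exactly one character
    }
}
```

A pattern that begins with a wildcard cannot be narrowed by a known prefix, so such patterns tend to be much slower on a large index.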

iv. PhraseQuery: matches terms that appear within a specified distance of each other

Suppose you are interested in Sino-Japanese relations and want to find articles in which '中' (China) and '日' (Japan) appear close together (within 5 characters of each other), ignoring anything farther apart. You can write:

    PhraseQuery query = new PhraseQuery();

    query.setSlop(5);

    query.add(new Term("content", "中"));

    query.add(new Term("content", "日"));

This may match "中日合作……" ("Sino-Japanese cooperation ...") and "中方和日方……" ("the Chinese side and the Japanese side ..."), but not "中国某高层领导说日本欠扁" ("a senior Chinese leader said that Japan ..."), where the two characters are too far apart.
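The distance rule can be sketched in plain Java as "at most slop other tokens between the two terms". This is a simplification: it handles only two terms that appear in query order, whereas Lucene's slop is an edit distance over term positions that also permits reordered terms at extra cost. The class name SlopSketch and the English token lists are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SlopSketch {
    /**
     * Returns true if `second` occurs after `first` with at most `slop`
     * other tokens between them. In-order matches only (a simplification).
     */
    static boolean withinSlop(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // j - i - 1 is the number of tokens strictly between the two terms.
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> near = Arrays.asList("china", "and", "japan", "agree");
        List<String> far = Arrays.asList("china", "a", "b", "c", "d", "e", "f", "japan");
        System.out.println(withinSlop(near, "china", "japan", 5)); // true: 1 token in between
        System.out.println(withinSlop(far, "china", "japan", 5));  // false: 6 tokens in between
    }
}
```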

v. PrefixQuery: matches words that start with a given prefix

If you want to search for words beginning with '中', you can use PrefixQuery:

    PrefixQuery query = new PrefixQuery(new Term("content", "中"));

vi. FuzzyQuery: similarity search

FuzzyQuery searches for similar terms using the Levenshtein (edit distance) algorithm. Suppose you want to search for words similar to 'wuzza'; you can write:

    Query query = new FuzzyQuery(new Term("content", "wuzza"));

You may get 'fuzzy' and 'wuzzy'.
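For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a textbook dynamic-programming sketch in plain Java; Lucene itself uses a different, automaton-based technique to find candidate terms, and the class name LevenshteinSketch is hypothetical.

```java
final class LevenshteinSketch {
    /** Classic two-row dynamic-programming edit distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("wuzza", "wuzzy")); // 1
        System.out.println(distance("wuzza", "fuzzy")); // 2
    }
}
```

Both example results are within two edits of the search term, which is why they count as similar.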

vii. TermRangeQuery: range search

You may want to search for documents whose time field lies between 20060101 and 20060130; you can use TermRangeQuery:

    TermRangeQuery query2 = TermRangeQuery.newStringRange("time", "20060101", "20060130", true, true);

The two trailing true arguments include the lower and upper bounds, that is, a closed interval.
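A string range compares terms as strings, not as numbers or dates, so the dates only sort correctly because they are fixed-width yyyyMMdd strings. The plain-Java sketch below mimics the two inclusion flags with String.compareTo; it is an illustration of the comparison rule, and the class name RangeSketch is hypothetical.

```java
final class RangeSketch {
    /** True if value lies between lower and upper under lexicographic string order. */
    static boolean inRange(String value, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = value.compareTo(lower);
        int hi = value.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(inRange("20060115", "20060101", "20060130", true, true));  // true
        System.out.println(inRange("20060130", "20060101", "20060130", true, false)); // false: upper bound excluded
        // String order is not numeric order: "9" sorts after "20".
        System.out.println(inRange("9", "1", "20", true, true)); // false
    }
}
```

This is why values that are not zero-padded to the same width give surprising results in a string range.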

h) TopDocs: the result set returned by a search. It contains ScoreDoc objects, and the doc member of each ScoreDoc is the document ID mentioned above.

To get the article itself, use this ID to fetch it: the searcher provides a method that returns the Document for a given ID, from which you can read the stored data.

The following example builds an index with IndexWriter (Lucene 4.4, using PaodingAnalyzer for Chinese word segmentation):

    // Lucene 4.4 imports. PaodingAnalyzer comes from the third-party Paoding
    // library; Article, DocRecord, docRecordService and indexDir belong to the
    // surrounding application and are not shown here.
    import java.io.File;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Paoding analyzer (Chinese word segmentation)
    Analyzer analyzer = new PaodingAnalyzer();
    List<Article> resu = docRecordService.wx(0); // fetch the source data
    // Create the directory
    Directory dir = FSDirectory.open(new File(indexDir));
    // Create the writer
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // how the index is maintained
    IndexWriter writer = new IndexWriter(dir, iwc);
    try {
        for (int i = 0; i < resu.size(); i++) {
            Article article = resu.get(i);
            // Document entity class
            DocRecord drec = new DocRecord();
            // Create a document
            Document document = new Document();
            drec.setFileName(article.getPid());
            drec.setDocType("Weixin");
            drec.setLastModify(System.currentTimeMillis());
            drec.setTitle(article.getTitle());
            try {
                System.out.println("Start Word breaker:------>" + i);
                long time = article.getPublishTimeMillis();
                // The literals 11 and 1000 are reconstructed from a garbled source:
                // a value shorter than 11 digits is assumed to be in seconds and is
                // converted to milliseconds.
                if ((time + "").length() < 11) {
                    time = time * 1000;
                }
                String tag = article.getArticleTag();
                // Get the time stamp
                // dateTime = article.getPublishTimeMillis();
                // Field type that is stored but neither indexed nor tokenized
                FieldType testType = new FieldType();
                testType.setIndexed(false);
                testType.setTokenized(false);
                testType.setStored(true);
                // Add title, tags, abstract, category, time, cover image URL, source URL
                document.add(new StringField("FileName", article.getPid(), Field.Store.YES));
                document.add(new TextField("Abstracts",
                        article.getAbstracts() == null ? "" : article.getAbstracts(), Field.Store.YES));
                if (null != article.getPublisher() && !"".equals(article.getPublisher())) {
                    document.add(new StringField("author", article.getPublisher(), Field.Store.YES));
                } else {
                    document.add(new StringField("author", article.getAuthor(), Field.Store.YES));
                }
                document.add(new StringField("Categorycode", article.getCategoryCode(), Field.Store.YES));
                document.add(new StringField("CategoryName",
                        article.getCategoryName() == null ? "" : article.getCategoryName(), Field.Store.YES));
                document.add(new Field("IMAGEURL", article.getThemeImageUrls(), testType));
                document.add(new Field("sourceURL", article.getSourceUrl(), testType));
                document.add(new StringField("Iscopyright", article.getIsCopyright(), Field.Store.YES));
                document.add(new StringField("Source", article.getSource(), Field.Store.YES));
                document.add(new TextField("title", article.getTitle(), Field.Store.YES));
                // Field.Store.YES stores the value so it can be returned with search results
                document.add(new LongField("Date", time, Field.Store.YES));
                if (null != article.getSn() && !"".equals(article.getSn())) {
                    document.add(new StringField("SN", article.getSn(), Field.Store.YES));
                }
                if (null != tag && !"".equals(tag)) {
                    document.add(new StringField("Tag", tag, Field.Store.YES));
                }
                // Replace any existing document that has the same FileName
                writer.updateDocument(
                        new Term("FileName", article.getPid() != null ? article.getPid() : ""), document);
                // Update the database is_index field to indicate that the data has been indexed
                // int id = docRecordService.createDoc(drec);
                System.out.println("Complete Word breaker:------>" + i);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        writer.close();
        dir.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
