Lucene learning: index creation and search

Source: Internet
Author: User
Tags: createindex

First, create the folder indexdocs under the path E:\testlucene\workspacese, along with three TXT files: l1.txt, l2.txt, l3.txt.

l1.txt content:

111111111111111111111111111111111111111111111111111111111111111111111111111
Information retrieval means finding information relevant to the user's request within a collection of information.
Besides text, the retrieved information can also include multimedia such as images, audio, and video; here we focus mainly on text retrieval.
Full-text search: compares the user's query against every word in the full text, regardless of any semantic match between the query and the text.
Among information retrieval tools, full-text retrieval is the most universal and practical. (In general, it matches keywords.)

Data retrieval: both the query and the data in the information system follow a certain format and have a certain structure,
allowing specific fields to be searched. Its performance and usage are limited, and it does not support semantic matching.

Knowledge retrieval: emphasizes knowledge-based, semantic matching (the most complex; it is as if the system already knows the answer to the question
and searches directly for that answer).

Full-text search refers to a technique in which a computer indexing program scans every word in a document and builds an index entry for each word,
recording how many times and at which positions the word occurs in the article. When a user queries a word, the search program looks it up in the pre-built index and returns the matching results to the user.
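As an aside (not part of the l1.txt sample itself), the inverted-index mechanism just described can be sketched in a few lines of plain Java. The class and method names here are hypothetical illustrations, not Lucene APIs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy inverted index: for each word, record which documents contain it
// and at what positions, so a query consults only the index instead of
// scanning every document.
public class InvertedIndex {
    // word -> (docId -> positions of the word within that document)
    private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new HashMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Return the ids of all documents containing the word.
    public Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptyMap()).keySet();
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "information retrieval finds information");
        idx.addDocument(2, "full text search");
        System.out.println(idx.search("information")); // prints [1]
    }
}
```

Lucene's on-disk index is far more elaborate (term dictionaries, postings lists, scoring), but this is the core data structure behind "pre-created index" in the passage above.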

Data retrieval: both the query and the data in the information system must follow a certain format and have a certain structure, allowing specific fields to be searched.
For example, if data is stored in the form "time, person, location, event", a query might be: location = "Beijing". The performance of data retrieval depends on the field-identification scheme used and the user's understanding of that scheme, so it has significant limitations.
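To make the contrast with full-text search concrete (again an illustration, not part of the sample file), field-based data retrieval amounts to exact matching on a named field of structured records. The helper name `queryByField` is invented for this sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of "data retrieval": records follow a fixed structure
// (time, person, location, event) and a query matches one named
// field exactly, e.g. location = "Beijing".
public class DataRetrieval {
    public static List<Map<String, String>> queryByField(
            List<Map<String, String>> records, String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (value.equals(record.get(field))) {
                hits.add(record);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
                Map.of("time", "2012-02-02", "location", "Beijing", "event", "meeting"),
                Map.of("time", "2012-02-03", "location", "Shanghai", "event", "expo"));
        System.out.println(queryByField(records, "location", "Beijing").size()); // prints 1
    }
}
```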

 

l2.txt content:

 

2222222222222222222222222222222222222222222222222222222222222222222222222
Note: software that collects information on the Internet is called a crawler, spider, or web robot (a component outside the search engine proper).
Crawlers visit each web page on the Internet and, on each visit, send the page content back to the local server.
The most important task of information processing is to build indexes over the locally collected information in preparation for queries.
The function of the tokenizer (word divider): it splits text resources according to rules into the smallest units (keywords) for indexing.
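As an aside (not part of the l2.txt sample), the tokenizer's job can be sketched in plain Java. Lucene's StandardAnalyzer additionally removes English stop words and handles many more character classes; this toy version only lowercases and splits on non-alphanumeric characters:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy tokenizer: lowercase the text and split it on runs of
// non-alphanumeric characters, yielding the smallest indexable
// units (keywords).
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Full-text search compares Words."));
        // prints [full, text, search, compares, words]
    }
}
```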

 

l3.txt content:

333333333333333333333333333333333333333333333333333333333333333333333333
Chinese word segmentation: segmenting Chinese is complicated because a single character is not necessarily a word,
and a character sequence that forms a word in one place may not form a word elsewhere. For example, in "帽子和服装" ("hats and clothing"),
the middle characters "和服" ("kimono") do not form a word here. There are three methods for Chinese word segmentation: single-character segmentation, bigram (two-character) segmentation, and dictionary-based segmentation.
Single-character segmentation: splits the text into individual Chinese characters.
Bigram segmentation: splits the text into overlapping two-character units.
Dictionary-based segmentation: constructs candidate words according to some algorithm and matches them against a pre-built dictionary; when a match is found, it is split out as a word.
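The bigram method just listed (which is roughly how Lucene's CJK-oriented analyzers work at their core) is simple enough to sketch in plain Java; this illustration is not part of the l3.txt sample:

```java
import java.util.ArrayList;
import java.util.List;

// Bigram ("binary") segmentation: every pair of adjacent characters
// becomes a candidate token. It needs no dictionary, at the cost of
// producing non-words alongside real words.
public class BigramSegmenter {
    public static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("全文检索")); // prints [全文, 文检, 检索]
    }
}
```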

 

Now that the preparation is complete, let's look at the code:

File2Document.java

package lucene.study;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

/**
 * @author xudongwang 2012-2-2
 *
 * Email: xdwangiflytek@gmail.com
 */
public class File2Document {

    /**
     * File ---> Document
     *
     * @param filePath file path
     * @return Document object
     */
    public static Document file2Document(String filePath) {
        // Fields to store: name, content, size, path
        File file = new File(filePath);

        Document document = new Document();
        // Store.YES: store the value; Store.NO: don't store; Store.COMPRESS: compress, then store.
        // Index.ANALYZED: tokenize, then index; Index.NOT_ANALYZED: index without tokenizing;
        // Index.NO: don't index.
        document.add(new Field("name", file.getName(), Store.YES, Index.ANALYZED));
        document.add(new Field("content", readFileContent(file), Store.YES, Index.ANALYZED));
        // Not tokenized, but still indexed; the file size (long) is converted to a String
        document.add(new Field("size", String.valueOf(file.length()), Store.YES, Index.NOT_ANALYZED));
        // The path does not need to be searchable, only stored
        document.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NOT_ANALYZED));
        return document;
    }

    /**
     * Read the file content.
     *
     * @param file file object
     * @return file content
     */
    private static String readFileContent(File file) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            StringBuffer content = new StringBuffer();
            try {
                for (String line = null; (line = reader.readLine()) != null;) {
                    content.append(line).append("\n");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return content.toString();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * <pre>
     * Two ways to read the "name" attribute value:
     * 1. Field field = document.getField("name");
     *    field.stringValue();
     * 2. document.get("name");
     * </pre>
     *
     * @param document
     */
    public static void printDocumentInfo(Document document) {
        System.out.println("index name --> " + document.get("name"));
        // System.out.println("content --> " + document.get("content"));
        System.out.println("index path --> " + document.get("path"));
        System.out.println("index size --> " + document.get("size"));
    }
}

 

FirstLucene.java

 

package lucene.study;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

/**
 * @author xudongwang 2012-2-2
 *
 * Email: xdwangiflytek@gmail.com
 *
 * Index directory: E:\testlucene\workspacese\indexdocs
 */
public class FirstLucene {

    /** Source file paths */
    private String filePath01 = "E:\\testlucene\\workspacese\\l1.txt";
    private String filePath02 = "E:\\testlucene\\workspacese\\l2.txt";
    private String filePath03 = "E:\\testlucene\\workspacese\\l3.txt";

    /** Index path */
    private String indexPath = "E:\\testlucene\\workspacese\\indexdocs";

    /**
     * Analyzer (tokenizer). Here we use the default StandardAnalyzer
     * (there are several analyzers; this one is not good at Chinese).
     */
    private Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

    private RAMDirectory ramDirectory = null;

    /**
     * Create the index.
     *
     * @throws Exception
     */
    public void createIndex() throws Exception {
        File indexFile = new File(indexPath);
        Directory directory = FSDirectory.open(indexFile);

        // The writer configuration needs two parameters: version and analyzer.
        // There are other parameters, not covered here.
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        conf.setOpenMode(OpenMode.CREATE);

        // IndexWriter operates on the index library (add, delete, update);
        // it needs two arguments: the directory and the writer configuration.
        IndexWriter indexWriter = new IndexWriter(directory, conf);

        // Build the Document objects
        Document doc01 = File2Document.file2Document(filePath01);
        Document doc02 = File2Document.file2Document(filePath02);
        Document doc03 = File2Document.file2Document(filePath03);

        // Add the documents to the index library
        indexWriter.addDocument(doc01);
        indexWriter.addDocument(doc02);
        indexWriter.addDocument(doc03);

        // Close the writer and release resources; the index is now created
        indexWriter.close();
    }

    /**
     * Create an in-memory index.
     *
     * @throws Exception
     */
    public void createRamIndex() throws Exception {
        File indexFile = new File(indexPath);
        Directory directory = FSDirectory.open(indexFile);

        // In-memory index
        // ramDirectory = new RAMDirectory();
        // The constructor with a parameter loads the physical index into memory
        ramDirectory = new RAMDirectory(directory);

        // The writer configuration needs two parameters: version and analyzer
        IndexWriterConfig ramConf = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        // ramConf.setOpenMode(OpenMode.CREATE);
        IndexWriter ramIndexWriter = new IndexWriter(ramDirectory, ramConf);

        Document doc01 = File2Document.file2Document(filePath01);
        Document doc02 = File2Document.file2Document(filePath02);
        Document doc03 = File2Document.file2Document(filePath03);

        ramIndexWriter.addDocument(doc01);
        ramIndexWriter.addDocument(doc02);
        ramIndexWriter.addDocument(doc03);

        // Close the writer and release resources
        ramIndexWriter.close();

        // Merge the in-memory index back into the physical index
        IndexWriterConfig fsConf = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        // fsConf.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter fsIndexWriter = new IndexWriter(directory, fsConf);
        // Merge all the index data from the other index library into the current one
        fsIndexWriter.addIndexes(ramDirectory);
        fsIndexWriter.close();
    }

    /**
     * Search.
     *
     * @param queryStr search keyword
     * @throws Exception
     */
    public void search(String queryStr) throws Exception {
        // 1. Parse the text to be searched into a Query object.
        // Specify which fields to query.
        String[] fields = { "name", "content" };
        // QueryParser parses the user's input string into a Query object.
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35, fields, analyzer);
        // Query: Lucene supports fuzzy, phrase, and combined queries,
        // via classes such as TermQuery, BooleanQuery, WildcardQuery, etc.
        Query query = queryParser.parse(queryStr);

        // 2. Run the query.
        File indexFile = new File(indexPath);
        // IndexSearcher queries the index library
        Directory directory = FSDirectory.open(indexFile);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // A Filter can narrow the results, blocking content you don't want to see
        Filter filter = null;
        // 10000 is the maximum number of documents returned by one query.
        // TopDocs is similar to a result collection.
        TopDocs topDocs = indexSearcher.search(query, filter, 10000);
        System.out.println("A total of [" + topDocs.totalHits + "] documents contain matching results of \"" + queryStr + "\"");
        // Note: this counts matching documents, not the number of hits within each document.

        // 3. Print the results.
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            int docSn = scoreDoc.doc; // internal document number
            Document document = indexSearcher.doc(docSn); // fetch the document by its number
            File2Document.printDocumentInfo(document); // print the document info
        }
    }

    public static void main(String[] args) throws Exception {
        FirstLucene lucene = new FirstLucene();
        lucene.createIndex(); // create the index
        // lucene.createRamIndex(); // create an in-memory index
        lucene.search("Word Segmentation"); // search for the given text
        System.out.println("---------------------------");
        lucene.search("Search");
        System.out.println("---------------------------");
        lucene.search("Index");
        System.out.println("---------------------------");
    }
}

 

Console output:

A total of [3] documents contain matching results of "Word Segmentation"
index name --> l3.txt
index path --> E:\testlucene\workspacese\l3.txt
index size --> 619
index name --> l2.txt
index path --> E:\testlucene\workspacese\l2.txt
index size --> 561
index name --> l1.txt
index path --> E:\testlucene\workspacese\l1.txt
index size --> 1636
---------------------------
A total of [2] documents contain matching results of "Search"
index name --> l1.txt
index path --> E:\testlucene\workspacese\l1.txt
index size --> 1636
index name --> l2.txt
index path --> E:\testlucene\workspacese\l2.txt
index size --> 561
---------------------------
A total of [2] documents contain matching results of "Index"
index name --> l2.txt
index path --> E:\testlucene\workspacese\l2.txt
index size --> 561
index name --> l1.txt
index path --> E:\testlucene\workspacese\l1.txt
index size --> 1636
---------------------------

 

 
