Lucene initial test: some experiences with indexing large texts, Chinese garbled characters, and QueryParser retrieval


Because I used Lucene in a small project over the past few days, I have learned a little about it. I still don't know it deeply, so for now I will just summarize the problems I ran into.

I. Indexing of large text

The "large text" I am talking about is really just a TXT file of about 170 MB, which may not even count as big. Still, indexing it reliably caused a memory overflow: java.lang.OutOfMemoryError: Java heap space. I searched online for a long time and tried a few things, such as tuning the JVM runtime parameters. The test machine is an i5 quad-core with 4 GB of RAM, and the heap I tested with was over 1 GB; shouldn't 170 MB of text fit comfortably? Yet the overflow happened anyway. Without digging into Lucene's internals, I came up with three workarounds: first, preprocess the large text and cut it into small files; second, index the large text in segments, for example writing to disk after every 50 MB; third, index by row, for example reading a fixed batch of rows at a time. Internally these all come down to some form of text splitting; only the implementation differs.

I did not test the first approach; it seemed too troublesome, since I would have to write a separate program just to cut the text. The second approach was my code logic: after reading a certain amount of data I create a Document object, and I call setMaxBufferedDocs(N). I put about 10 MB into each document and then set setMaxBufferedDocs(5). According to Lucene's official documentation, a write to disk is triggered when the in-memory buffer reaches the configured size (200 MB in my configuration) or the number of buffered documents reaches the configured count (5 in this example). By my reasoning that is at most 5 x 10 = 50 MB in memory; the I/O might be frequent enough to hurt speed, but at least it should run. In the actual test, however, the 170 MB file failed to finish indexing after a very long time. By my math 170 MB should need only three or four flushes, yet no result ever came back. So I gave up on this idea and did not dig further.
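For reference, here is a minimal sketch (Lucene 4.10 style, matching the writer setup in the appendix below) of how the two flush triggers are configured; the 200 MB and 5 are the values from my test, not recommendations:

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class FlushConfigSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new MMSegAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
        // A flush to disk is triggered by whichever threshold is hit first:
        iwc.setRAMBufferSizeMB(200); // ...the RAM buffer reaches 200 MB, or
        iwc.setMaxBufferedDocs(5);   // ...5 documents have been buffered.
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("resource\\index")), iwc);
        writer.close();
    }
}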

Later I tested the third approach, which turned out to be feasible and efficient. The idea is to create one Document for every 10,000 lines read. My text happens to be shaped so that roughly 20,000 lines come to about 1 MB, and I then set setMaxBufferedDocs(100). The exact values should be tuned for your own environment. I won't post that version of the code; it simply reads the text line by line in a loop and creates a Document object every 10,000 lines, as in the sketch below.
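A minimal sketch of the batching idea (the helper method is hypothetical, meant to be dropped into an indexing class like the one in the appendix):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Batch a fixed number of lines into one Document before handing it to the writer.
static void indexInBatches(IndexWriter writer, File file, String encoding) throws Exception {
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), encoding));
    StringBuilder batch = new StringBuilder();
    String line;
    int count = 0;
    while ((line = reader.readLine()) != null) {
        batch.append(line).append('\n');
        if (++count % 10000 == 0) {      // one Document per 10,000 lines
            Document doc = new Document();
            doc.add(new TextField("contents", batch.toString(), Store.YES));
            writer.addDocument(doc);
            batch.setLength(0);
        }
    }
    if (batch.length() > 0) {            // don't lose the final partial batch
        Document doc = new Document();
        doc.add(new TextField("contents", batch.toString(), Store.YES));
        writer.addDocument(doc);
    }
    reader.close();
}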

Later, due to business needs, I switched to one Document per line, with setMaxBufferedDocs(20000); the measured efficiency was still acceptable. Retrieval needs every field of a row, so reading by row is the only option. For example, one of my rows is "1052307934----junjun7089059----73.63.134.205----adsl----July 10, 2011 14:28:24, Chuzhou Telecom, Anhui Province". If I search for "Chuzhou, Anhui Province" and this record matches, I need all the information in this one line. If I don't use one document per line, say ten lines per document, there is too much noise data in the result. I wonder whether Lucene provides a way to extract only the information I need? Otherwise I have to write it myself and pick out what I want from the 10 rows, which makes little sense.
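One idea I have not tried (a sketch only; the field names here are made up) is to split each "----"-separated row into separate fields when indexing, so that a hit can return just the columns I care about via doc.get(...) instead of the whole raw line:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical field layout for a row like
// "1052307934----junjun7089059----73.63.134.205----adsl----July 10, 2011 14:28:24, Chuzhou Telecom, Anhui"
static Document lineToDocument(String line) {
    String[] cols = line.split("----");
    Document doc = new Document();
    doc.add(new StringField("id", cols[0], Store.YES));   // exact-match, not analyzed
    doc.add(new StringField("user", cols[1], Store.YES));
    doc.add(new StringField("ip", cols[2], Store.YES));
    doc.add(new StringField("type", cols[3], Store.YES));
    doc.add(new TextField("info", cols[4], Store.YES));   // analyzed for full-text search
    return doc;
}
// Later: searcher.doc(hit.doc).get("info") returns just that column.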

II. Chinese garbled characters

Chinese garbled characters are everywhere. If the file encoding, the system encoding, and the runtime environment encoding all agree, there should be no garbling. If you know the file's encoding, you can construct the reader explicitly:

reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "gbk"));

This solves the problem, but it is troublesome here because my source files are a mix of GBK and UTF-8, so I have to detect the encoding of each file dynamically. I looked into it, and reliably identifying a file's encoding is apparently not so easy; many examples on the Internet are inaccurate or unstable. Fortunately I found one that works for my TXT files. The code is below (shared, not my own):
/**
 * Detect the character encoding by the file's byte-order mark (BOM).
 * @param fileName
 * @return UTF-8 / Unicode / UTF-16BE / GBK
 * @throws Exception
 */
public static String codeStringPlus(String fileName) throws Exception {
    BufferedInputStream bin = null;
    String code = null;
    try {
        bin = new BufferedInputStream(new FileInputStream(fileName));
        int p = (bin.read() << 8) + bin.read();
        switch (p) {
            case 0xefbb:
                code = "UTF-8";
                break;
            case 0xfffe:
                code = "Unicode";
                break;
            case 0xfeff:
                code = "UTF-16BE";
                break;
            default:
                code = "GBK";
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        bin.close();
    }
    return code;
}
That took care of my encoding problem. Thanks to the contributor; it does have defects (it only inspects the first two bytes, so a UTF-8 file without a BOM will be misreported as GBK), but it is enough for me.
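If the BOM trick ever proves too fragile, an alternative I believe would work (an untested sketch, plain JDK 7) is to try strictly decoding the bytes as UTF-8 and fall back to GBK on failure:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GuessEncoding {
    // Returns "UTF-8" if the bytes decode cleanly as UTF-8, otherwise assumes "GBK".
    // For huge files you would probe only the first few KB instead of the whole file.
    public static String guess(String fileName) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on any invalid UTF-8 sequence
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "GBK";
        }
    }
}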

III. Some notes on retrieval

One problem confused me for a while. The scenario is as follows.

Take the text "Hello, I am Chinese".

After the index is created, I can find it by querying "hello" or "Chinese", but why does querying "chine" find nothing? Was I doing something wrong?

Later I figured it out: the analyzer is applied during querying too. As I understand it, Lucene works like this. When the index is created, the text is first tokenized, so the text above is analyzed into "hello", "i", "am", and "chinese" (of course, "i" and "am" might be removed by the analyzer as low-value stop words; assume here that they are kept).
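You can watch the tokenization directly by printing the analyzer's token stream; a minimal sketch (StandardAnalyzer here just because the example is English; my project uses MMSegAnalyzer for Chinese):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
        TokenStream ts = analyzer.tokenStream("contents", "Hello, I am Chinese");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints: hello / i / am / chinese
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}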

Then, because I built the query with QueryParser and passed it the same analyzer, my input also goes through tokenization. When I enter "Chinese", it is first analyzed into the term "chinese", which exists in the index, so the search succeeds. When I enter "chine", the analyzer produces only the term "chine", which is not in the index at all, so of course nothing is found. To make "chine" match, extra work is needed, such as changing the query type; I have not done that yet, which is what caused my confusion.
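For completeness, my understanding is that a prefix (or wildcard) query would make "chine" match, because the prefix term is not run through the analyzer. An untested sketch, assuming searcher and parser are set up as in the query code of section V:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Matches every indexed term in "contents" that starts with "chine", e.g. "chinese".
// Note the lowercase: the analyzer lowercased the terms when indexing.
Query q = new PrefixQuery(new Term("contents", "chine"));
TopDocs hits = searcher.search(q, 20);

// The same thing through QueryParser syntax:
Query q2 = parser.parse("chine*");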

IV. Searching text in other formats

Lucene does not care about the source file's format; more precisely, it only ever sees plain text. So you have to convert documents in other formats into plain text yourself, writing (or reusing) a parser for each format, and only then hand the extracted text to Lucene for indexing.
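I have not done this myself, but one common route (just a sketch; Apache Tika is a separate library, not part of Lucene, and the file path is made up) is to let Tika extract the plain text and then index that text as usual:

import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika sniffs the format (PDF, Word, HTML, ...) and returns plain text,
        // which can then go into a Lucene TextField exactly like a TXT line.
        String text = tika.parseToString(new File("resource/data/sample.pdf"));
        System.out.println(text);
    }
}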

The above is my current understanding of Lucene. I am not sure it is all correct, so treat it as reference only; I hope readers of this article will offer suggestions so we can learn from each other.

V. The following code is attached:

The environment is Windows, Lucene 4.10.0 + MyEclipse 2013 + JDK 1.7, with 4 GB of RAM.

1. Creating the index (plain TXT files are used here)

/**
 * LuceneTest
 * com.lucene.sheen.mine
 */
package com.lucene.sheen.mine;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * @author sheen 2014-9-10
 */
public class MyIndex {

    public static void main(String[] args) throws Exception {
        String docPath = "resource\\data";
        String indexPath = "resource\\index";
        File docFile = new File(docPath);
        if (!docFile.exists() || !docFile.canRead()) {
            System.out.println("The folder you selected does not exist or you are not authorized to access it! File path: "
                    + docFile.getAbsolutePath());
            System.exit(1);
        }
        Date start = new Date();
        Directory indexDir = FSDirectory.open(new File(indexPath));
        Analyzer analyzer = new MMSegAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
        iwc.setRAMBufferSizeMB(200).setMaxBufferedDocs(20000);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(indexDir, iwc);

        MemoryMXBean memoryMBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage usage = memoryMBean.getHeapMemoryUsage();
        System.out.println("init heap: " + usage.getInit());
        System.out.println("max heap: " + usage.getMax());
        System.out.println("used heap: " + usage.getUsed());

        indexDoc(writer, docFile);
        writer.close();
        Date end = new Date();
        seeVMStatus();
        System.out.println("All files have been indexed, time consumed: "
                + (double) (end.getTime() - start.getTime()) / (1000 * 60) + " min");
    }

    /** Recursively index a file or directory, one Document per line. */
    static void indexDoc(IndexWriter writer, File file) throws Exception {
        if (!file.canRead()) {
            return;
        }
        if (file.isDirectory()) {
            for (File thisFile : file.listFiles()) {
                indexDoc(writer, thisFile);
            }
        } else {
            String code = codeString(file.getAbsolutePath());
            System.out.println("*********** File: " + file.getAbsolutePath() + " creating index ***********");
            System.out.println("Character encoding: " + code);
            seeVMStatus();
            BufferedReader reader = null;
            try {
                Field pathField = new StringField("path", file.getPath(), Store.YES);
                reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), code));
                String line = null;
                long fileSize = 0;
                while ((line = reader.readLine()) != null) {
                    fileSize += line.getBytes().length;
                    Document doc = new Document();
                    doc.add(pathField);
                    doc.add(new TextField("contents", line, Store.YES));
                    writer.addDocument(doc);
                }
                System.out.println("Total size: " + fileSize / (1024 * 1024) + " MB");
                System.out.println("Index created\n");
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                reader.close();
            }
        }
    }

    /** Print JVM memory information. */
    public static void seeVMStatus() {
        MemoryMXBean memoryMBean = ManagementFactory.getMemoryMXBean();
        System.out.println("JVM memory information:");
        System.out.println("Heap memory usage: " + memoryMBean.getHeapMemoryUsage());
        System.out.println("Non-heap memory usage: " + memoryMBean.getNonHeapMemoryUsage());
    }

    /**
     * Detect the character encoding by the file's BOM.
     * @param fileName
     * @return UTF-8 / Unicode / UTF-16BE / GBK
     * @throws Exception
     */
    public static String codeStringPlus(String fileName) throws Exception {
        BufferedInputStream bin = null;
        String code = null;
        try {
            bin = new BufferedInputStream(new FileInputStream(fileName));
            int p = (bin.read() << 8) + bin.read();
            switch (p) {
                case 0xefbb:
                    code = "UTF-8";
                    break;
                case 0xfffe:
                    code = "Unicode";
                    break;
                case 0xfeff:
                    code = "UTF-16BE";
                    break;
                default:
                    code = "GBK";
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            bin.close();
        }
        return code;
    }

    /**
     * Detect whether the file is UTF-8 (with BOM) or GBK.
     * @param fileName
     * @return UTF-8 / GBK
     * @throws Exception
     */
    public static String codeString(String fileName) throws Exception {
        BufferedInputStream bin = null;
        String code = null;
        try {
            bin = new BufferedInputStream(new FileInputStream(fileName));
            int p = (bin.read() << 8) + bin.read();
            switch (p) {
                case 0xefbb:
                    code = "UTF-8";
                    break;
                default:
                    code = "GBK";
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            bin.close();
        }
        return code;
    }
}
2. Query

/**
 * LuceneTest
 * com.lucene.sheen.mine
 */
package com.lucene.sheen.mine;

import java.io.File;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * @author sheen 2014-9-10
 */
public class MySearcher {

    public static void main(String[] args) throws IOException, ParseException {
        String index = "resource\\index";
        String field = "contents";
        String queryString = "870270291";

        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new MMSegAnalyzer();
        QueryParser parser = new QueryParser(field, analyzer);
        Query query = parser.parse(queryString);
        System.out.println("Query keyword: " + query.toString());

        Date start = new Date();
        TopDocs results = searcher.search(query, 20);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc sdoc : hits) {
            Document doc = searcher.doc(sdoc.doc);
            System.out.println("Query result:");
            System.out.println(sdoc.score);
            System.out.println(doc.get("path"));
            System.out.println(new String(doc.get("contents").getBytes(), "UTF-8"));
        }
        Date end = new Date();
        System.out.println("Time consumed: " + (end.getTime() - start.getTime()));
    }
}



