How to read Lucene index data

Source: Internet
Author: User

Original: http://lqgao.spaces.live.com/blog/cns!3BB36966ED98D3E5!408.entry?_c11_blogpart_blogpart=blogview&_c= Blogpart#permalink

This article describes how to use IndexReader to read information from a Lucene index. Why read the index directly? Because I need to implement these features:
(1) the document frequency (DF) of a term over the whole collection;
(2) the total number of occurrences of a term in the whole collection (collection term frequency);
(3) the term frequency (TF) of a term within a single document;
(4) the positions at which a term appears within a document;
(5) the total number of documents in the whole collection.
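To make these five statistics concrete, here is a minimal, self-contained sketch in Python (chosen only for brevity; it is not the Lucene API) that computes DF, collection TF, per-document TF, and positions over a toy corpus. All names and data below are illustrative:

```python
from collections import defaultdict

# Toy corpus: each document is a list of tokens (terms).
docs = {
    0: ["lucene", "index", "lucene"],
    1: ["index", "reader"],
    2: ["lucene", "reader", "reader"],
}

# positions[term][doc_id] -> list of token positions (0-based here).
positions = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in docs.items():
    for pos, term in enumerate(tokens):
        positions[term][doc_id].append(pos)

# (1) document frequency: how many documents contain the term
df = {term: len(doc_map) for term, doc_map in positions.items()}
# (2) collection term frequency: total occurrences across all documents
ctf = {term: sum(len(p) for p in doc_map.values())
       for term, doc_map in positions.items()}
# (3) term frequency of a term within one document
tf = {(term, d): len(p)
      for term, doc_map in positions.items() for d, p in doc_map.items()}
# (5) number of documents in the collection
num_docs = len(docs)

print(df["lucene"], ctf["lucene"], tf[("reader", 2)], positions["index"][1], num_docs)
# → 2 3 2 [0] 3
```

This is exactly the bookkeeping an indexer performs, just without any compression or on-disk layout.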


So what are these data for? They are the necessary "raw materials" of text retrieval (TR). Before indexing there is only the raw text; after it passes through the indexer, the text is broken into terms (or tokens), and the indexer records where and how often each one occurs. With these data, various retrieval models can then be applied to implement the core function of a search engine: text retrieval.


A smart reader might say this sounds like nothing more than counting. Yes, it is counting, or statistics. But this seemingly simple process becomes much less simple once memory constraints enter the picture. Suppose each document has 100 terms, each term needs 10 bytes of information, and there are 1,000,000 documents; that requires 100 × 10 × 10^6 = 10^9 bytes, roughly 2^30 bytes, i.e. about 1 GB. Although 1 GB of memory is nothing nowadays, you cannot keep 1 GB of such data resident in memory all the time. Put it on disk, then, and load the whole 1 GB into memory whenever it is needed; you can go for a cup of coffee while the load completes. And that is only 1,000,000 documents; with more, and with no auxiliary data structures, efficiency would be very poor.
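The back-of-envelope figure can be checked directly; the inputs (100 terms per document, 10 bytes per term, 1,000,000 documents) are the article's own assumptions:

```python
terms_per_doc = 100
bytes_per_term = 10
num_docs = 1_000_000

total_bytes = terms_per_doc * bytes_per_term * num_docs
print(total_bytes)          # → 1000000000, i.e. 10^9 bytes
print(total_bytes / 2**30)  # just under 1 GiB (2^30 ≈ 1.07 × 10^9)
```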

Lucene's index divides the data into segments and reads them only when needed, leaving the rest quietly on disk. Lucene itself is an excellent indexing engine that provides efficient indexing and retrieval mechanisms. The purpose of this article is to show how to read the required information from an already built index using the Lucene API. How the index is built with Lucene will be covered in subsequent articles.

Let's take it one step at a time. Assume an index has already been built and stored in the index directory. To read it, we need a reader, that is, an instance of IndexReader in Lucene. OK, write the following (the code in this article is C#, using DotLucene):
IndexReader reader;
The problem is that IndexReader is an abstract class and cannot be instantiated. So try its derived classes: IndexReader has two subclasses, SegmentReader and MultiReader. But which one? Each of them needs quite a few constructor parameters (it took me some trouble to figure out what they were for; more on that later), so using Lucene's index data does not look so easy after all. By tracing the code and consulting the documentation, I finally found the key to using IndexReader: it exposes a "factory pattern" static interface, IndexReader.Open, defined as follows:
#0001 public static IndexReader Open(System.String path)
#0002 public static IndexReader Open(System.IO.FileInfo path)
#0003 public static IndexReader Open(Directory directory)
#0004 private static IndexReader Open(Directory directory, bool closeDirectory)
The first three are public interfaces available for invocation. Opening an index is as simple as this:
#0001 IndexReader reader = IndexReader.Open(index);
Behind the scenes, opening the index goes through the following process:
#0001 SegmentInfos infos = new SegmentInfos();
#0002 Directory directory = FSDirectory.GetDirectory(index, false);
#0003 infos.Read(directory);
#0004 bool closeDirectory = false;
#0005 if (infos.Count == 1)
#0006 {
#0007     // index is optimized
#0008     return new SegmentReader(infos, infos.Info(0), closeDirectory);
#0009 }
#0010 else
#0011 {
#0012     IndexReader[] readers = new IndexReader[infos.Count];
#0013     for (int i = 0; i < infos.Count; i++)
#0014         readers[i] = new SegmentReader(infos.Info(i));
#0015     return new MultiReader(directory, infos, closeDirectory, readers);
#0016 }
First the index's segment information is read (#0001 ~ #0003); then the number of segments is examined: if there is only one, the index may be optimized, and that segment is read directly (#0008); otherwise each segment is read in turn (#0013 ~ #0014) and the per-segment readers are assembled into a MultiReader (#0015). This is the process of opening an index file.
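The dispatch above is a plain factory pattern: one segment yields a SegmentReader, several yield a MultiReader that wraps one reader per segment. A minimal sketch of the same idea in Python (class and method names here are illustrative stand-ins, not the Lucene API):

```python
class SegmentReader:
    """Reads a single segment (modeled here as a list of documents)."""
    def __init__(self, segment):
        self.segment = segment
    def num_docs(self):
        return len(self.segment)

class MultiReader:
    """Presents several per-segment readers as one logical reader."""
    def __init__(self, readers):
        self.readers = readers
    def num_docs(self):
        return sum(r.num_docs() for r in self.readers)

def open_index(segments):
    """Factory: choose the reader type based on the segment count."""
    if len(segments) == 1:
        return SegmentReader(segments[0])  # optimized index: one segment
    return MultiReader([SegmentReader(s) for s in segments])

reader = open_index([["d0", "d1"], ["d2"]])
print(type(reader).__name__, reader.num_docs())  # → MultiReader 3
```

The caller only ever sees the common reader interface, which is exactly why IndexReader.Open spares you from choosing a concrete subclass yourself.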

Next we'll look at how to read the information. Use the following code to illustrate.
#0001 public static void PrintIndex(IndexReader reader)
#0002 {
#0003     // show how many documents are in the index
#0004     System.Console.WriteLine(reader + "\tnumDocs = " + reader.NumDocs());
#0005     for (int i = 0; i < reader.NumDocs(); i++)
#0006     {
#0007         System.Console.WriteLine(reader.Document(i));
#0008     }
#0009
#0010     // enumerate terms, getting <document, term freq, position*> info
#0011     TermEnum termEnum = reader.Terms();
#0012     while (termEnum.Next())
#0013     {
#0014         System.Console.Write(termEnum.Term());
#0015         System.Console.WriteLine("\tdocFreq=" + termEnum.DocFreq());
#0016
#0017         TermPositions termPositions = reader.TermPositions(termEnum.Term());
#0018         int i = 0;
#0019         int j = 0;
#0020         while (termPositions.Next())
#0021         {
#0022             System.Console.WriteLine((i++) + "-> DocNo:" + termPositions.Doc() + ", Freq:" + termPositions.Freq());
#0023             for (j = 0; j < termPositions.Freq(); j++)
#0024                 System.Console.Write("[" + termPositions.NextPosition() + "]");
#0025             System.Console.WriteLine();
#0026         }
#0027
#0028         // direct access to <term freq, document> information
#0029         TermDocs termDocs = reader.TermDocs(termEnum.Term());
#0030         while (termDocs.Next())
#0031         {
#0032             System.Console.WriteLine((i++) + "-> DocNo:" + termDocs.Doc() + ", Freq:" + termDocs.Freq());
#0033         }
#0034     }
#0035
#0036     // FieldInfos fieldInfos = reader.FieldInfos;
#0037     // FieldInfo pathFieldInfo = fieldInfos.FieldInfo("path");
#0038
#0039     // display the term frequency vectors
#0040     for (int i = 0; i < reader.NumDocs(); i++)
#0041     {
#0042         // the terms tokenized from the "contents" field are in the TermFreqVector
#0043         TermFreqVector termFreqVector = reader.GetTermFreqVector(i, "contents");
#0044
#0045         if (termFreqVector == null)
#0046         {
#0047             System.Console.WriteLine("termFreqVector is null.");
#0048             continue;
#0049         }
#0050
#0051         string fieldName = termFreqVector.GetField();
#0052         string[] terms = termFreqVector.GetTerms();
#0053         int[] frequencies = termFreqVector.GetTermFrequencies();
#0054
#0055         System.Console.Write("FieldName: " + fieldName);
#0056         for (int j = 0; j < terms.Length; j++)
#0057         {
#0058             System.Console.Write("[" + terms[j] + ":" + frequencies[j] + "]");
#0059         }
#0060         System.Console.WriteLine();
#0061     }
#0062     System.Console.WriteLine();
#0063 }
#0004 prints the number of documents.
#0012 ~ #0034 enumerate all the terms in the collection.
Within that, #0017 ~ #0026 enumerate all the positions at which each term appears in each document (the position of the word, counting from 1), and #0029 ~ #0033 compute which documents each term appears in and the corresponding occurrence counts (i.e. DF and TF).
#0036 ~ #0037 are valid only when reader is of type SegmentReader.
#0040 ~ #0061 quickly read the terms appearing in a given document together with their frequencies. However, this part requires that storeTermVector be set to true when the index is built, for example:
doc.Add(Field.Text("contents", reader, true));
where the third parameter is storeTermVector; its default is false.
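Conceptually, a term frequency vector is just the list of (term, frequency) pairs for one document's token stream, stored at indexing time so it can be read back without re-analyzing the text. A minimal sketch in Python (illustrative only, not Lucene's storage format):

```python
from collections import Counter

def term_freq_vector(tokens):
    """Return sorted (term, freq) pairs for one document's token stream."""
    counts = Counter(tokens)
    return sorted(counts.items())

vector = term_freq_vector(["lucene", "index", "lucene", "reader"])
print(vector)  # → [('index', 1), ('lucene', 2), ('reader', 1)]
```

Storing this per document costs extra index space, which is why it is off by default and must be requested when the field is added.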

With these data available, I can compute the statistics I need. In later articles I will explain how to build an index and how to apply Lucene.
