Inside Lucene: A Popular Search Engine (1) - The Query Mechanism



Searching with TermQuery: the query mechanism

Any user of a search engine, including a system developer, interacts with it through one common entry point: the query. The purpose of the whole search process is to satisfy the query, and the query threads through every step of it. Without a query to drive it, a "search" that starts only from the index content is an aimless, meaningless exercise. So the natural starting point for examining the search process can only be the query.

Let's first see how a Query is used to start a search.


public class BasicSearchingTest extends LiaTestCase {
    public void testTerm() throws Exception {
        IndexSearcher searcher = new IndexSearcher(directory);
        Term t = new Term("subject", "ant");
        Query query = new TermQuery(t);
        Hits hits = searcher.search(query);

        assertEquals("JDwA", 1, hits.length());

        t = new Term("subject", "junit");
        hits = searcher.search(new TermQuery(t));

        assertEquals(2, hits.length());

        searcher.close();
    }
}


Lucene is a classic example of object-oriented design, a mature product of the OOA/OOD method, and its architecture really does model the "query" behavior of the real world. This left me with genuine admiration for the design skills of Lucene's authors. People who study algorithms and people who do engineering think in very different ways; when I worked at Microsoft, the gap between research code and engineering code was a constant headache for me. That Lucene's designers could fit algorithm and code together so cleanly can only be called admirable.

What I really care about is how the search algorithm derives query results from a Query. The code above gives me a starting point: beginning at searcher.search(query), I can trace step by step the role the Query plays in the search process.
To cover real-world query semantics, Lucene provides a large family of Query classes. The TermQuery used in the code above is the simplest; many everyday searches use it directly or in combination with others. The search(termQuery) path is the most basic, involving no tedious query rewriting, so I start from TermQuery and examine the core mechanism of search step by step.

Before stepping into the search method, let me note the constraints I'm working under. In TermQuery's semantics, each Term is a (field, keyword) pair, and the condition a Term describes is: "the specified keyword appears in the specified field (title, author, content, ...)". The more advanced search overloads accept custom filters and sort orders; here I examine the simplest search, without those custom options.

The TermQuery path I investigate here is deliberately the most simplified one, but the developers applied the same ideas to the more complex overloads, so what we learn here can be extrapolated to them.

public class IndexSearcher extends Searcher {
    public TopDocs search(Query query, Filter filter, final int nDocs)
            throws IOException {
        Scorer scorer = query.weight(this).scorer(reader);
        if (scorer == null)
            return new TopDocs(0, new ScoreDoc[0]);

        final BitSet bits = filter != null ? filter.bits(reader) : null;
        final HitQueue hq = new HitQueue(nDocs);
        final int[] totalHits = new int[1];
        scorer.score(new HitCollector() {
            public final void collect(int doc, float score) {
                if (score > 0.0f &&
                    (bits == null || bits.get(doc))) {
                    totalHits[0]++;
                    hq.insert(new ScoreDoc(doc, score));
                }
            }
        });

        ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
        for (int i = hq.size() - 1; i >= 0; i--)
            scoreDocs[i] = (ScoreDoc) hq.pop();

        return new TopDocs(totalHits[0], scoreDocs);
    }
    ...
}


The main flow has two steps:

1. Obtain the Scorer object:

 Scorer scorer = query.weight(searcher).scorer(indexReader);

2. Call the score(collector) method of that Scorer (here a TermScorer):

 scorer.score(new HitCollector() {
 ...
 });
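The bounded-queue idea behind HitQueue in the code above can be isolated into a small, self-contained sketch (class and method names here are mine, not Lucene's): keep only the nDocs best (doc, score) pairs in a min-heap, evict the weakest hit whenever the heap overflows, then pop the survivors out in descending score order, just as the search method pops hq into a ScoreDoc[].

```java
import java.util.PriorityQueue;

// Simplified stand-in for Lucene's HitQueue + collect loop.
// Names are illustrative, not Lucene's.
public class TopNCollector {
    public static final class ScoredDoc {
        public final int doc;
        public final float score;
        public ScoredDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int nDocs;
    // Min-heap: the root is the weakest hit kept so far, so it is cheap to evict.
    private final PriorityQueue<ScoredDoc> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    private int totalHits = 0;

    public TopNCollector(int nDocs) { this.nDocs = nDocs; }

    // Analogue of HitCollector.collect(doc, score).
    public void collect(int doc, float score) {
        if (score <= 0.0f) return;          // same zero-score filter as the search method
        totalHits++;
        heap.offer(new ScoredDoc(doc, score));
        if (heap.size() > nDocs) heap.poll(); // drop the current weakest hit
    }

    public int totalHits() { return totalHits; }

    // Analogue of popping HitQueue into a descending ScoreDoc[].
    public ScoredDoc[] topDocs() {
        ScoredDoc[] out = new ScoredDoc[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out;
    }
}
```

Note the same design choice as Lucene's: totalHits counts every match seen, while the heap keeps only the requested top nDocs.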


After these two steps, the collector has filled hq with the query results, and the caller pops the results from hq one by one. In this code, the IndexReader reads index data for the query, and the Scorer fills the collector with results. The question is: where do the Scorer's "query results" come from? If the IndexReader supplies data to the Scorer, how is that data selected from the index files?

The Scorer uses an anonymous HitCollector to collect the docs that satisfy the TermQuery, but how does the Scorer know which documents match? The real query is not performed inside the score() method. From the data-extraction point of view, a modern search engine pulls the list of documents matching a term out of an inverted index. Obtaining all documents for a given keyword is done entirely with the inverted-index data structure; this is the core of the search process, an indexing technique with a very long history. In this object-oriented search engine framework, which object or class is responsible for extracting the records corresponding to a term, and how does it hand the results to the Scorer?
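As a toy illustration of the inverted-index idea (the names and the naive whitespace tokenizer are my simplifications, not Lucene's on-disk format): the indexer maps each term to the list of documents containing it, so answering a term query becomes a single lookup rather than a scan of every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> ascending list of IDs of documents containing it.
public class ToyInvertedIndex {
    private final Map<String, List<Integer>> index = new HashMap<>();

    // Indexing time: tokenize (here: split on whitespace) and record doc IDs.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = index.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId)
                docs.add(docId);   // keep doc IDs unique and ascending
        }
    }

    // Query time: the posting list for a term IS the result set.
    public List<Integer> docsContaining(String term) {
        return index.getOrDefault(term.toLowerCase(), List.of());
    }
}
```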

Reading the inverted index is the responsibility of IndexReader, as already mentioned; I know this from hints in the Lucene documentation and a good deal of code reading, so I will not dwell on that point. What concerns me now is how the Scorer and the IndexReader interact. Recall that the parameter used to create the Scorer for a TermQuery in the code above is exactly the IndexReader object. One can guess that TermQuery uses its "creation" privilege to quietly wire up the Scorer. Now let's look at what Weight does when constructing the Scorer; the code is written this way:

// In TermQuery's Weight implementation:
public Scorer scorer(IndexReader reader) throws IOException {
    TermDocs termDocs = reader.termDocs(term);

    if (termDocs == null)
        return null;

    return new TermScorer(this, termDocs, getSimilarity(searcher),
                          reader.norms(term.field()));
}

class TermScorer extends Scorer {
    ...
    public void score(HitCollector hc) throws IOException {
        while (next()) {
            hc.collect(doc(), score());
        }
    }
    ...
    public boolean next() throws IOException {
        pointer++;
        if (pointer >= pointerMax) {
            pointerMax = termDocs.read(docs, freqs);  // refill buffer
            if (pointerMax != 0) {
                pointer = 0;
            } else {
                termDocs.close();                     // close stream
                doc = Integer.MAX_VALUE;              // set to sentinel value
                return false;
            }
        }
        doc = docs[pointer];
        return true;
    }
    ...
}

From this we can see the only possible answer to the question above: when Weight constructs the Scorer, it has already decided that the data to be queried sits in termDocs. The Scorer's code also shows that as it traverses all matching documents, the "query" happening behind the scenes is just the enumeration of an array, docs[], and the source of that array is termDocs. The remaining question is what role TermDocs plays in the whole query, and how it reads the data.

Let's see what kind of thing TermDocs actually is.
After creating the TermDocs, the reader calls termDocs.seek(term). This method locates, in the on-disk index files, all the document records for the term; each record holds a document ID and the number of times the term occurs in that document (its term frequency, TF). These records are created at indexing time, and the records for each term are packed contiguously in order, so seek can find both the starting position of the term's records in the index and the total number of matching documents (its document frequency, DF). From then on, the role of TermDocs inside the Scorer is to read out each matching document together with the term's TF in it. Since the records are laid out contiguously at indexing time, reading them is a simple sequential traversal of the index file, with the DF held by TermDocs serving as the traversal count. In other words, the termDocs handed to the Scorer has already had seek called on it and is positioned at the term's data, so the Scorer can traverse every doc containing the TermQuery's term.
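The seek-then-iterate contract described above can be modeled in a few lines of plain Java (a toy stand-in for TermDocs with names of my choosing; real Lucene reads these records from index files, not from a HashMap): seek positions at a term's contiguous run of (docID, TF) records, docFreq reports how many there are, and next/doc/freq walk through them.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the TermDocs contract: seek(term), then iterate
// (docID, termFreq) records. A Map stands in for the on-disk index.
public class ToyTermDocs {
    // postings[term] = {doc0, tf0, doc1, tf1, ...}, docs in ascending order
    private final Map<String, int[]> postings = new HashMap<>();
    private int[] current = new int[0]; // records for the sought term
    private int pos = -1;               // index of the current record

    // "Indexing time": append one (doc, tf) record for a term.
    public void add(String term, int doc, int tf) {
        int[] old = postings.getOrDefault(term, new int[0]);
        int[] now = new int[old.length + 2];
        System.arraycopy(old, 0, now, 0, old.length);
        now[old.length] = doc;
        now[old.length + 1] = tf;
        postings.put(term, now);
    }

    // Analogue of seek(term): position at the start of the term's record run.
    public void seek(String term) {
        current = postings.getOrDefault(term, new int[0]);
        pos = -1;
    }

    public int docFreq() { return current.length / 2; } // the DF: record count

    public boolean next() { return ++pos < docFreq(); }
    public int doc()      { return current[2 * pos]; }
    public int freq()     { return current[2 * pos + 1]; }
}
```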

How the Scorer traverses all the docs
Reading data is a discipline in itself. Anyone who has compared reading one 1000 MB file with reading many 1 KB files knows the results are striking: whenever possible, read the data in large blocks rather than in small fragments. On the other hand, nobody would construct an object that eagerly loads data of unknown size (a term may really match a million documents); any sensible design avoids such a lengthy operation until it is actually requested. Accordingly, Lucene only calls termDocs.read() when the Scorer first needs data, that is, on the first call to next() (the highlighted part of the code above). And to avoid fragmented reads that waste disk efficiency, termDocs.read() reads a whole batch of matching documents at a time (only the document IDs and TFs, of course; at indexing time these two values are written into a dedicated file, with all the records for each term laid out contiguously precisely to avoid fragmented reads). The Scorer's next() then traverses the array of document IDs returned by read(); during the whole traversal it only has to increment the index i into docs[i].
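The buffer-and-refill pattern of next() can be isolated like this (a sketch with names of my own; the int[] source stands in for the on-disk postings stream, and the buffer size of 4 is chosen small just to make the refills visible, where Lucene uses a larger buffer):

```java
// Isolated sketch of TermScorer's buffer-and-refill pattern: pull doc IDs
// from a (simulated) postings source in chunks instead of one at a time.
public class BufferedPostings {
    private static final int BUF = 4;        // tiny, to make refills visible
    private final int[] source;              // simulated postings on "disk"
    private int srcPos = 0;

    private final int[] docs = new int[BUF]; // in-memory buffer
    private int pointer = 0, pointerMax = 0;
    private int doc = -1;
    public int reads = 0;                    // how many bulk reads happened

    public BufferedPostings(int[] source) { this.source = source; }

    // Analogue of termDocs.read(docs, freqs): copy up to BUF entries at once.
    private int read() {
        int n = Math.min(BUF, source.length - srcPos);
        System.arraycopy(source, srcPos, docs, 0, n);
        srcPos += n;
        if (n > 0) reads++;
        return n;
    }

    // Mirrors TermScorer.next(): advance, refilling the buffer only when drained.
    public boolean next() {
        pointer++;
        if (pointer >= pointerMax) {
            pointerMax = read();             // refill buffer
            if (pointerMax == 0) {
                doc = Integer.MAX_VALUE;     // sentinel: iteration exhausted
                return false;
            }
            pointer = 0;
        }
        doc = docs[pointer];
        return true;
    }

    public int doc() { return doc; }
}
```

Traversing six documents with a buffer of four triggers exactly two bulk reads; the first happens lazily, inside the first call to next(), just as in Lucene.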

By now it should be clear what happens during the traversal: each doc (this is the search result) is simply handed to the collector, one by one. When the query completes, we have in effect an int array containing the ID of every result document. To actually use these results, we still need to fetch each document from the document store by its ID, which takes only a call to searcher.doc(id); that process is not covered in this article.

The so-called search turns out to be this simple...

