Solr distributed search source code analysis

Source: Internet
Author: User

The master logic of distributed search is implemented in the SearchHandler.handleRequestBody method; see its distributed-request branch for details.
The distributed search process is divided into stages. Stage control is implemented in the distributedProcess method of each SearchComponent, and the shard requests a component produces for each stage are queued via outgoing.add(sreq).
Each component sets and processes the parameters for its stage and adds the resulting requests to outgoing. When the while loop finds that outgoing is non-empty, a distributed call is required: the call submits the current stage's query to every shard and asynchronously collects the results returned by all shards.
After each stage completes, the results are processed:
for (SearchComponent c : components) {
    c.handleResponses(rb, srsp.getShardRequest());
}

After all stages are completed, component.finishStage is called for post-processing:
for (SearchComponent c : components) {
    c.finishStage(rb);
}
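The overall stage-driving mechanism can be sketched as a small, self-contained simulation. Everything below except the idea of distributedProcess() and the outgoing queue is invented for illustration; the real Solr loop also fans the queued ShardRequests out over HTTP and dispatches the responses to handleResponses().

```java
// A minimal, hypothetical simulation of the stage loop in
// SearchHandler.handleRequestBody (names other than distributedProcess
// are invented for illustration).
import java.util.ArrayList;
import java.util.List;

class StageLoopSketch {
    static final int STAGE_DONE = Integer.MAX_VALUE;

    // Stand-in for SearchComponent.distributedProcess: inspects the current
    // stage, may queue shard requests, and returns the next stage it needs
    // (or STAGE_DONE when it is finished).
    interface Component {
        int distributedProcess(int currentStage, List<String> outgoing);
    }

    static List<String> run(List<Component> components) {
        List<String> log = new ArrayList<>();
        int nextStage = 0;
        do {
            int stage = nextStage;
            nextStage = STAGE_DONE;
            List<String> outgoing = new ArrayList<>();
            for (Component c : components) {
                // advance to the smallest stage any component still asks for
                nextStage = Math.min(nextStage, c.distributedProcess(stage, outgoing));
            }
            // if any component queued requests, this is where the real code
            // would submit them to the shards and collect the responses
            for (String sreq : outgoing) {
                log.add("stage " + stage + " -> " + sreq);
            }
        } while (nextStage != STAGE_DONE);
        return log;
    }

    public static void main(String[] args) {
        // one component that wants two stages: 100 (top ids), then 200 (fields)
        Component query = (stage, outgoing) -> {
            if (stage < 100) return 100;
            if (stage == 100) { outgoing.add("GET_TOP_IDS"); return 200; }
            if (stage == 200) { outgoing.add("GET_FIELDS"); return STAGE_DONE; }
            return STAGE_DONE;
        };
        run(List.of(query)).forEach(System.out::println);
    }
}
```

The key design point this illustrates: components do not call each other; each one independently reports the next stage it needs, and the handler advances to the minimum of those, so unrelated components can interleave their work across the same shard round-trips.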

The query is split into two main stages: makeQuery (fetch the top ids) and getFields (fetch the stored fields).
makeQuery: adds the parameter fl=id to the shard URL (plus score when needed). As we know, fl restricts which fields are returned, and Solr uses the uniqueKey field configured in the schema, so this request returns only the id values. After the ids are obtained, QueryComponent merges them; if different shards return the same id, only one copy is kept.
getFields: QueryComponent again encapsulates the request parameters; the most important one is ids, which carries the ids obtained in the previous request as a URL parameter. A second request is then sent to fetch the corresponding fields by id.
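To make the two phases concrete, here is a hypothetical sketch of the key URL parameters each shard receives in each phase. The parameter names (q, fl, ids, start, rows, isShard) follow Solr's HTTP API; the helper class itself is invented for illustration, and real Solr sends many more parameters.

```java
// Illustrative only: the two phases of a distributed query, shown as the
// key URL parameters each shard receives (simplified).
import java.util.List;

class TwoPhaseParams {
    // Phase 1 (makeQuery): fetch only the unique key (and score) from each
    // shard. Each shard is asked for start + rows docs from position 0,
    // because the proxy node re-sorts the merged result itself.
    static String topIdsRequest(String q, int start, int rows) {
        return "q=" + q + "&fl=id,score&start=0&rows=" + (start + rows)
             + "&isShard=true";
    }

    // Phase 2 (getFields): fetch the stored fields only for the winning ids.
    static String getFieldsRequest(String q, List<String> ids, String fl) {
        return "q=" + q + "&ids=" + String.join(",", ids)
             + "&fl=" + fl + "&isShard=true";
    }
}
```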
In fact, before these two stages there is another stage, STAGE_PARSE_QUERY, in which a distributed (global) IDF could be computed. Solr does not implement this: by default each shard computes only its own IDF rather than a global one. With a large amount of evenly distributed data, shard-level TF-IDF does not deviate much, but if the distributed index is very uneven, you may need to pay attention to relevance scoring.
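A small worked example of why uneven shards skew scoring. The formula idf = 1 + ln((docCount + 1) / (docFreq + 1)) is one classic Lucene-style variant, used here purely for illustration (exact formulas differ across Lucene versions and similarities):

```java
// Per-shard vs. global IDF for the same term (illustrative formula only).
class ShardIdfSketch {
    static double idf(long docFreq, long docCount) {
        return 1 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        // a term occurs in 10 of shard A's 1000 docs, but in 900 of
        // shard B's 1000 docs
        System.out.println(idf(10, 1000));   // shard A: rare locally, high idf
        System.out.println(idf(900, 1000));  // shard B: common locally, low idf
        System.out.println(idf(910, 2000));  // the true global idf is in between
    }
}
```

Because each shard scores with its own IDF, the same term boosts shard A's hits far more than shard B's, and the merged ranking mixes scores computed on incompatible scales.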

After the makeQuery stage completes, a mergeIds operation is performed on all docs returned by the shards. In mergeIds, the docs are placed in a priority queue sized according to the page and ordered by the sort fields or score, yielding the doc list for the current page. The priority queue size is start + rows; only rows docs are returned to the client, but the top start + rows docs must be sorted, so for deep pages the memory overhead on the proxy node and the CPU cost of sorting can be significant.
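The merge itself can be sketched with a bounded priority queue. This is a simplified, score-only version: Solr's real merge also handles arbitrary sort fields and shard tie-breaking, and the ShardDoc record and method names here are illustrative, not Solr's actual API.

```java
// Simplified, score-only sketch of the mergeIds idea: keep at most
// start + rows docs in a bounded min-heap while scanning all shard
// results, then return only the `rows` docs for the requested page.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

class MergeIdsSketch {
    record ShardDoc(String id, float score) {}

    static List<ShardDoc> mergeIds(List<List<ShardDoc>> shardResults,
                                   int start, int rows) {
        int queueSize = start + rows;
        // min-heap on score: the head is the weakest doc currently retained
        PriorityQueue<ShardDoc> queue =
            new PriorityQueue<>(Comparator.comparingDouble(ShardDoc::score));
        Set<String> seen = new HashSet<>();
        for (List<ShardDoc> docs : shardResults) {
            for (ShardDoc d : docs) {
                if (!seen.add(d.id())) continue;  // same id on two shards: keep one
                queue.offer(d);
                if (queue.size() > queueSize) queue.poll();  // evict the weakest
            }
        }
        // drain into descending-score order, then skip the first `start` docs
        List<ShardDoc> sorted = new ArrayList<>(queue);
        sorted.sort(Comparator.comparingDouble(ShardDoc::score).reversed());
        return sorted.subList(Math.min(start, sorted.size()),
                              Math.min(start + rows, sorted.size()));
    }
}
```

Note how the heap bound grows with start, not rows: that is exactly why deep paging (large start) inflates memory and sort cost on the proxy node.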

For example, in QueryComponent:

private void handleRegularResponses(ResponseBuilder rb, ShardRequest sreq) {
    if ((sreq.purpose & ShardRequest.PURPOSE_GET_TOP_IDS) != 0) {
        mergeIds(rb, sreq);
    }

    if ((sreq.purpose & ShardRequest.PURPOSE_GET_FIELDS) != 0) {
        returnFields(rb, sreq);
    }
}
