By default, a query in ES returns only the first 10 results. The size parameter controls how many results a query returns, but how do we retrieve all the data when a query matches tens of thousands of documents?
The ES API provides scan and scroll for this purpose; a scroll works much like a cursor in a traditional database.
Method 1: Directly use the scroll provided by ES
Step one: Send the following GET request to the ES server. The contents of {} go in the request body. The scroll=10m parameter keeps the scroll context open for 10 minutes.
GET /old_index/_search?scroll=10m
{
  "query": {"match_all": {}},
  "size": 1000
}
After invoking this request, the ES service responds with a JSON similar to the following:
{"_scroll_id": "c2nhbjszozm1mtpvrkjrrhnwbfniv2rplvhlbwlyc1h3ozm1mdpvrkjrrhnwbfniv2rplvhlbwlyc1h3oziznzpnv2jcmkq1rvfbdv90d3zjoevhotl3oze7d g90ywxfagl0czo2njuyow==", "took": 3, "timed_out": false, "_shards": {"total": 3, "successful": 3, "failed": 0}, "hits": {"total": 6652, "max_score": 0.0, "hits": []}}
Here, _scroll_id is needed for the next request; it is the equivalent of a cursor object in a traditional database.
Step two: Send the following GET request to the server, passing the returned _scroll_id as a parameter. The second line goes in the request body.
GET /_search/scroll?scroll=1m
c2nhbjszozm1mtpvrkjrrhnwbfniv2rplvhlbwlyc1h3ozm1mdpvrkjrrhnwbfniv2rplvhlbwlyc1h3oziznzpnv2jcmkq1rvfbdv90d3zjoevhotl3oze7d g90ywxfagl0czo2njuyow==
After invoking this request, the ES service responds with a JSON similar to the following:
{"_scroll_id": "c2nhbjszozm1mtpvrkjrrhnwbfniv2rplvhlbwlyc1h3ozm1mdpvrkjrrhnwbfniv2rplvhlbwlyc1h3oziznzpnv2jcmkq1rvfbdv90d3zjoevhotl3oze7d g90ywxfagl0czo2njuyow==", "took": 2, "timed_out": false, "_shards": {"total": 3, "successful": 3, "failed": 0}, "hits": {"total": 101, "max_score": null, "hits": [{"_index": "old_index", "_type": "3", "_id": "avcoh6dlybq5kuct6s7a", "_score": 1.0, "_source": {document}}, {"_index": "old_index", "_type": "3", "_id": "avcoh6dlybq5kuct6s7a", "_score": 1.0, "_source": {document}}]}}
Looking closely, we see that ES returned the same _scroll_id that we sent to the server, which indicates it is the same object. The scroll ID can change between requests, however, so always pass the most recently returned _scroll_id.
Step three: Repeat step two until the hits array in the response is empty. At that point, all the data for the query has been retrieved.
Step four: Delete the _scroll_id. The DELETE request looks like this:
DELETE /_search/scroll
c2nhbjszozm1mtpvrkjrrhnwbfniv2rplvhlbwlyc1h3ozm1mdpvrkjrrhnwbfniv2rplvhlbwlyc1h3oziznzpnv2jcmkq1rvfbdv90d3zjoevhotl3oze7d g90ywxfagl0czo2njuyow==
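The four steps above can be sketched in Python as a single generator. This is only an illustration: `send(method, path, body)` is a hypothetical transport callable (not part of any ES library) that performs the HTTP request and returns the decoded JSON response, and the names `scroll_all` and `keep_alive` are likewise made up for this example.

```python
def scroll_all(send, index, query, size=1000, keep_alive="1m"):
    """Yield every hit for `query` by following steps one to four.

    `send(method, path, body)` is a hypothetical helper supplied by the
    caller; it must return the response as a dict.
    """
    # Step one: open the scroll and remember the returned _scroll_id.
    resp = send("GET", "/%s/_search?scroll=%s" % (index, keep_alive),
                {"query": query, "size": size})
    scroll_id = resp["_scroll_id"]
    try:
        # The first response may or may not carry hits; yield what is there.
        for hit in resp["hits"]["hits"]:
            yield hit
        while True:
            # Step two: send the latest _scroll_id back in the request body.
            resp = send("GET", "/_search/scroll?scroll=%s" % keep_alive,
                        scroll_id)
            scroll_id = resp.get("_scroll_id", scroll_id)
            hits = resp["hits"]["hits"]
            if not hits:
                # Step three: an empty hits array means we are finished.
                break
            for hit in hits:
                yield hit
    finally:
        # Step four: delete the scroll so the server can free its resources.
        send("DELETE", "/_search/scroll", scroll_id)
```

Because the cleanup sits in a `finally` clause, the DELETE request is also sent when the caller stops iterating early and closes the generator.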
Attention:
The response to a scroll request includes a batch of results. Although we specified a size of 1000, we may get back many more documents. When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch.
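As a quick worked example of that upper bound, here is a minimal sketch. The function name is made up for illustration, and the shard count of 3 is an assumption taken from the _shards section of the sample responses above.

```python
def max_docs_per_batch(size, number_of_primary_shards):
    """Upper bound on documents per batch when size is applied per shard."""
    return size * number_of_primary_shards

# size=1000 from the request above, 3 shards as in the sample responses.
print(max_docs_per_batch(1000, 3))  # 3000
```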
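Method 2: Use the helpers.scan method provided by the Python client
The code using scan looks like this:
from elasticsearch import helpers

# es, _body, _index and _doc_type are defined elsewhere:
# the client, the query body, and the target index and document type.
scanResp = helpers.scan(es, _body, scroll="10m", index=_index, doc_type=_doc_type, timeout="10m")
for resp in scanResp:
    print(resp)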
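Each item yielded by helpers.scan is a single hit, shaped like the entries of the hits array shown in Method 1. If only the stored documents are needed, a small wrapper can pull out the `_source` field. This is a sketch, not part of the client library: `iter_sources` is a hypothetical name, and it assumes every hit carries a `_source` key.

```python
def iter_sources(hits):
    """Yield the _source document of each hit in an iterable of hits."""
    for hit in hits:
        yield hit["_source"]

# Intended use, assuming es and _body exist as in the snippet above:
#   for doc in iter_sources(helpers.scan(es, _body, scroll="10m")):
#       print(doc)
```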