0. Business Scenario
Export all the data in one field of an Elasticsearch index to a file.
1. Introduction to ES data export methods
I found the following ways to export ES data; additions are welcome:
- ES official API: the snapshot and restore module
The snapshot and restore module allows creating snapshots of individual indices, or of an entire cluster, into a remote repository such as a shared file system, S3, or HDFS. Snapshots are great for backups because they can be restored relatively quickly, but they are not archival: they can only be restored to versions of Elasticsearch that can read the index.
In short, it is a tool for mirroring and quickly restoring ES clusters; it does not meet the requirement of exporting a single field, so I did not look into it further (a quick sketch of the workflow follows). Interested readers can consult Elasticsearch Reference [5.0] » Modules » Snapshot and Restore.
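For reference, the basic snapshot workflow looks roughly like this; the repository name `my_backup`, the mount path, and the snapshot name are illustrative assumptions, not values from this article:

```
# Register a shared-filesystem repository (the path must be whitelisted via path.repo)
curl -XPUT http://88.88.88.88:9200/_snapshot/my_backup -d '{ "type": "fs", "settings": { "location": "/mount/backups/my_backup" } }'

# Snapshot the cluster into that repository and wait for completion
curl -XPUT "http://88.88.88.88:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# Restore it later
curl -XPOST http://88.88.88.88:9200/_snapshot/my_backup/snapshot_1/_restore
```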
Java is the programming language I use most, but running one-off Java programs on Linux is a real hassle. Here is a link for exporting ES data to a file with Java, help yourself: Elasticsearch using the Java API for bulk data import and export.
Back to the point: the first hit when Googling "elasticsearch export data" is a Python script; the link is lein-wang/elasticsearch_migrate.
2. The Python script
```python
#!/usr/bin/python
# coding: utf-8
'''
Export and import ElasticSearch data.
Simple example at __main__
@author: [email protected]
@modifier: [email protected]
@note: consistency of the data is not checked, please verify it yourself
'''
import json
import os
import sys
import time
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')


class exportEsData():
    size = 10000

    def __init__(self, url, index, type, target_index):
        self.url = url + "/" + index + "/" + type + "/_search"
        self.index = index
        self.type = type
        self.target_index = target_index  # replaces the original index in the output file name
        self.file_name = self.target_index + "_" + self.type + ".json"

    def exportData(self):
        print("export data begin...\n")
        begin = time.time()
        try:
            os.remove(self.file_name)
        except:
            os.mknod(self.file_name)
        msg = urllib2.urlopen(self.url).read()
        #print(msg)
        obj = json.loads(msg)
        num = obj["hits"]["total"]
        start = 0
        end = num / self.size + 1  # number of batches, reading `size` docs per request
        while start < end:
            try:
                msg = urllib2.urlopen(self.url + "?from=" + str(start * self.size) + "&size=" + str(self.size)).read()
                self.writeFile(msg)
                start += 1
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break
        print(start)
        print("export data end!!!\n total consuming time:" + str(time.time() - begin) + "s")

    def writeFile(self, msg):
        obj = json.loads(msg)
        vals = obj["hits"]["hits"]
        try:
            cnt = 0
            f = open(self.file_name, "a")
            for val in vals:
                val_json = val["_source"]["content"]
                f.write(str(val_json) + "\n")
                cnt += 1
        finally:
            print(cnt)
            f.flush()
            f.close()


class importEsData():
    def __init__(self, url, index, type):
        self.url = url
        self.index = index
        self.type = type
        self.file_name = self.index + "_" + self.type + ".json"

    def importData(self):
        print("import data begin...\n")
        begin = time.time()
        try:
            s = os.path.getsize(self.file_name)
            f = open(self.file_name, "r")
            data = f.read(s)
            # Pitfall here: mind the format required by the bulk API (lines separated by \n)
            self.post(data)
        finally:
            f.close()
        print("import data end!!!\n total consuming time:" + str(time.time() - begin) + "s")

    def post(self, data):
        print data
        print self.url
        req = urllib2.Request(self.url, data)
        r = urllib2.urlopen(req)
        response = r.read()
        print response
        r.close()


if __name__ == '__main__':
    '''
    Export data
    e.g.
                 URL                        index       type          target_index
    exportEsData("http://10.100.142.60:9200", "watchdog", "mexception", "watchdog").exportData()
    export file name: watchdog_mexception.json
    '''
    exportEsData("http://88.88.88.88:9200", "mtnews", "articles", "corpus").exportData()
    '''
    Import data
    * import file name: watchdog_test.json (important)
      the part before "_" is the elasticsearch index
      the part after "_" is the elasticsearch type
    e.g.
                 URL                        index       type
    importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    '''
    #importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "CHAT").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "TOPIC").importData()
```
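A note on the bulk format pitfall flagged in importData: the _bulk endpoint expects newline-delimited JSON, i.e. an action line followed by a source line for each document, and the body must end with a trailing \n. A minimal illustration (the index/type/field names here are just examples):

```
{ "index" : { "_index" : "watchdog", "_type" : "test" } }
{ "content" : "first document" }
{ "index" : { "_index" : "watchdog", "_type" : "test" } }
{ "content" : "second document" }
```

Also note that exportData above writes only the raw content values, one per line, so a file it produces would have to be rewrapped into this action/source form before importData could post it to /_bulk successfully.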
3. Problems encountered
Everything was ready, but when the Python code ran, a problem appeared:
"urllib2.HTTPError: HTTP Error 500: Internal Server Error"
Moreover, judging from the doc count printed by the program, no matter how the bulk size was varied (10/50/100/500/1000/5000/10000 were all tried), the export always got stuck at the 10,000th document, and then urllib2 threw the exception.
Colleague Huang analyzed the possible causes:
- An unbalanced bulk rate: production exceeds consumption capacity, i.e. the ES server side's TPS is exceeded (here Huang, drawing on experience, suggested 5~15 MB per bulk as most suitable)
- A problem on the system side, which requires checking the logs
For the first cause, adding a sleep statement inside the while loop and reducing the bulk size lowered the TPS pushed to ES, but the HTTP status 500 error still appeared at the 10,000-document mark, so that was not it.
For the second cause, one has to log in to the ES host and inspect the logs.
The following information turned up in the log:
```
Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [11000].
See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window]
index level parameter.]
```
As the article on the meaning of urllib2 HTTP status codes puts it, "a status code starting with 5 indicates that the server found itself in error and is unable to continue with the request."
So it really was a server-side problem.
4. The solution
To the point: since the problem has been pinned down, there must be a solution. Referring to "ES error: Result window is too large", the corresponding index needs the following settings change:

```
curl -XPUT http://88.88.88.88:9200/mtnews/_settings -d '{ "index" : { "max_result_window" : 10000000 } }'
```

This raises the index.max_result_window parameter named in the log (the default is 10000). A scroll-based alternative is sketched below.
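Raising max_result_window works, but deep from + size paging stays expensive for the server, and the log itself recommends the scroll API. Below is a minimal sketch of the export loop rewritten with scroll, in the same Python 2 / urllib2 style as the script above; the 1m keep-alive and the batch size of 1000 are arbitrary choices, and the index/type/field names are reused from the example:

```python
import json
import urllib2

url = "http://88.88.88.88:9200"

# Open a scroll context: a normal search with ?scroll=<keep-alive>.
req = urllib2.Request(url + "/mtnews/articles/_search?scroll=1m",
                      json.dumps({"size": 1000}))
req.add_header("Content-Type", "application/json")
resp = json.loads(urllib2.urlopen(req).read())

with open("corpus_articles.json", "a") as f:
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break  # the scroll is exhausted
        for hit in hits:
            f.write(str(hit["_source"]["content"]) + "\n")
        # Fetch the next batch with the scroll id from the previous response.
        req = urllib2.Request(url + "/_search/scroll",
                              json.dumps({"scroll": "1m",
                                          "scroll_id": resp["_scroll_id"]}))
        req.add_header("Content-Type", "application/json")
        resp = json.loads(urllib2.urlopen(req).read())
```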
5. ES learning experience
- When a problem appears, go read the logs right away; it can save a lot of time, haha