Python urllib2 returns "urllib2.HTTPError: HTTP Error 500: Internal Server Error" when exporting Elasticsearch data

Source: Internet
Author: User
Tags: response code, python script


0. Business Scenario


Export all the data of one field of an ES index to a file.





1. Introduction to ES data export methods


For exporting ES data, I mainly found the following approaches; additions are welcome:


    • The ES official API: the snapshot and restore module

The snapshot and restore module allows creating snapshots of individual indices or an entire cluster into a remote repository like a shared file system, S3, or HDFS. These snapshots are great for backups because they can be restored relatively quickly, but they are not archival because they can only be restored to versions of Elasticsearch that can read the index.


In short, it is a tool for mirroring and quickly restoring ES clusters. It does not meet the requirement of exporting a single field, so I did not pursue it further; a sketch of the API is included below for reference. Interested readers can consult Elasticsearch Reference [5.0] » Modules » Snapshot and Restore.
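For completeness, here is a minimal sketch of what driving the snapshot API from Python could look like, in the same urllib2 style as the script later in this article. The repository name my_backup and the location path are hypothetical, and the location must be whitelisted in the server's path.repo setting:

import urllib2

def put(url, body=None):
    # urllib2 only issues GET/POST by default; force an HTTP PUT
    req = urllib2.Request(url, body)
    req.get_method = lambda: "PUT"
    return urllib2.urlopen(req).read()

# register a shared-filesystem repository, then snapshot into it
put("http://88.88.88.88:9200/_snapshot/my_backup",
    '{"type": "fs", "settings": {"location": "/mount/backups/my_backup"}}')
put("http://88.88.88.88:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true")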


    • The Java API for ES:


Java is the programming language I use most often, but running Java programs on Linux is a real hassle. I will just leave a link about exporting ES data with Java; help yourself: Elasticsearch using Java API bulk data import and export.


    • The Python API for ES:


Back to the point: the first result of a Google search for "Elasticsearch export data" is a Python script; the link is lein-wang/elasticsearch_migrate.


 
#!/usr/bin/python
#coding:utf-8
'''
    Export and Import ElasticSearch Data.
    Simple Example At __main__
    @author: [email protected]
    @modifier: [email protected]
    @note: data consistency is not checked; please verify it yourself
'''

import json
import os
import sys
import time
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')

class exportEsData():
    size = 10000
    def __init__(self, url, index, type, target_index):
        self.url = url + "/" + index + "/" + type + "/_search"
        self.index = index
        self.type = type
        self.target_index = target_index # replaces the original index in the export file name
        self.file_name = self.target_index + "_" + self.type + ".json"
    def exportData(self):
        print("export data begin...\n")
        begin = time.time()
        try:
            os.remove(self.file_name)
        except:
            os.mknod(self.file_name)
        msg = urllib2.urlopen(self.url).read()
        #print(msg)
        obj = json.loads(msg)
        num = obj["hits"]["total"]
        start = 0
        end = num / self.size + 1 # read `size` documents per bulk
        while (start < end):
            try:
                msg = urllib2.urlopen(self.url + "?from=" + str(start * self.size) + "&size=" + str(self.size)).read()
                self.writeFile(msg)
                start = start + 1
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break
        print(start)
        print("export data end!!!\n total consuming time:" + str(time.time() - begin) + "s")
    def writeFile(self, msg):
        obj = json.loads(msg)
        vals = obj["hits"]["hits"]
        try:
            cnt = 0
            f = open(self.file_name, "a")
            for val in vals:
                val_json = val["_source"]["content"]
                f.write(str(val_json) + "\n")
                cnt += 1
        finally:
            print(cnt)
            f.flush()
            f.close()

class importEsData():
    def __init__(self, url, index, type):
        self.url = url
        self.index = index
        self.type = type
        self.file_name = self.index + "_" + self.type + ".json"
    def importData(self):
        print("import data begin...\n")
        begin = time.time()
        try:
            s = os.path.getsize(self.file_name)
            f = open(self.file_name, "r")
            data = f.read(s)
            # pitfall: mind the format the bulk API requires (newline-delimited, ending with \n)
            self.post(data)
        finally:
            f.close()
        print("import data end!!!\n total consuming time:" + str(time.time() - begin) + "s")
    def post(self, data):
        print data
        print self.url
        req = urllib2.Request(self.url, data)
        r = urllib2.urlopen(req)
        response = r.read()
        print response
        r.close()

if __name__ == '__main__':
    '''
        Export Data
        e.g.
                            URL index type
        exportEsData("http://10.100.142.60:9200", "watchdog", "mexception").exportData()

        export file name: watchdog_mexception.json
    '''
    exportEsData("http://88.88.88.88:9200", "mtnews", "articles", "corpus").exportData()

    '''
        Import Data

        * import file name: watchdog_test.json (important)
                    the part before "_" is the elasticsearch index
                    the part after "_" is the elasticsearch type
        e.g.
                            URL index type
        importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    '''
    #importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "CHAT").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "TOPIC").importData()
3. Problems encountered


Everything was ready, but when the Python code ran, a problem appeared:


 
"urllib2.HTTPError: HTTP Error 500: Internal Server Error"


Based on the document count printed by the program, no matter how the bulk size changed (I tried 10/50/100/500/1000/5000/10000), the export always got stuck at the 10,000th document, after which urllib2 threw the exception.



My colleague Huang analyzed the problem and suggested two possible causes:


    • An unbalanced bulk rate: production exceeds consumption capacity, i.e. exceeds the TPS the ES server side can sustain (from experience, Huang suggests a bulk of 5~15 MB is most suitable)
    • A problem on the server side; the logs need to be checked


For the first possibility, I added a sleep statement inside the while loop and reduced the bulk size to lower the TPS on the ES side, as sketched below; the HTTP 500 error still appeared at the 10,000th document, so this did not help.
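For reference, a minimal sketch of that attempt (the article does not state the exact interval; one second here is illustrative). Inside exportEsData.exportData, the loop becomes:

        while (start < end):
            try:
                msg = urllib2.urlopen(self.url + "?from=" + str(start * self.size) + "&size=" + str(self.size)).read()
                self.writeFile(msg)
                start = start + 1
                time.sleep(1)  # throttle: pause between bulks to lower the request rate (TPS) on the ES side
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break

(time is already imported at the top of the script.)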



For the second possibility, I logged in to the ES host to check the logs.



The following information was found in the log:


 
Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [11000].
See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]


As the urllib2 HTTP status code reference explains:


"A response code starting with 5 (5xx) indicates that the server found itself in error and was unable to continue with the request."


So it really was a server-side problem. This also explains why the export always stalled at the 10,000th document regardless of bulk size: whatever the page size, the first request whose from + size exceeds 10,000 is rejected (e.g. with size=1000, the 11th page asks for from=10000&size=1000, giving 11000 > 10000).


4. The solution


To the point: since the problem has been located, there must be a solution. Refer to "ES error: Result window is too large" problem handling.



The following setting needs to be applied to the corresponding index:


 
curl -XPUT http://88.88.88.88:9200/mtnews/_settings -d '{ "index" : { "max_result_window" : 10000000 } }'


This modifies the index.max_result_window parameter mentioned in the log (its default is 10000). A scroll-based alternative, which the log also recommends, is sketched below.
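As the log itself recommends, the scroll API is the more efficient way to pull a large data set, and it avoids the from + size window entirely. Below is a minimal, untested sketch in the same urllib2 style as the script above; the function name scrollExport and the field argument are my own, and the endpoint form assumes ES 2.1+ / 5.x:

import json
import urllib2

def scrollExport(url, index, type, field, file_name, size=1000):
    # open a scroll context and fetch the first page
    msg = urllib2.urlopen(url + "/" + index + "/" + type + "/_search?scroll=1m&size=" + str(size)).read()
    obj = json.loads(msg)
    scroll_id = obj["_scroll_id"]
    f = open(file_name, "a")
    try:
        while len(obj["hits"]["hits"]) > 0:
            for val in obj["hits"]["hits"]:
                f.write(str(val["_source"][field]) + "\n")
            # ask for the next page; an empty hits list means we are done
            body = json.dumps({"scroll": "1m", "scroll_id": scroll_id})
            req = urllib2.Request(url + "/_search/scroll", body)
            obj = json.loads(urllib2.urlopen(req).read())
            scroll_id = obj["_scroll_id"]
    finally:
        f.close()

# e.g. scrollExport("http://88.88.88.88:9200", "mtnews", "articles", "content", "corpus_articles.json")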


5. ES learning experience
    • When you hit a problem, read the logs promptly; it can save a lot of time.

