Python urllib2 returns "urllib2.HTTPError: HTTP Error 500: Internal Server Error" when exporting Elasticsearch data

Source: Internet
Author: User
Tags: response code, python script


0. Business Scenario


Export all the data of one field of an ES index to a file.





1. Introduction to ES data export methods


For exporting ES data, I mainly found the following approaches; additions are welcome:


    • The ES official API: the snapshot and restore module

The snapshot and restore module allows creating snapshots of individual indices or an entire cluster into a remote repository like a shared file system, S3, or HDFS. These snapshots are great for backups because they can be restored relatively quickly, but they are not archival because they can only be restored to versions of Elasticsearch that can read the index.


In short, it is a tool for mirroring and quickly restoring ES clusters. It does not meet the requirement of exporting a single field, so I did not pursue it further; a sketch of the API is included below for reference. Interested readers can consult Elasticsearch Reference [5.0] » Modules » Snapshot and Restore.
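For completeness, here is a minimal sketch of what driving the snapshot API from Python could look like, in the same urllib2 style as the script later in this article. The repository name my_backup and the location path are hypothetical, and the location must be whitelisted in the server's path.repo setting:

import urllib2

def put(url, body=None):
    # urllib2 only issues GET/POST by default; force an HTTP PUT
    req = urllib2.Request(url, body)
    req.get_method = lambda: "PUT"
    return urllib2.urlopen(req).read()

# register a shared-filesystem repository, then snapshot into it
put("http://88.88.88.88:9200/_snapshot/my_backup",
    '{"type": "fs", "settings": {"location": "/mount/backups/my_backup"}}')
put("http://88.88.88.88:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true")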


    • The Java API for ES:


Java is the programming language I use most often, but running Java programs on Linux is a real hassle. I will just leave a link about exporting ES data with Java; help yourself: Elasticsearch using Java API bulk data import and export.


    • The Python API for ES:


Back to the point: the first result of a Google search for "Elasticsearch export data" is a Python script; the link is lein-wang/elasticsearch_migrate.


 
#!/usr/bin/python
#coding:utf-8
'''
    Export and Import ElasticSearch Data.
    Simple Example At __main__
    @author: [email protected]
    @modifier: [email protected]
    @note: data consistency is not checked; please verify it yourself
'''

import json
import os
import sys
import time
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')

class exportEsData():
    size = 10000
    def __init__(self, url, index, type, target_index):
        self.url = url + "/" + index + "/" + type + "/_search"
        self.index = index
        self.type = type
        self.target_index = target_index # replaces the original index in the export file name
        self.file_name = self.target_index + "_" + self.type + ".json"
    def exportData(self):
        print("export data begin...\n")
        begin = time.time()
        try:
            os.remove(self.file_name)
        except:
            os.mknod(self.file_name)
        msg = urllib2.urlopen(self.url).read()
        #print(msg)
        obj = json.loads(msg)
        num = obj["hits"]["total"]
        start = 0
        end = num / self.size + 1 # read `size` documents per bulk
        while (start < end):
            try:
                msg = urllib2.urlopen(self.url + "?from=" + str(start * self.size) + "&size=" + str(self.size)).read()
                self.writeFile(msg)
                start = start + 1
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break
        print(start)
        print("export data end!!!\n total consuming time:" + str(time.time() - begin) + "s")
    def writeFile(self, msg):
        obj = json.loads(msg)
        vals = obj["hits"]["hits"]
        try:
            cnt = 0
            f = open(self.file_name, "a")
            for val in vals:
                val_json = val["_source"]["content"]
                f.write(str(val_json) + "\n")
                cnt += 1
        finally:
            print(cnt)
            f.flush()
            f.close()

class importEsData():
    def __init__(self, url, index, type):
        self.url = url
        self.index = index
        self.type = type
        self.file_name = self.index + "_" + self.type + ".json"
    def importData(self):
        print("import data begin...\n")
        begin = time.time()
        try:
            s = os.path.getsize(self.file_name)
            f = open(self.file_name, "r")
            data = f.read(s)
            # pitfall: mind the format the bulk API requires (newline-delimited, ending with \n)
            self.post(data)
        finally:
            f.close()
        print("import data end!!!\n total consuming time:" + str(time.time() - begin) + "s")
    def post(self, data):
        print data
        print self.url
        req = urllib2.Request(self.url, data)
        r = urllib2.urlopen(req)
        response = r.read()
        print response
        r.close()

if __name__ == '__main__':
    '''
        Export Data
        e.g.
                            URL index type
        exportEsData("http://10.100.142.60:9200", "watchdog", "mexception").exportData()

        export file name: watchdog_mexception.json
    '''
    exportEsData("http://88.88.88.88:9200", "mtnews", "articles", "corpus").exportData()

    '''
        Import Data

        * import file name: watchdog_test.json (important)
                    the part before "_" is the elasticsearch index
                    the part after "_" is the elasticsearch type
        e.g.
                            URL index type
        importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    '''
    #importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "CHAT").importData()
    #importEsData("http://127.0.0.1:9200/_bulk", "chat", "TOPIC").importData()
3. Problems encountered


Everything was ready, but when the Python code ran, a problem appeared:


 
"urllib2.HTTPError: HTTP Error 500: Internal Server Error"


Based on the document count printed by the program, no matter how the bulk size changed (I tried 10/50/100/500/1000/5000/10000), the export always got stuck at the 10,000th document, after which urllib2 threw the exception.



My colleague Huang analyzed the problem and suggested two possible causes:


    • An unbalanced bulk rate: production exceeds consumption capacity, i.e. exceeds the TPS the ES server side can sustain (from experience, Huang suggests a bulk of 5~15 MB is most suitable)
    • A problem on the server side; the logs need to be checked


For the first possibility, I added a sleep statement inside the while loop and reduced the bulk size to lower the TPS on the ES side, as sketched below; the HTTP 500 error still appeared at the 10,000th document, so this did not help.
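For reference, a minimal sketch of that attempt (the article does not state the exact interval; one second here is illustrative). Inside exportEsData.exportData, the loop becomes:

        while (start < end):
            try:
                msg = urllib2.urlopen(self.url + "?from=" + str(start * self.size) + "&size=" + str(self.size)).read()
                self.writeFile(msg)
                start = start + 1
                time.sleep(1)  # throttle: pause between bulks to lower the request rate (TPS) on the ES side
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break

(time is already imported at the top of the script.)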



For the second possibility, I logged in to the ES host to check the logs.



The following information was found in the log:


 
Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [11000].
See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]


As the urllib2 HTTP status code reference explains:


"A response code starting with 5 (5xx) indicates that the server found itself in error and was unable to continue with the request."


So it really was a server-side problem. This also explains why the export always stalled at the 10,000th document regardless of bulk size: whatever the page size, the first request whose from + size exceeds 10,000 is rejected (e.g. with size=1000, the 11th page asks for from=10000&size=1000, giving 11000 > 10000).


4. The solution


To the point: since the problem has been located, there must be a solution. Refer to "ES error: Result window is too large" problem handling.



The following setting needs to be applied to the corresponding index:


 
curl -XPUT http://88.88.88.88:9200/mtnews/_settings -d '{ "index" : { "max_result_window" : 10000000 } }'


This modifies the index.max_result_window parameter mentioned in the log (its default is 10000). A scroll-based alternative, which the log also recommends, is sketched below.
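As the log itself recommends, the scroll API is the more efficient way to pull a large data set, and it avoids the from + size window entirely. Below is a minimal, untested sketch in the same urllib2 style as the script above; the function name scrollExport and the field argument are my own, and the endpoint form assumes ES 2.1+ / 5.x:

import json
import urllib2

def scrollExport(url, index, type, field, file_name, size=1000):
    # open a scroll context and fetch the first page
    msg = urllib2.urlopen(url + "/" + index + "/" + type + "/_search?scroll=1m&size=" + str(size)).read()
    obj = json.loads(msg)
    scroll_id = obj["_scroll_id"]
    f = open(file_name, "a")
    try:
        while len(obj["hits"]["hits"]) > 0:
            for val in obj["hits"]["hits"]:
                f.write(str(val["_source"][field]) + "\n")
            # ask for the next page; an empty hits list means we are done
            body = json.dumps({"scroll": "1m", "scroll_id": scroll_id})
            req = urllib2.Request(url + "/_search/scroll", body)
            obj = json.loads(urllib2.urlopen(req).read())
            scroll_id = obj["_scroll_id"]
    finally:
        f.close()

# e.g. scrollExport("http://88.88.88.88:9200", "mtnews", "articles", "content", "corpus_articles.json")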


5. ES learning experience
    • When you hit a problem, read the logs promptly; it can save a lot of time.

