Tutorial: Using Python to Manipulate Elasticsearch Indexes

Source: Internet
Author: User
Tags: apache solr, solr
Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is an index server based on Lucene, but I think Elasticsearch has the following advantages over Solr:

    • Lightweight: easy to install; after downloading the files, a single command starts it;
    • Schema free: you can submit JSON objects of any structure to the server, whereas Solr uses schema.xml to specify the index structure;
    • Multi-index support: you can create another index with different index parameters, which must be configured separately in Solr;
    • Distributed: SolrCloud is more complex to configure.

Environment setup

Start Elasticsearch. It listens on port 9200, and you can view the JSON it returns in a browser. Elasticsearch accepts and returns data in JSON format.

>> bin/elasticsearch -f
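
As a quick check that the server is up, the request a browser makes can also be sent from Python. This is a minimal sketch using only the standard library; `fetch_json` is a hypothetical helper name, and the URL in the usage note assumes a default local install on port 9200.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen, timeout=5):
    """GET a URL and decode its JSON body into a Python object.

    opener is injectable so the function can be exercised without a
    running server; by default it is urllib's urlopen.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json("http://localhost:9200")` against a running node should return the same JSON shown in the browser, decoded into a dict.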

Install the official Python API. After installing it on OS X, I hit some Python runtime errors because the setuptools version was too old; removing and reinstalling setuptools fixed the problem.

>> pip install elasticsearch

Index operations

To index a single document, call the create or index method.

    from datetime import datetime
    from elasticsearch import Elasticsearch

    # create a localhost server connection, or Elasticsearch("ip")
    es = Elasticsearch()
    es.create(index="test-index", doc_type="test-type", id=1,
              body={"any": "data", "timestamp": datetime.now()})

The Elasticsearch batch indexing command is bulk. The Python API has few documentation samples for it, and it took me a while of reading the source code to figure out the submission format for bulk indexing.

    from datetime import datetime
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers

    es = Elasticsearch("10.18.13.3")

    # df is assumed to be a pandas DataFrame loaded elsewhere (not shown)
    j = 0
    count = int(df[0].count())
    actions = []
    while j < count:
        action = {
            "_index": "tickets-index",
            "_type": "tickets",
            "_id": j + 1,
            "_source": {
                "crawaldate": df[0][j],
                "flight": df[1][j],
                "price": float(df[2][j]),
                "discount": float(df[3][j]),
                "date": df[4][j],
                "takeoff": df[5][j],
                "land": df[6][j],
                "source": df[7][j],
                "timestamp": datetime.now(),
            },
        }
        actions.append(action)
        j += 1

        # submit and clear the buffer once it holds 500000 actions
        if len(actions) == 500000:
            helpers.bulk(es, actions)
            del actions[0:len(actions)]

    # submit whatever is left over
    if len(actions) > 0:
        helpers.bulk(es, actions)
        del actions[0:len(actions)]
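
The loop above builds a large list before each submission. Since `helpers.bulk` accepts any iterable of actions, a generator can produce them lazily instead. The following is a minimal sketch, not the author's code: `ticket_actions` is a hypothetical name, and it assumes each row is an 8-item tuple in the same column order as the example.

```python
from datetime import datetime


def ticket_actions(rows, index="tickets-index", doc_type="tickets"):
    """Yield one bulk action per row, numbering document ids from 1.

    rows is assumed to be an iterable of 8-item tuples ordered as:
    crawaldate, flight, price, discount, date, takeoff, land, source.
    """
    fields = ("crawaldate", "flight", "price", "discount",
              "date", "takeoff", "land", "source")
    for doc_id, row in enumerate(rows, start=1):
        source = dict(zip(fields, row))
        # force plain Python floats so the JSON serializer accepts them
        source["price"] = float(source["price"])
        source["discount"] = float(source["discount"])
        source["timestamp"] = datetime.now()
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            "_source": source,
        }
```

With a connected client this would be submitted as `helpers.bulk(es, ticket_actions(rows))`, leaving the batching to the helper.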

I found that the Python API supports only a limited set of data types when serializing JSON: the numpy.int32 values in the raw data must be converted to int before they can be indexed. In addition, the bulk operation submits 500 documents per request by default; when I raised this to 5000 or even 50000 for testing, some documents failed to index.
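
One way to handle the type conversion is to normalize every value before building the document. This is a minimal sketch that relies on the `.item()` method numpy scalars provide for returning a plain Python value; `to_native` and `clean_document` are hypothetical helper names, and nested structures are not handled.

```python
def to_native(value):
    """Return a plain Python value for objects exposing a scalar .item() method.

    numpy scalars such as numpy.int32 provide .item(); values without it
    (str, int, float, datetime, ...) are returned unchanged.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def clean_document(doc):
    """Convert the top-level values of a flat document dict to native types."""
    return {key: to_native(val) for key, val in doc.items()}
```

Passing each `_source` dict through `clean_document` before adding it to the action list avoids converting fields one by one.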

    # helpers.py source code (excerpt)
    def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                       expand_action_callback=expand_action, **kwargs):
        actions = map(expand_action_callback, actions)

        # if raise on error is set, we need to collect errors per chunk
        # before raising them
        errors = []

        while True:
            chunk = islice(actions, chunk_size)
            bulk_actions = []
            for action, data in chunk:
                bulk_actions.append(action)
                if data is not None:
                    bulk_actions.append(data)

            if not bulk_actions:
                return
            # ... (remainder of the function omitted from this excerpt)

    def bulk(client, actions, stats_only=False, **kwargs):
        success, failed = 0, 0

        # list of errors to be collected if not stats_only
        errors = []

        for ok, item in streaming_bulk(client, actions, **kwargs):
            # go through request-response pairs and detect failures
            if not ok:
                if not stats_only:
                    errors.append(item)
                failed += 1
            else:
                success += 1

        return success, failed if stats_only else errors
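
The chunking in the excerpt comes down to repeatedly slicing an iterator with `islice`. The standalone sketch below illustrates only that idea; `chunked` is a hypothetical name and is not part of the library.

```python
from itertools import islice


def chunked(iterable, chunk_size=500):
    """Yield successive lists of at most chunk_size items from iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            # the iterator is exhausted
            return
        yield chunk
```

Because `bulk` forwards its extra keyword arguments to `streaming_bulk`, the chunk size can be set per call, for example `helpers.bulk(es, actions, chunk_size=1000)`, without editing the library source.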

Bulk delete and update operations each require their own action format; for an update, the changed fields go under the doc node of the action.

    {
        '_op_type': 'delete',
        '_index': 'index-name',
        '_type': 'document',
        '_id': 42,
    }
    {
        '_op_type': 'update',
        '_index': 'index-name',
        '_type': 'document',
        '_id': 42,
        'doc': {'question': 'The life, universe and everything.'}
    }
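
Small builder functions keep these action dicts consistent when many of them are generated. This is a minimal sketch; `delete_action` and `update_action` are hypothetical helper names that only assemble the formats shown above.

```python
def delete_action(index, doc_type, doc_id):
    """Build a bulk action that deletes one document."""
    return {
        "_op_type": "delete",
        "_index": index,
        "_type": doc_type,
        "_id": doc_id,
    }


def update_action(index, doc_type, doc_id, changes):
    """Build a bulk action that partially updates one document.

    changes holds only the fields to modify; it is sent under "doc".
    """
    return {
        "_op_type": "update",
        "_index": index,
        "_type": doc_type,
        "_id": doc_id,
        "doc": changes,
    }
```

A mixed list such as `[delete_action("index-name", "document", 42), update_action("index-name", "document", 43, {"question": "..."})]` can then be passed to `helpers.bulk` like any other actions.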

Common errors

    • SerializationError: JSON serialization error, usually because the data type of a field value is not supported
    • RequestError: the submitted data format is incorrect
    • ConflictError: index ID conflict
    • TransportError: connection cannot be established
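
Of these errors, only connection problems are usually worth retrying; the others point to bad data and will fail again. The sketch below is a generic retry wrapper using only the standard library. `call_with_retry` is a hypothetical name, and which exception classes to pass as `retry_on` is left to the caller.

```python
import time


def call_with_retry(func, retry_on, attempts=3, delay=1.0):
    """Call func() and retry when it raises one of the retry_on exceptions.

    The exception from the final attempt is re-raised so the caller
    still sees the failure; any other exception type propagates at once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

With the Elasticsearch client this could look like `call_with_retry(lambda: helpers.bulk(es, actions), retry_on=(ConnectionError,))`, assuming `ConnectionError` is imported from the `elasticsearch` package. Note that retrying a partly applied bulk request may hit ID conflicts if the actions use create semantics.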

Performance

The comparison above stored the same data in MongoDB and in Elasticsearch. Although neither the servers nor the operations were exactly the same, it shows that the database still has an advantage over the index server for bulk writes.

Elasticsearch index files are split into segments automatically, and write speed was not affected even at tens of millions of documents. However, when the disk space limit was reached, Elasticsearch hit file merge errors and lost a large amount of data (more than one million documents). Even after the client stopped writing, the server did not recover on its own and had to be stopped manually. This is fatal in a production environment, especially with non-Java clients, which seem unable to receive the server's Java exceptions on the client side, so programmers must handle the information returned by the server very carefully.
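
One way to take the server's return information seriously is to verify the result of every bulk call instead of discarding it. The sketch below assumes the return shape shown in the helpers.py excerpt, a `(success, errors)` tuple, or `(success, failed)` when `stats_only` is set; `check_bulk_result` is a hypothetical helper name.

```python
def check_bulk_result(result, expected):
    """Raise if a bulk call reported failures or indexed too few documents.

    result is the tuple returned by the bulk helper; expected is the
    number of actions that were submitted.
    """
    success, errors = result
    # errors is a list of failed items, or a plain count with stats_only=True
    failed = errors if isinstance(errors, int) else len(errors)
    if failed or success != expected:
        raise RuntimeError(
            "bulk write incomplete: %d ok, %d failed, %d expected"
            % (success, failed, expected))
    return success
```

Used as `check_bulk_result(helpers.bulk(es, actions), len(actions))`, this stops the client at the first incomplete batch rather than letting it continue writing to a server that is already failing.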
