Tutorial on using Python to manipulate Elasticsearch data indexes

Last Update:2016-06-06 Source: Internet

Author: User

Tags apache solr solr


Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is an index server built on Lucene, but I think Elasticsearch has the following advantages over Solr:

Lightweight: easy to install; after downloading the files, a single command starts the server.
Schema free: you can submit JSON objects of any structure to the server, whereas Solr uses schema.xml to specify the index structure.
Multi-index support: you can create another index with different index parameters, whereas in Solr each index must be configured separately.
Distributed: Solr Cloud is more complex to configure.

Environment setup

Start Elasticsearch. It listens on port 9200, and you can view the JSON it returns in a browser. Elasticsearch both accepts and returns data in JSON format.

>> bin/elasticsearch -f
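
As a quick check that the server is answering on port 9200, the same JSON the browser shows can be fetched from Python. This is a minimal sketch using only the standard library; the helper name `fetch_server_info` and its `opener` parameter are illustrative assumptions, not part of Elasticsearch or its Python client.

```python
import json
from urllib.request import urlopen


def fetch_server_info(base_url="http://localhost:9200", timeout=5, opener=urlopen):
    """Return the JSON document that the server's root URL responds with.

    `opener` is a hypothetical hook so the function can be exercised
    without a running server; by default it is urllib's urlopen.
    """
    with opener(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server_info()` with no arguments against a running local server should return the same data the browser displays, parsed into a dictionary.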

Install the official Python API. After installing it on OS X, some Python errors appeared at run time because the setuptools version was too old; removing and reinstalling setuptools fixed them.

>> pip install elasticsearch

Index operations

To index a single document, call the create or index method.

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # create a localhost server connection, or Elasticsearch("ip")
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})

The Elasticsearch batch indexing command is bulk. The Python API documentation has few examples of it, and it took me a while reading the source code to work out the submission format for bulk indexing.

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

# df is not defined in this excerpt; it is assumed to be a pandas DataFrame
# loaded earlier, with one column per field below.
es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())
actions = []
while j < count:
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now(),
        },
    }
    actions.append(action)
    j += 1
    # submit and clear the buffer once it reaches 500000 actions
    if len(actions) == 500000:
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

# submit whatever is left over
if len(actions) > 0:
    helpers.bulk(es, actions)
    del actions[0:len(actions)]
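
The loop above builds the actions in a list before submitting them. Since `helpers.bulk` accepts an iterable of actions, a generator is one possible alternative that produces them on demand. The sketch below assumes the rows are available as plain tuples in the same column order as `df`; the function name `generate_actions` and the sample row are hypothetical.

```python
from datetime import datetime

FIELDS = ("crawaldate", "flight", "price", "discount",
          "date", "takeoff", "land", "source")


def generate_actions(rows, index="tickets-index", doc_type="tickets"):
    """Yield one bulk action per row, numbering document ids from 1."""
    for doc_id, row in enumerate(rows, start=1):
        source = dict(zip(FIELDS, row))
        # Cast numeric columns to plain floats so they serialize as JSON numbers.
        source["price"] = float(source["price"])
        source["discount"] = float(source["discount"])
        source["timestamp"] = datetime.now()
        yield {"_index": index, "_type": doc_type, "_id": doc_id, "_source": source}


# Hypothetical sample row, for illustration only.
sample_rows = [("2014-01-01", "CA1234", "560", "0.5",
                "2014-02-01", "08:00", "10:30", "example")]
actions = list(generate_actions(sample_rows))

# With a live connection this could be submitted as:
#   helpers.bulk(es, generate_actions(sample_rows))
```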

I found that the Python API supports only a limited set of data types when serializing JSON: the numpy.int32 values in the raw data had to be converted to int before they could be indexed. In addition, the bulk operation submits 500 documents per request by default; when I changed this to 5,000 or even 50,000 for testing, some documents failed to index.
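
One way to work around the serialization limit is to convert NumPy scalars to native Python values before indexing. The sketch below relies on the `item()` method that NumPy scalar types provide; `to_native` is a hypothetical helper name, and the `FakeInt32` class only stands in for `numpy.int32` so the example runs without NumPy installed.

```python
def to_native(value):
    """Recursively convert NumPy-style scalars to native Python values.

    Lists and tuples are both returned as lists.
    """
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    # NumPy scalars (numpy.int32, numpy.float64, ...) expose item(),
    # which returns the equivalent built-in Python value.
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value


class FakeInt32:
    """Stand-in for numpy.int32, used only to keep this example self-contained."""

    def __init__(self, raw):
        self._raw = raw

    def item(self):
        return int(self._raw)


document = {"price": FakeInt32(560), "tags": [FakeInt32(1), "direct"]}
clean = to_native(document)
```

The helper is meant for scalar values; whole arrays would need to be converted to lists first.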

# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                   expand_action_callback=expand_action, **kwargs):
    actions = map(expand_action_callback, actions)

    # if raise on error is set, we need to collect errors per chunk before raising them
    errors = []

    while True:
        chunk = islice(actions, chunk_size)
        bulk_actions = []
        for action, data in chunk:
            bulk_actions.append(action)
            if data is not None:
                bulk_actions.append(data)

        if not bulk_actions:
            return
        # (the rest of the function is omitted in this excerpt)


def bulk(client, actions, stats_only=False, **kwargs):
    success, failed = 0, 0

    # list of errors to be collected if not stats_only
    errors = []

    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1

    return success, failed if stats_only else errors
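
The chunking loop in `streaming_bulk` can be illustrated in isolation. The sketch below is a simplified stand-in, not the library's code: `chunked` is a hypothetical helper that uses `islice` in the same way to cut an iterable of actions into groups of at most `chunk_size`.

```python
from itertools import islice


def chunked(actions, chunk_size=500):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(actions)  # a single iterator, so islice resumes where it stopped
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


sizes = [len(chunk) for chunk in chunked(range(1200), chunk_size=500)]
print(sizes)  # [500, 500, 200]
```

Because `bulk` forwards its extra keyword arguments to `streaming_bulk`, the chunk size can be set per call, for example `helpers.bulk(es, actions, chunk_size=1000)`, without editing the library source.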

Bulk delete and update operations use the action formats below. For an update, the fields to change must be placed under the doc node of the action.

{
    '_op_type': 'delete',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
}
{
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    'doc': {'question': 'The life, universe and everything.'}
}
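
Building these action dictionaries by hand is error-prone, so small helper functions can keep the format in one place. The following is a sketch; `delete_action` and `update_action` are hypothetical names, and the index and type values simply reuse the placeholders above.

```python
def delete_action(index, doc_type, doc_id):
    """Build a bulk action that deletes one document."""
    return {"_op_type": "delete", "_index": index, "_type": doc_type, "_id": doc_id}


def update_action(index, doc_type, doc_id, changes):
    """Build a bulk action for a partial update; changed fields go under 'doc'."""
    return {
        "_op_type": "update",
        "_index": index,
        "_type": doc_type,
        "_id": doc_id,
        "doc": dict(changes),
    }


actions = [
    delete_action("index-name", "document", 42),
    update_action("index-name", "document", 42,
                  {"question": "The life, universe and everything."}),
]
# With a live connection: helpers.bulk(es, actions)
```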

Common errors

SerializationError: JSON serialization error, usually because the data type of a field value is not supported
RequestError: the submitted data format is incorrect
ConflictError: index ID conflict
TransportError: the connection cannot be established
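
The errors listed above can be told apart with ordinary `try`/`except` handling. The sketch below assumes the exception hierarchy of the elasticsearch-py releases contemporary with this article, in which `RequestError` and `ConflictError` derive from `TransportError`; the function name `create_document` and its return labels are hypothetical.

```python
def create_document(es, index, doc_type, doc_id, body):
    """Create one document and report which kind of failure occurred, if any.

    Returns "ok" or a short label for the error class that was raised.
    """
    # Imported lazily so this module can be loaded without the client installed.
    from elasticsearch import exceptions

    try:
        es.create(index=index, doc_type=doc_type, id=doc_id, body=body)
    except exceptions.SerializationError:
        return "serialization"  # a value could not be converted to JSON
    except exceptions.ConflictError:
        return "conflict"       # a document with this id already exists
    except exceptions.RequestError:
        return "request"        # the server rejected the submitted data
    except exceptions.TransportError:
        return "transport"      # caught last: assumed base class of the two above
    return "ok"
```

The order of the `except` clauses matters under that assumption: catching `TransportError` first would hide the more specific conflict and request errors.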

Performance

The original article presents here a comparison of storing the same data in MongoDB and in Elasticsearch. Although neither the servers nor the operations were exactly the same, it shows that the database still has an advantage over the index server for bulk writes.

Elasticsearch index files are chunked automatically, and write speed was not affected even with tens of millions of documents. However, when the disk space limit was reached, Elasticsearch hit file merge errors and lost a large amount of data (more than one million records). Even after the client stopped writing, the server could not recover automatically and had to be stopped manually. This is fatal in a production environment, especially with non-Java clients, which apparently cannot receive the server's Java exceptions on the client side, so programmers must handle the information returned by the server very carefully.



This article is an English version of an article originally written in Chinese on aliyun.com.
