Tutorial on using Python to manipulate Elasticsearch data indexes


Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is built on the Lucene index engine, but in my view Elasticsearch has the following advantages over Solr:

    • Lightweight: easy to install and start; after downloading, a single command starts the server;
    • Schema free: a JSON object of arbitrary structure can be submitted to the server, whereas in Solr the index structure must be specified in schema.xml;
    • Multi-index support: another index can be created simply by using different index parameters, while Solr requires separate configuration;
    • Distributed: SolrCloud configuration is more complex.

Environment setup

Start Elasticsearch. It listens on port 9200, and the returned JSON data can be viewed in a browser; Elasticsearch both accepts and returns data in JSON format.

>> bin/elasticsearch -f

Install the official Python API. On OS X, some Python errors appeared after installation because the setuptools version was too old; after removing and reinstalling it, everything returned to normal.

>> pip install elasticsearch
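
Once the client is installed, a quick sanity check from Python (a minimal sketch, assuming the default localhost:9200 address) confirms the server is reachable:

from elasticsearch import Elasticsearch

es = Elasticsearch()   # connects to localhost:9200 by default
print(es.ping())       # True if the server answers, False otherwise
print(es.info())       # the same cluster/version JSON that the browser shows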

Index operations

To index a single document, call the create or index method.

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # create a connection to localhost, or Elasticsearch("IP")
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})

The Elasticsearch batch indexing command is bulk. The current Python API has few documentation examples for it, and it took a lot of time reading the source code to figure out the submission format for batch indexing.

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())
actions = []
while (j < count):
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}}
    actions.append(action)
    j += 1
    if (len(actions) == 500000):
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

if (len(actions) > 0):
    helpers.bulk(es, actions)
    del actions[0:len(actions)]

When the Python API serializes the data to JSON, the supported data types are limited; for example, the numpy.int32 values in the raw data had to be converted to plain int before they could be indexed. In addition, the bulk operation submits 500 documents per request by default; when I raised it to 5000 or even 50000 for testing, some documents failed to be indexed.
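
A small sketch of both workarounds, assuming illustrative field names; chunk_size is the keyword argument that helpers.bulk forwards to streaming_bulk:

import numpy as np
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
price = np.int32(1280)
action = {
    "_index": "tickets-index",
    "_type": "tickets",
    "_id": 1,
    # int()/float() turn numpy scalars into JSON-serializable Python types
    "_source": {"price": int(price)},
}
# chunk_size controls how many actions go into each bulk request (default 500)
helpers.bulk(es, [action], chunk_size=500)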

# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                   expand_action_callback=expand_action, **kwargs):
    actions = map(expand_action_callback, actions)
    # if raise_on_error is set, we need to collect errors per chunk before raising them
    errors = []

    while True:
        chunk = islice(actions, chunk_size)
        bulk_actions = []
        for action, data in chunk:
            bulk_actions.append(action)
            if data is not None:
                bulk_actions.append(data)
        if not bulk_actions:
            return

def bulk(client, actions, stats_only=False, **kwargs):
    success, failed = 0, 0
    # list of errors to be collected is not stats_only
    errors = []
    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1
    return success, failed if stats_only else errors

For bulk delete and update operations on an index, the corresponding document formats are shown below; for an update, the fields to change must be placed under the doc node of the document.

{
  '_op_type': 'delete',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
}
{
  '_op_type': 'update',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
  'doc': {'question': 'The life, universe and everything.'}
}
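
These action dictionaries are submitted through the same helpers.bulk call used for indexing (a minimal sketch; the index name, type and IDs are placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
actions = [
    {'_op_type': 'delete', '_index': 'index-name', '_type': 'document', '_id': 1},
    {'_op_type': 'update', '_index': 'index-name', '_type': 'document', '_id': 2,
     'doc': {'question': 'The life, universe and everything.'}},
]
# bulk() returns the number of successful actions and a list of error items
success, errors = helpers.bulk(es, actions)
print(success, errors)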

Common errors

The Python client surfaces these as exception classes (a handling sketch follows the list):

    • SerializationError: JSON data serialization error, usually because the data type of a field value is not supported
    • RequestError: the submitted data format is incorrect
    • ConflictError: index ID conflict
    • TransportError: the connection to the server could not be established
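
A minimal handling sketch, reusing the test-index example from above; the exception classes live in elasticsearch.exceptions:

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import (ConflictError, RequestError,
                                      SerializationError, TransportError)

es = Elasticsearch()
try:
    es.create(index="test-index", doc_type="test-type", id=1, body={"any": "data"})
except ConflictError:
    print("a document with this id already exists")
except (SerializationError, RequestError) as e:
    print("unsupported data type or bad request format:", e)
except TransportError as e:
    print("could not reach the server:", e)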

Performance

The above compares storing the same data in MongoDB and in Elasticsearch. Although the servers and operations are not identical, it can be seen that the database still has the advantage for bulk writes.

Elasticsearch index files are sharded automatically, and tens of millions of documents had no effect on write speed. However, when disk space ran out, Elasticsearch hit a file-merge error and lost a large amount of data (more than one million records). After client writes were stopped, the server could not recover automatically and had to be stopped manually. This is fatal in a production environment, especially with non-Java clients, where it seems impossible to receive the server-side Java exceptions on the client, so programmers have to pay very close attention to the server's return information.
