Tutorial on using Python to operate Elasticsearch data indexes

Elasticsearch is a distributed, RESTful search and analysis server. Like Apache Solr, it is an indexing server based on Lucene. However, I think Elasticsearch has the following advantages over Solr:

  • Lightweight: easy to install and start; a single command starts it once the download is unpacked;
  • Schema free: JSON objects of any structure can be submitted to the server, whereas Solr requires schema.xml to define the index structure;
  • Multi-index support: a new index file can be created with different index parameters, which in Solr requires separate configuration;
  • Distributed: Solr Cloud's configuration is complex.

Environment Setup

Start Elasticsearch. It listens on port 9200, and you can view the returned JSON data in a browser. Elasticsearch both accepts and returns data in JSON format.

>> bin/elasticsearch -f

Install the official Python client. After installing it on OS X, I ran into some Python runtime errors caused by a too-old setuptools version; deleting and reinstalling setuptools fixed them.

>> pip install elasticsearch
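
Once the client is installed, you can verify connectivity from Python. A minimal sketch, assuming a default local installation on localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # or Elasticsearch("ip") for a remote server

# ping() returns True when the cluster is reachable, False otherwise
print(es.ping())
# info() returns the same JSON the browser shows on port 9200
print(es.info())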

Index operation

To index a single document, call the create or index method.

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # create a localhost server connection, or Elasticsearch("ip")
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})

Batch indexing in Elasticsearch is done with the bulk command. The Python API documentation currently offers few examples of it, and it took a lot of time reading the source code to figure out the submission format for batch indexing.

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())  # df holds the crawled ticket data, e.g. a pandas DataFrame
actions = []
while j < count:
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}}
    actions.append(action)
    j += 1

    # flush the accumulated actions periodically
    if len(actions) == 500000:
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

# submit whatever remains after the loop
if len(actions) > 0:
    helpers.bulk(es, actions)
    del actions[0:len(actions)]

Here I found that the Python API has limited data-type support when serializing JSON: the numpy.int32 values in my original data had to be converted to int before they could be indexed. In addition, the bulk operation submits 500 documents per request by default; when I raised that to 5000 or even 50000 for testing, indexing sometimes failed.
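
As an illustration of both points (the variable names here are hypothetical): NumPy scalars can be converted with int()/float() or .item() before they go into _source, and the chunk size can be passed to helpers.bulk explicitly:

import numpy as np

v = np.int32(42)
int(v)    # native Python int, safe to serialize
v.item()  # equivalent conversion

# chunk_size is forwarded to streaming_bulk(); 500 is the library default
helpers.bulk(es, actions, chunk_size=500)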

# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
    expand_action_callback=expand_action, **kwargs):
  actions = map(expand_action_callback, actions)

  # if raise on error is set, we need to collect errors per chunk before raising them
  errors = []

  while True:
    chunk = islice(actions, chunk_size)
    bulk_actions = []
    for action, data in chunk:
      bulk_actions.append(action)
      if data is not None:
        bulk_actions.append(data)

    if not bulk_actions:
      return

def bulk(client, actions, stats_only=False, **kwargs):
  success, failed = 0, 0

  # list of errors to be collected is not stats_only
  errors = []

  for ok, item in streaming_bulk(client, actions, **kwargs):
    # go through request-response pairs and detect failures
    if not ok:
      if not stats_only:
        errors.append(item)
      failed += 1
    else:
      success += 1

  return success, failed if stats_only else errors
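
As the source above shows, bulk returns a (success, errors) pair by default, or (success, failed) counters when stats_only=True. A sketch of reading the result:

# errors is a list of the failed items
success, errors = helpers.bulk(es, actions)

# with stats_only=True both values are plain counters
success, failed = helpers.bulk(es, actions, stats_only=True)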

For batch delete and update operations on the index, the corresponding action formats are shown below. Note that for an update, the new field values must be placed under the doc key of the action.

{
  '_op_type': 'delete',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
}
{
  '_op_type': 'update',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
  'doc': {'question': 'The life, universe and everything.'}
}
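
These action dictionaries are submitted through the same helpers.bulk call as the index actions above. A sketch, reusing the es connection and helpers import from the batch-indexing example:

actions = [
    {'_op_type': 'update', '_index': 'index-name', '_type': 'document', '_id': 42,
     'doc': {'question': 'The life, universe and everything.'}},
    {'_op_type': 'delete', '_index': 'index-name', '_type': 'document', '_id': 42},
]
helpers.bulk(es, actions)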

Common errors

  • SerializationError: JSON serialization failed, usually because the data type of a field value is not supported
  • RequestError: the submitted data format is incorrect
  • ConflictError: index ID conflict
  • TransportError: a connection to the server cannot be established
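
All four exception classes are exposed by the client package, so they can be caught explicitly. A sketch, reusing the es connection from above:

from elasticsearch import (SerializationError, RequestError,
                           ConflictError, TransportError)

try:
    es.create(index="test-index", doc_type="test-type", id=1,
              body={"any": "data"})
except ConflictError:
    print("document with this ID already exists")
except (SerializationError, RequestError) as e:
    print("bad document: %s" % e)
except TransportError as e:
    print("cannot reach the server: %s" % e)
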
Performance

I compared storing the same data in MongoDB and Elasticsearch. Although the servers and the operation methods are not the same, it is clear that for batch writing the database still has an advantage over the index server.

Elasticsearch's index files are sharded automatically, and reaching tens of millions of documents had no effect on write speed. However, when the disk space limit was reached, Elasticsearch hit a file-merge error and lost a large amount of data (more than one million records). After client writes stopped, the server could not recover on its own and had to be shut down manually. This is fatal in a production environment, especially since with a non-Java client the server's Java exceptions apparently cannot be seen on the client side, so the programmer must handle the server's responses very carefully.
