Tutorial on Using Python to Operate Elasticsearch Data Indexes
Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is an indexing server based on Lucene. In my view, however, Elasticsearch has the following advantages over Solr:
- Lightweight: easy to install and start; once downloaded, it can be launched with a single command;
- Schema-free: JSON objects of any structure can be submitted to the server, whereas Solr requires a schema.xml to define the index structure;
- Multi-index support: a new index can be created on the fly with different index parameters, whereas Solr requires separate configuration for each index;
- Distributed: Solr Cloud, by comparison, is complex to configure.
Environment Setup
Start Elasticsearch, which listens on port 9200 by default. You can view the returned JSON data in a browser; Elasticsearch accepts and returns data in JSON format.
>> bin/elasticsearch -f
Install the official Python API. On OS X, I ran into some Python errors after installation because the setuptools version was too old; removing and reinstalling setuptools fixed the problem.
>> pip install elasticsearch
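Before indexing anything, it is worth checking that the client can reach the server. A minimal sketch, assuming Elasticsearch is running locally on the default port 9200:

from elasticsearch import Elasticsearch

# connect to the local node; pass a host such as Elasticsearch("10.18.13.3") for a remote server
es = Elasticsearch()

# ping() returns True if the cluster responds; info() returns the same JSON you see in the browser
print(es.ping())
print(es.info())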
Index Operations
To index a single document, call either the create or the index method.
from datetime import datetime
from elasticsearch import Elasticsearch

# connect to a local server; pass an address such as Elasticsearch("ip") for a remote host
es = Elasticsearch()
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})
Batch indexing in Elasticsearch is done with the bulk command. At the time of writing, the Python API documentation had few examples, and it took me quite a while reading the source code to work out the submission format for batch indexing.
from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())
actions = []
while j < count:
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}
    }
    actions.append(action)
    j += 1
    if len(actions) == 500000:
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

if len(actions) > 0:
    helpers.bulk(es, actions)
    del actions[0:len(actions)]
Here I found that the Python API has limited support for data types when serializing to JSON: the numpy.int32 values in the original data had to be converted to plain int before they could be indexed. In addition, the bulk helper submits 500 documents per request by default; when I raised this to 5,000 or even 50,000 for testing, indexing sometimes failed.
# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                   expand_action_callback=expand_action, **kwargs):
    actions = map(expand_action_callback, actions)
    # if raise on error is set, we need to collect errors per chunk before raising them
    errors = []
    while True:
        chunk = islice(actions, chunk_size)
        bulk_actions = []
        for action, data in chunk:
            bulk_actions.append(action)
            if data is not None:
                bulk_actions.append(data)
        if not bulk_actions:
            return
        # ...

def bulk(client, actions, stats_only=False, **kwargs):
    success, failed = 0, 0
    # list of errors to be collected is not stats_only
    errors = []
    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1
    return success, failed if stats_only else errors
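Based on the chunk_size parameter shown above, the batch size can be adjusted when calling helpers.bulk, and numpy values can be converted before they are placed into an action. A minimal sketch, reusing df, es, actions, and helpers from the earlier example; the to_native helper is my own illustration, not part of the API:

import numpy as np

def to_native(value):
    # convert numpy scalar types (e.g. numpy.int32) to plain Python types so they serialize to JSON
    return value.item() if isinstance(value, np.generic) else value

# example: clean one field before putting it into an action's _source
price = to_native(df[2][0])

# chunk_size is forwarded by helpers.bulk to streaming_bulk; 500 is the default shown above
helpers.bulk(es, actions, chunk_size=500)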
For batch delete and update operations, the corresponding action formats are shown below; for an update, the fields to be changed must be placed under the doc node of the action.
{
    '_op_type': 'delete',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
}
{
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    'doc': {'question': 'The life, universe and everything.'}
}
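These actions are submitted through the same helpers.bulk call as the indexing actions. A minimal sketch, reusing the es connection from above; the index, type, and IDs here are only illustrative:

actions = [
    {'_op_type': 'delete', '_index': 'index-name', '_type': 'document', '_id': 42},
    {'_op_type': 'update', '_index': 'index-name', '_type': 'document', '_id': 43,
     'doc': {'question': 'The life, universe and everything.'}},
]
# bulk returns (number of successes, list of errors) when stats_only is False (the default)
success, errors = helpers.bulk(es, actions)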
Common Errors
SerializationError: JSON serialization failed, usually because the data type of some field value is not supported
RequestError: the submitted data is in the wrong format
ConflictError: index ID conflict
TransportError: a connection to the server could not be established
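These exceptions can be imported from the elasticsearch package (or from elasticsearch.exceptions) and caught so that a single bad document does not abort the whole run. A minimal sketch around the single-document create call from earlier:

from elasticsearch import ConflictError, RequestError, SerializationError, TransportError

try:
    es.create(index="test-index", doc_type="test-type", id=1,
              body={"any": "data", "timestamp": datetime.now()})
except ConflictError:
    # a document with this ID already exists in the index
    pass
except (SerializationError, RequestError, TransportError) as e:
    # unsupported field type, malformed request, or server/connection problem
    print(e)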
Performance
The comparison above shows MongoDB and Elasticsearch storing the same data. Although the servers and the way they were driven are not identical, it is clear that for bulk writes the database still has an advantage over the index server.
Elasticsearch's index files are partitioned automatically, and reaching tens of millions of documents had no effect on write speed. However, when the disk space limit was reached, Elasticsearch produced a file-merge error and lost a large amount of data (more than one million documents). Even after client writes were stopped, the server could not recover on its own and had to be shut down manually. This is fatal in a production environment, especially with a non-Java client: the server-side Java exception does not seem to reach the client, so the programmer must handle the server's responses very carefully.