Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is built on the Lucene indexing library, but in my view Elasticsearch has the following advantages over Solr:
- Lightweight: easy to install and start; after downloading, a single command brings it up;
- Schema free: a JSON object of arbitrary structure can be submitted to the server, whereas Solr requires the index structure to be specified in schema.xml;
- Multi-index support: another index file can be created simply by using different index parameters, whereas Solr needs separate configuration;
- Distributed: SolrCloud configuration is more complex.
Environment setup
Start Elasticsearch and access port 9200; the returned JSON data can be viewed in a browser. Elasticsearch both accepts and returns data in JSON format.
Install the official Python API. After installing it on OS X, some Python programs failed because the setuptools version was too old; after removing and reinstalling setuptools, everything returned to normal.
>> pip install elasticsearch
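As a quick sanity check that the server is reachable, here is a minimal sketch (assuming Elasticsearch is listening on the default localhost:9200) that pings the server and fetches the same JSON info the browser shows:

from elasticsearch import Elasticsearch

es = Elasticsearch()   # defaults to localhost:9200
print(es.ping())       # True if the server answered
print(es.info())       # same JSON document the browser shows on port 9200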
Index operations
To index a single document, call the create or index method.
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connect to localhost, or Elasticsearch("ip") for a remote server
es.create(index="test-index", doc_type="test-type", id=1, body={
    "any": "data", "timestamp": datetime.now()})
Elasticsearch's batch indexing command is bulk. The current Python API has few documentation examples, and it took quite a bit of time reading the source code to figure out the submission format for batch indexing.
from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())  # df holds the crawled ticket data loaded elsewhere
actions = []
while (j < count):
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}}
    actions.append(action)
    j += 1

    if (len(actions) == 500000):
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

if (len(actions) > 0):
    helpers.bulk(es, actions)
    del actions[0:len(actions)]
The Python API's JSON serialization supports a limited set of data types; the numpy.int32 values in the raw data had to be converted to int before they could be indexed. In addition, the bulk operation defaults to submitting 500 documents per chunk; I changed this to 5000 and even 50000 for testing, and some submissions still failed to index.
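One way to avoid the serialization error is to cast numpy scalars to native Python types while building each _source dict. A sketch of such a helper, where raw_doc is a hypothetical dict holding one row's fields:

import numpy as np

def to_native(value):
    # numpy.int32/int64 and numpy.float64 are not accepted by the client's JSON serializer
    if isinstance(value, np.integer):
        return int(value)
    if isinstance(value, np.floating):
        return float(value)
    return value

source = {key: to_native(val) for key, val in raw_doc.items()}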
# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                   expand_action_callback=expand_action, **kwargs):
    actions = map(expand_action_callback, actions)

    # if raise_on_error is set, we need to collect errors per chunk before raising them
    errors = []

    while True:
        chunk = islice(actions, chunk_size)
        bulk_actions = []
        for action, data in chunk:
            bulk_actions.append(action)
            if data is not None:
                bulk_actions.append(data)

        if not bulk_actions:
            return
        # ... (rest of the function omitted)

def bulk(client, actions, stats_only=False, **kwargs):
    success, failed = 0, 0

    # list of errors to be collected is not stats_only
    errors = []

    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1

    return success, failed if stats_only else errors
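As the source shows, chunk_size is an ordinary keyword argument that bulk forwards to streaming_bulk through **kwargs, so the chunk size can be tuned per call instead of editing helpers.py. A sketch, assuming es and actions are the same as in the indexing example above:

# override the default chunk size of 500 for this call only
success, errors = helpers.bulk(es, actions, chunk_size=1000)
print(success, errors)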
For bulk delete and update operations on an index, the corresponding document formats are shown below; for updates, the fields to be changed must be placed under the doc node.
{
    '_op_type': 'delete',
    '_index': 'index-name',
    '_type': 'document',
    '_id': #,
}
{
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': #,
    'doc': {'question': 'The life, Universe and everything.'}
}
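These action dictionaries go through the same helpers.bulk call as the indexing actions. A minimal sketch, with an id of 1 assumed purely for illustration:

update_action = {
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 1,  # assumed id, replace with a real document id
    'doc': {'question': 'The life, Universe and everything.'}
}
helpers.bulk(es, [update_action])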
Common errors
- SerializationError: JSON data serialization error, usually because a field value's data type is not supported
- RequestError: the submitted data format is incorrect
- ConflictError: index id conflict
- TransportError: the connection could not be established
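All four exceptions can be imported from elasticsearch.exceptions and caught around a bulk call; a sketch:

from elasticsearch.exceptions import SerializationError, RequestError, ConflictError, TransportError

try:
    helpers.bulk(es, actions)
except SerializationError:
    print("a field value has a data type the JSON serializer does not support")
except (RequestError, ConflictError) as e:
    print("bad request or id conflict:", e)
except TransportError as e:
    print("could not reach the server:", e)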
Performance
The above compares storing the same data with MongoDB and Elasticsearch. Although the servers and operations are not identical, it can be seen that for bulk writes the database has the advantage over the index server.
Elasticsearch index files are sharded automatically, and tens of millions of records had no effect on write speed. However, when disk space ran out, Elasticsearch hit a file merge error and lost a large amount of data (more than a million records). After the client stopped writing, the server could not recover on its own and had to be stopped manually. This is fatal in a production environment, especially with non-Java clients: it seems impossible to surface the server-side Java exceptions to the client, so programmers have to watch the server's response information very carefully.