Tutorial on using Python to manipulate Elasticsearch data indexes


Elasticsearch is a distributed, RESTful search and analytics server. Like Apache Solr, it is built on the Lucene index engine, but in my view Elasticsearch has the following advantages over Solr:

    • Lightweight: easy to install and start; after downloading, a single command starts the server;
    • Schema free: a JSON object of arbitrary structure can be submitted to the server, whereas in Solr the index structure must be specified in schema.xml;
    • Multi-index support: another index can be created simply by using different index parameters, while Solr requires separate configuration;
    • Distributed: SolrCloud configuration is more complex.

Environment setup

Start Elasticsearch. It listens on port 9200, and the returned JSON data can be viewed in a browser; Elasticsearch both accepts and returns data in JSON format.

>> bin/elasticsearch -f

Install the official Python API. On OS X, some Python errors appeared after installation because the setuptools version was too old; after removing and reinstalling it, everything returned to normal.

>> pip install elasticsearch
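
Once the client is installed, a quick sanity check from Python (a minimal sketch, assuming the default localhost:9200 address) confirms the server is reachable:

from elasticsearch import Elasticsearch

es = Elasticsearch()   # connects to localhost:9200 by default
print(es.ping())       # True if the server answers, False otherwise
print(es.info())       # the same cluster/version JSON that the browser shows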

Index operations

To index a single document, call the create or index method.

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # create a connection to localhost, or Elasticsearch("IP")
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})

The Elasticsearch batch indexing command is bulk. The current Python API has few documentation examples for it, and it took a lot of time reading the source code to figure out the submission format for batch indexing.

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())
actions = []
while (j < count):
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}}
    actions.append(action)
    j += 1
    if (len(actions) == 500000):
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

if (len(actions) > 0):
    helpers.bulk(es, actions)
    del actions[0:len(actions)]

When the Python API serializes the data to JSON, the supported data types are limited; for example, the numpy.int32 values in the raw data had to be converted to plain int before they could be indexed. In addition, the bulk operation submits 500 documents per request by default; when I raised it to 5000 or even 50000 for testing, some documents failed to be indexed.
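
A small sketch of both workarounds, assuming illustrative field names; chunk_size is the keyword argument that helpers.bulk forwards to streaming_bulk:

import numpy as np
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
price = np.int32(1280)
action = {
    "_index": "tickets-index",
    "_type": "tickets",
    "_id": 1,
    # int()/float() turn numpy scalars into JSON-serializable Python types
    "_source": {"price": int(price)},
}
# chunk_size controls how many actions go into each bulk request (default 500)
helpers.bulk(es, [action], chunk_size=500)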

# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
                   expand_action_callback=expand_action, **kwargs):
    actions = map(expand_action_callback, actions)
    # if raise_on_error is set, we need to collect errors per chunk before raising them
    errors = []

    while True:
        chunk = islice(actions, chunk_size)
        bulk_actions = []
        for action, data in chunk:
            bulk_actions.append(action)
            if data is not None:
                bulk_actions.append(data)
        if not bulk_actions:
            return

def bulk(client, actions, stats_only=False, **kwargs):
    success, failed = 0, 0
    # list of errors to be collected is not stats_only
    errors = []
    for ok, item in streaming_bulk(client, actions, **kwargs):
        # go through request-response pairs and detect failures
        if not ok:
            if not stats_only:
                errors.append(item)
            failed += 1
        else:
            success += 1
    return success, failed if stats_only else errors

For bulk delete and update operations on an index, the corresponding document formats are shown below; for an update, the fields to change must be placed under the doc node of the document.

{
  '_op_type': 'delete',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
}
{
  '_op_type': 'update',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
  'doc': {'question': 'The life, universe and everything.'}
}
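
These action dictionaries are submitted through the same helpers.bulk call used for indexing (a minimal sketch; the index name, type and IDs are placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
actions = [
    {'_op_type': 'delete', '_index': 'index-name', '_type': 'document', '_id': 1},
    {'_op_type': 'update', '_index': 'index-name', '_type': 'document', '_id': 2,
     'doc': {'question': 'The life, universe and everything.'}},
]
# bulk() returns the number of successful actions and a list of error items
success, errors = helpers.bulk(es, actions)
print(success, errors)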

Common errors

The Python client surfaces these as exception classes (a handling sketch follows the list):

    • SerializationError: JSON data serialization error, usually because the data type of a field value is not supported
    • RequestError: the submitted data format is incorrect
    • ConflictError: index ID conflict
    • TransportError: the connection to the server could not be established
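
A minimal handling sketch, reusing the test-index example from above; the exception classes live in elasticsearch.exceptions:

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import (ConflictError, RequestError,
                                      SerializationError, TransportError)

es = Elasticsearch()
try:
    es.create(index="test-index", doc_type="test-type", id=1, body={"any": "data"})
except ConflictError:
    print("a document with this id already exists")
except (SerializationError, RequestError) as e:
    print("unsupported data type or bad request format:", e)
except TransportError as e:
    print("could not reach the server:", e)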

Performance

The above compares storing the same data in MongoDB and in Elasticsearch. Although the servers and operations are not identical, it can be seen that the database still has the advantage for bulk writes.

Elasticsearch index files are sharded automatically, and tens of millions of documents had no effect on write speed. However, when disk space ran out, Elasticsearch hit a file-merge error and lost a large amount of data (more than one million records). After client writes were stopped, the server could not recover automatically and had to be stopped manually. This is fatal in a production environment, especially with non-Java clients, where it seems impossible to receive the server-side Java exceptions on the client, so programmers have to pay very close attention to the server's return information.
