Tutorial on using Python to operate Elasticsearch data indexes

Elasticsearch is a distributed, RESTful search and analysis server. Like Apache Solr, it is an indexing server based on Lucene. However, I think Elasticsearch has the following advantages over Solr:

  • Lightweight: easy to install and start; a single command starts it once the download is unpacked;
  • Schema free: JSON objects of any structure can be submitted to the server, whereas Solr requires schema.xml to define the index structure;
  • Multi-index support: a new index file can be created with different index parameters, which in Solr requires separate configuration;
  • Distributed: Solr Cloud's configuration is complex.

Environment Setup

Start Elasticsearch. It listens on port 9200, and you can view the returned JSON data in a browser. Elasticsearch both accepts and returns data in JSON format.

>> bin/elasticsearch -f

Install the official Python client. After installing it on OS X, I ran into some Python runtime errors caused by a too-old setuptools version; deleting and reinstalling setuptools fixed them.

>> pip install elasticsearch
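
Once the client is installed, you can verify connectivity from Python. A minimal sketch, assuming a default local installation on localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # or Elasticsearch("ip") for a remote server

# ping() returns True when the cluster is reachable, False otherwise
print(es.ping())
# info() returns the same JSON the browser shows on port 9200
print(es.info())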

Index operation

To index a single document, call the create or index method.

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()  # create a localhost server connection, or Elasticsearch("ip")
es.create(index="test-index", doc_type="test-type", id=1,
          body={"any": "data", "timestamp": datetime.now()})

Batch indexing in Elasticsearch is done with the bulk command. The Python API documentation currently offers few examples of it, and it took a lot of time reading the source code to figure out the submission format for batch indexing.

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch("10.18.13.3")
j = 0
count = int(df[0].count())  # df holds the crawled ticket data, e.g. a pandas DataFrame
actions = []
while j < count:
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j + 1,
        "_source": {
            "crawaldate": df[0][j],
            "flight": df[1][j],
            "price": float(df[2][j]),
            "discount": float(df[3][j]),
            "date": df[4][j],
            "takeoff": df[5][j],
            "land": df[6][j],
            "source": df[7][j],
            "timestamp": datetime.now()}}
    actions.append(action)
    j += 1

    # flush the accumulated actions periodically
    if len(actions) == 500000:
        helpers.bulk(es, actions)
        del actions[0:len(actions)]

# submit whatever remains after the loop
if len(actions) > 0:
    helpers.bulk(es, actions)
    del actions[0:len(actions)]

Here I found that the Python API has limited data-type support when serializing JSON: the numpy.int32 values in my original data had to be converted to int before they could be indexed. In addition, the bulk operation submits 500 documents per request by default; when I raised that to 5000 or even 50000 for testing, indexing sometimes failed.
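
As an illustration of both points (the variable names here are hypothetical): NumPy scalars can be converted with int()/float() or .item() before they go into _source, and the chunk size can be passed to helpers.bulk explicitly:

import numpy as np

v = np.int32(42)
int(v)    # native Python int, safe to serialize
v.item()  # equivalent conversion

# chunk_size is forwarded to streaming_bulk(); 500 is the library default
helpers.bulk(es, actions, chunk_size=500)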

# helpers.py source code
def streaming_bulk(client, actions, chunk_size=500, raise_on_error=False,
    expand_action_callback=expand_action, **kwargs):
  actions = map(expand_action_callback, actions)

  # if raise on error is set, we need to collect errors per chunk before raising them
  errors = []

  while True:
    chunk = islice(actions, chunk_size)
    bulk_actions = []
    for action, data in chunk:
      bulk_actions.append(action)
      if data is not None:
        bulk_actions.append(data)

    if not bulk_actions:
      return

def bulk(client, actions, stats_only=False, **kwargs):
  success, failed = 0, 0

  # list of errors to be collected is not stats_only
  errors = []

  for ok, item in streaming_bulk(client, actions, **kwargs):
    # go through request-response pairs and detect failures
    if not ok:
      if not stats_only:
        errors.append(item)
      failed += 1
    else:
      success += 1

  return success, failed if stats_only else errors
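
As the source above shows, bulk returns a (success, errors) pair by default, or (success, failed) counters when stats_only=True. A sketch of reading the result:

# errors is a list of the failed items
success, errors = helpers.bulk(es, actions)

# with stats_only=True both values are plain counters
success, failed = helpers.bulk(es, actions, stats_only=True)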

For batch delete and update operations on the index, the corresponding action formats are shown below. Note that for an update, the new field values must be placed under the doc key of the action.

{
  '_op_type': 'delete',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
}
{
  '_op_type': 'update',
  '_index': 'index-name',
  '_type': 'document',
  '_id': 42,
  'doc': {'question': 'The life, universe and everything.'}
}
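
These action dictionaries are submitted through the same helpers.bulk call as the index actions above. A sketch, reusing the es connection and helpers import from the batch-indexing example:

actions = [
    {'_op_type': 'update', '_index': 'index-name', '_type': 'document', '_id': 42,
     'doc': {'question': 'The life, universe and everything.'}},
    {'_op_type': 'delete', '_index': 'index-name', '_type': 'document', '_id': 42},
]
helpers.bulk(es, actions)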

Common errors

  • SerializationError: JSON serialization failed, usually because the data type of a field value is not supported
  • RequestError: the submitted data format is incorrect
  • ConflictError: index ID conflict
  • TransportError: a connection to the server cannot be established
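
All four exception classes are exposed by the client package, so they can be caught explicitly. A sketch, reusing the es connection from above:

from elasticsearch import (SerializationError, RequestError,
                           ConflictError, TransportError)

try:
    es.create(index="test-index", doc_type="test-type", id=1,
              body={"any": "data"})
except ConflictError:
    print("document with this ID already exists")
except (SerializationError, RequestError) as e:
    print("bad document: %s" % e)
except TransportError as e:
    print("cannot reach the server: %s" % e)
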
Performance

I compared storing the same data in MongoDB and Elasticsearch. Although the servers and the operation methods are not the same, it is clear that for batch writing the database still has an advantage over the index server.

Elasticsearch's index files are sharded automatically, and reaching tens of millions of documents had no effect on write speed. However, when the disk space limit was reached, Elasticsearch hit a file-merge error and lost a large amount of data (more than one million records). After client writes stopped, the server could not recover on its own and had to be shut down manually. This is fatal in a production environment, especially since with a non-Java client the server's Java exceptions apparently cannot be seen on the client side, so the programmer must handle the server's responses very carefully.
