[Repost] Distributed search with Elasticsearch: an analysis of several concepts

Source: Internet
Author: User

Document

In the Elasticsearch world (or the Lucene world), the document is the main entity, and the word has a specific meaning. It refers to the top-level, or root, object that is serialized to JSON and stored in Elasticsearch under a unique ID. Elasticsearch documents are ultimately stored as Lucene documents.

Document Metadata

A document contains more than just data. It also contains metadata: information about the document. Three metadata elements must be present:

Name      Description
_index    Where the document is stored
_type     The type of object the document represents
_id       The unique identifier of the document
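As a rough illustration, the three metadata elements together address exactly one document. The sketch below models a stored document as a plain Python dictionary, in the shape Elasticsearch returns it. The index name `website`, the type `blog`, the ID `123`, and the field values are invented for this example, and `doc_address` is a hypothetical helper, not part of any client library.

```python
import json

# A stored document: metadata fields start with an underscore, and the
# original JSON body lives under "_source". All values are invented.
stored_document = {
    "_index": "website",   # where the document is stored
    "_type": "blog",       # the type of object the document represents
    "_id": "123",          # the unique identifier of the document
    "_source": {
        "title": "My first blog entry",
        "text": "Just trying this out...",
    },
}


def doc_address(doc):
    """Build the REST-style path /index/type/id that addresses a document."""
    return "/{_index}/{_type}/{_id}".format(**doc)


print(doc_address(stored_document))
print(json.dumps(stored_document["_source"], indent=2))
```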

Mapping

Elasticsearch can automatically detect a field's type by looking at its value, but sometimes (in fact, almost always) we will want to configure the mappings ourselves to avoid unpleasant surprises.
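To make the "unpleasant surprises" concrete, here is a minimal sketch of value-based type guessing. It is a simplified imitation written for this article, not Elasticsearch's actual detection logic; `guess_field_type`, the date pattern, and the sample document are all assumptions.

```python
import re

# Matches dates such as 2014-01-01 (a deliberate simplification).
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def guess_field_type(value):
    """Guess a field type from a JSON value, loosely imitating dynamic mapping."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"


document = {
    "age": 42,
    "price": "19.99",        # a number sent as a string
    "joined": "2014-01-01",
    "active": True,
    "rating": 4.5,
}
guessed = {field: guess_field_type(value) for field, value in document.items()}
print(guessed)
```

In this sketch `price` is guessed as a string because the first document happened to send it in quotes, which is the kind of surprise an explicit mapping avoids.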

Type

Every document in Elasticsearch has a type. This allows us to store various document types in one index and to have different mappings for different document types.

In a program, we use objects to represent "things" such as a user, a blog post, a comment, or an email. Each object belongs to a class that defines the properties of the object, or the data associated with it. An object of the User class might have a name, gender, age, and email address.

In a traditional database, we always store homogeneous data in the same table because the records share the same format. Similarly, in Elasticsearch, we use documents of the same type to represent the same kind of "thing", because they share the same data structure.

Each type has its own mapping, or schema definition, which defines the data structure of documents of that type, similar to the columns of a database table. Documents of all types can be stored in the same index, and the mapping of each type tells Elasticsearch how its data should be indexed.
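The table analogy can be sketched as an index-creation body holding one mapping per type. This example assumes an older Elasticsearch version that still allows several types per index (before 6.0) and uses the pre-5.0 `string` datatype with `not_analyzed`; the type names `user` and `blogpost` and all field names are invented, and `fields_of` is a hypothetical helper.

```python
import json

# One index, two types, each with its own mapping (its "columns").
index_body = {
    "mappings": {
        "user": {
            "properties": {
                "name": {"type": "string"},
                "gender": {"type": "string", "index": "not_analyzed"},
                "age": {"type": "integer"},
                "email": {"type": "string", "index": "not_analyzed"},
            }
        },
        "blogpost": {
            "properties": {
                "title": {"type": "string"},
                "body": {"type": "string"},
                "published": {"type": "date"},
            }
        },
    }
}


def fields_of(body, type_name):
    """Return {field: datatype} for one type, like the columns of a table."""
    properties = body["mappings"][type_name]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(json.dumps(index_body, indent=2))
print(fields_of(index_body, "user"))
```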

Cluster
A cluster consists of multiple nodes, one of which is the master node. The master node is chosen by election, and the distinction between master and non-master nodes matters only inside the cluster. One of the concepts of ES is decentralization: taken literally, there is no central node. This describes the view from outside the cluster: seen from outside, the ES cluster is a single logical whole, and communicating with any one node is equivalent to communicating with the entire cluster.

Node
A node is a running instance of Elasticsearch. For testing, multiple nodes can be started on the same server, but usually a server runs only one node. At startup, a node uses unicast (or multicast, if so configured) to discover an existing cluster and attempts to join it.

Shard
A shard is a slice of an index. ES can divide a complete index into multiple shards; the advantage is that a large index can be split up and distributed across different nodes, which makes distributed search possible. The number of primary shards must be specified when the index is created and cannot be changed afterward. A shard is a single Lucene instance, a low-level unit of work that Elasticsearch manages automatically. An index is a logical namespace that points to primary and replica shards. Apart from defining the number of primary and replica shards, you never need to refer to shards directly; your code deals only with the index. Elasticsearch distributes the shards across all nodes in the cluster and relocates them automatically when nodes are added.

Primary Shard
Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard and then on all replicas of that primary shard.
By default, each index has 5 primary shards (this was the default before Elasticsearch 7.0, which changed it to 1). You can specify fewer or more primary shards to scale the number of documents that your index can handle.
After the index is created, you cannot change the number of primary shards in the index.
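The shard counts are set in the index settings. The sketch below uses the real setting names `number_of_shards` and `number_of_replicas`; the value of 1 replica per primary is an assumption chosen for the example, and `total_shard_copies` is a hypothetical helper that only does the arithmetic.

```python
# Settings body for creating an index.
index_settings = {
    "settings": {
        "number_of_shards": 5,     # primary shards: fixed once the index exists
        "number_of_replicas": 1,   # replica copies per primary: can be changed later
    }
}


def total_shard_copies(body):
    """Number of shards the cluster must host: primaries plus all replicas."""
    primaries = body["settings"]["number_of_shards"]
    replicas = body["settings"]["number_of_replicas"]
    return primaries * (1 + replicas)


print(total_shard_copies(index_settings))
```

With 5 primaries and 1 replica each, the cluster hosts 10 shards in total for this index.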

Replicas
A replica is a copy of an index shard, and ES can keep multiple replicas of each shard. Replicas serve two purposes. The first is to improve fault tolerance: when a shard on some node is corrupted or lost, it can be recovered from a replica. The second is to improve query efficiency: ES automatically load-balances search requests across the copies.
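Both roles of a replica can be shown with a toy model: searches rotate over all copies of a shard, and a copy on a failed node is skipped. This is a sketch under simple assumptions (plain round-robin, node names invented), not the selection logic Elasticsearch actually uses; `ShardCopies` is a hypothetical class.

```python
import itertools


class ShardCopies:
    """One primary shard plus its replicas, identified by the node holding each copy."""

    def __init__(self, primary_node, replica_nodes):
        self.nodes = [primary_node] + list(replica_nodes)
        self.failed = set()
        self._cycle = itertools.cycle(self.nodes)

    def mark_failed(self, node):
        """Record that the copy on this node is lost or corrupted."""
        self.failed.add(node)

    def pick_for_search(self):
        """Round-robin over the copies, skipping copies on failed nodes."""
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node not in self.failed:
                return node
        raise RuntimeError("no live copy of this shard")


shard0 = ShardCopies("node-1", ["node-2", "node-3"])
print([shard0.pick_for_search() for _ in range(4)])   # load is spread over all copies
shard0.mark_failed("node-2")
print([shard0.pick_for_search() for _ in range(4)])   # the failed copy is never chosen
```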

Routing

When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default, the routing value is derived from the document ID or, if the document has a specified parent document, from the ID of the parent document (this ensures that child and parent documents are stored on the same shard). The routing value can be overridden by specifying it at index time, or by a routing field in the mapping.
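The selection rule amounts to hashing the routing value and taking the remainder modulo the number of primary shards. The sketch below assumes `zlib.crc32` as a stand-in for Elasticsearch's internal hash function, and `shard_for` and the document IDs are hypothetical names for this example.

```python
import zlib


def shard_for(doc_id, num_primary_shards=5, routing=None, parent_id=None):
    """Pick a primary shard: hash(routing value) modulo the number of primary shards."""
    # Routing value: explicit routing, else the parent's ID, else the document's own ID.
    routing_value = routing or parent_id or doc_id
    # crc32 is only a stand-in for the hash Elasticsearch uses internally.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# A child routed by its parent's ID lands on the same shard as the parent.
print(shard_for("blog-1") == shard_for("comment-7", parent_id="blog-1"))

# Changing the number of primary shards changes the result of the modulo,
# so existing documents would no longer be found where they were stored.
ids = ["doc-{}".format(i) for i in range(1000)]
moved = [i for i in ids if shard_for(i, 5) != shard_for(i, 6)]
print(len(moved), "of", len(ids), "documents would map to a different shard")
```

The second part illustrates why the number of primary shards cannot be changed after the index is created.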

Recovery
Recovery refers to data recovery, or the redistribution of data. When a node joins or leaves the cluster, ES redistributes the index shards according to the load on each machine; data recovery also takes place when a failed node restarts.
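Redistribution after a node joins can be sketched with a toy balancer that evens out the number of shards per node. It is only an illustration: it balances by shard count alone, whereas the real allocator weighs more factors, and `rebalance` and the node and shard names are hypothetical.

```python
def rebalance(assignment, new_node):
    """Move shards from the fullest nodes to new_node until the counts are even.

    assignment maps node name -> list of shard names. It is modified in place,
    and the list of (shard, from_node, to_node) moves is returned.
    """
    assignment[new_node] = []
    moves = []
    while True:
        fullest = max(assignment, key=lambda node: len(assignment[node]))
        # Stop once no node holds more than one shard above the new node.
        if len(assignment[fullest]) - len(assignment[new_node]) <= 1:
            break
        shard = assignment[fullest].pop()
        assignment[new_node].append(shard)
        moves.append((shard, fullest, new_node))
    return moves


cluster = {
    "node-1": ["shard-0", "shard-1", "shard-2"],
    "node-2": ["shard-3", "shard-4", "shard-5"],
}
for shard, source, target in rebalance(cluster, "node-3"):
    print("relocating", shard, "from", source, "to", target)
print({node: len(shards) for node, shards in cluster.items()})
```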

River
A river is a data source for ES, and also a way of synchronizing data from other storage systems (such as databases) into ES. It is an ES service that runs as a plug-in: it reads data from the river's source and indexes it into ES. The official rivers cover CouchDB, RabbitMQ, Twitter, and Wikipedia. The river feature will be covered in more detail in a later article.

Gateway
The gateway is the persistent storage for ES indexes. By default, ES first holds the index in memory and persists it to disk when memory is full. When the ES cluster is shut down and restarted, the index data is read back from the gateway. ES supports several gateway types: the local file system (the default), a distributed file system, Hadoop HDFS, and Amazon's S3 cloud storage service.

Discovery.zen
Zen discovery is ES's mechanism for discovering nodes automatically. ES is a peer-to-peer system: it first looks for existing nodes by broadcasting, then uses a multicast protocol for communication between nodes, and it also supports point-to-point interaction.

Transport
The transport is the way ES nodes interact with each other inside the cluster, or with clients. By default, nodes communicate over TCP; ES also supports other transports (integrated via plug-ins), including HTTP (with JSON), Thrift, Servlet, Memcached, and ZeroMQ.
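For the HTTP/JSON case, a client request can be sketched with Python's standard library alone. The example only builds the request and does not send it, so it runs without a cluster. Port 9200 is the default HTTP port; the host, index, type, ID, and document are invented, and `build_index_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_index_request(host, index, doc_type, doc_id, document, port=9200):
    """Build (but do not send) an HTTP request that would index one document."""
    url = "http://{}:{}/{}/{}/{}".format(host, port, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "localhost", "website", "blog", "123", {"title": "My first blog entry"}
)
print(request.get_method(), request.full_url)
# Sending it requires a running node: urllib.request.urlopen(request)
```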
