Overview
Elasticsearch (ES) is an open source search engine built on Lucene. It is stable, reliable, and fast, it scales well, and it was designed specifically for distributed environments.
Characteristics
- Easy to install: no other dependencies; it is ready to use right after download, and changing a few parameters is enough to build a cluster
- JSON: input and output are both JSON, so there is no need to define a schema up front, which makes it quick and easy to get started
- RESTful: almost all operations (indexing, querying, even configuration) can be performed through the HTTP interface
- Distributed: nodes are equivalent from the outside (any node can serve as the entry point), and the cluster rebalances automatically when a node joins
- Multi-tenancy: different indexes can be created for different purposes, and multiple indexes can be operated on at the same time
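To make the JSON and RESTful points concrete, here is a minimal sketch that builds (but does not send) an HTTP request that indexes one JSON document. The host and port, the index name "blog", the type "post", and the document fields are all assumptions chosen for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) an HTTP request that indexes one JSON document."""
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index "blog", type "post", and document fields.
request = build_index_request("blog", "post", "1", {"title": "Hello ES", "views": 0})
print(request.get_method(), request.full_url)
# To send it to a running node: urllib.request.urlopen(request)
```

No schema is declared anywhere in this sketch: the document is just a JSON object in the request body.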
Cluster
A node is one ES process, and multiple nodes form a cluster. In general, each node runs on a different machine, and once the cluster-related parameters are configured, ES forms the cluster automatically (the node discovery mode can also be configured). Within the cluster, the master node is chosen by ES's election algorithm (version 1.2, current at the time of writing, has a split-brain problem). From outside the cluster, any node can handle operations, with no distinction between master and non-master nodes (external equivalence, or decentralization), which simplifies client programming, for example reconnecting after a failure.
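A commonly recommended mitigation for split brain in ES 1.x is to require a majority of master-eligible nodes before a master can be elected, through the discovery.zen.minimum_master_nodes setting. The arithmetic behind that majority is sketched below. The helper name is hypothetical, and the code only computes the value; it does not configure a cluster.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible nodes -> quorum of", minimum_master_nodes(n))
```

With 3 master-eligible nodes the quorum is 2, so a single isolated node cannot elect itself master.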
Index
"Index" has two meanings:
- As a verb, it refers to the process of "saving" a document into ES; once a document has been indexed, we can use ES to search for it
- As a noun, it refers to the place where documents are saved, which is equivalent to a "database" in relational database terms
To make this easier to understand, we can map some ES concepts to the familiar relational database:
ES | Index | Type | Document
DB | Database | Table | Row
Sharding
ES is a distributed system, and we should use it as a cluster from the start. When it saves a document, it selects an appropriate "primary shard" and stores the document there (a shard can be thought of as a physical storage area). The number of primary shards is fixed: it must be set when the index is created (the default is 5) and cannot be changed afterwards.
Since there are primary shards, there are also secondary shards, which ES calls "replica shards". Replica shards serve two main purposes:
- High availability: when a node holding a shard goes down, ES can fail over to the replica shards on other nodes, and once the node recovers, its shard data can be restored from those nodes
- Load balancing: ES routes search requests automatically, and replica shards share the load
An example
Here's an example to summarize the above (in conjunction with the diagram below):
- 3 ES nodes (es-58/59/60) form a cluster
- The cluster was built with the default number of primary shards, 5: shard0 to shard4
- Two indexes have been added to the cluster: index1 and index2
- One document has been "indexed" (saved) into each of these two indexes
- The index1 document was automatically saved by ES to shard 2, with the primary shard on node es-58 and the replica shard on node es-59
- The index2 document was automatically saved by ES to shard 2, with the primary shard on node es-59 and the replica shard on node es-58
(This figure was obtained through the RESTful interface of ES; the commonly used interfaces are introduced later.)
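The shard placement described in this example can be written down as a small data structure. The sketch below is only a model for checking the two properties discussed above (a primary never shares a node with its own replica, and a copy of each shard survives a single node failure). The function names are hypothetical and nothing here talks to a real cluster.

```python
# Shard placement from the example: (index, shard number, role) -> node.
placement = {
    ("index1", 2, "primary"): "es-58",
    ("index1", 2, "replica"): "es-59",
    ("index2", 2, "primary"): "es-59",
    ("index2", 2, "replica"): "es-58",
}


def copies_on_distinct_nodes(placement):
    """Return True if no primary shard shares a node with its own replica."""
    for (index, shard, role), node in placement.items():
        if role == "primary" and placement.get((index, shard, "replica")) == node:
            return False
    return True


def surviving_copies(placement, failed_node):
    """List the shard copies that remain reachable after one node fails."""
    return sorted(key for key, node in placement.items() if node != failed_node)


print(copies_on_distinct_nodes(placement))
print(surviving_copies(placement, "es-58"))
```

If es-58 fails, the replica of index1 and the primary of index2 both remain on es-59, so both documents are still available.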
Multi-tenancy
Multi-tenancy in ES is simply a multi-index mechanism that serves different business uses, with one index per business (see here for a detailed definition and the purpose of multi-tenancy). We said earlier that an index can be understood as a database in a relational database system, so multiple indexes can be understood as one database system hosting several databases for different business uses.
In practice, we can isolate tenants' data by giving each tenant its own index, and each index can be configured separately (and so tuned for a particular tenant). This is useful in a typical multi-tenancy scenario: for example, if one of our multi-tenant applications needs to provide search, we can build the ES indexes per tenant so that each tenant searches for relevant content only within its own index.
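One way to picture index-per-tenant isolation is a small helper that derives an index name from a tenant identifier and builds search paths from it. The "tenant-" prefix and the sanitizing rule are hypothetical choices for this sketch. The facts it relies on are that index names must be lowercase and that several indexes can be searched at once by joining their names with commas.

```python
import re


def tenant_index(tenant_id):
    """Map a tenant identifier to its own index name (hypothetical naming scheme)."""
    name = re.sub(r"[^a-z0-9_-]", "_", tenant_id.lower())
    return f"tenant-{name}"


def tenant_search_path(tenant_id):
    """Search path restricted to a single tenant's index."""
    return f"/{tenant_index(tenant_id)}/_search"


def multi_tenant_search_path(tenant_ids):
    """Search path that spans several tenants' indexes at once."""
    names = ",".join(tenant_index(t) for t in tenant_ids)
    return f"/{names}/_search"


print(tenant_search_path("Acme Corp"))
print(multi_tenant_search_path(["a", "b"]))
```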
RESTful
This feature is very convenient. Most importantly, the HTTP interface of ES handles not only business operations (indexing and searching) but also configuration, and it can even shut down the ES cluster. Here are several commonly used interfaces:
- /_cat/nodes?v: check cluster status
- /_cat/shards?v: view shard status
- /${index}/${type}/_search: search
v stands for verbose, which makes the output more readable (a header row and aligned columns). _cat is the family of monitoring APIs, and /_cat?help lists all of its interfaces. ${index} and ${type} name a specific index and a type within that index, so the path is hierarchical. We can also search across all indexes and types directly: /_search.
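The interfaces listed above are plain URLs, so a client only has to assemble paths. The sketch below builds them with the standard library. The base URL is an assumption (a local node on the default port 9200), the helper names are hypothetical, and the q parameter is the URI search query string.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def cat_url(resource, verbose=True):
    """URL for a _cat monitoring endpoint such as nodes or shards."""
    query = "?v" if verbose else ""
    return f"{BASE}/_cat/{resource}{query}"


def search_url(index=None, doc_type=None, **params):
    """URL for a search, optionally narrowed to an index and then to a type."""
    if doc_type and not index:
        raise ValueError("a type can only be given together with an index")
    parts = [p for p in (index, doc_type) if p]
    path = "/".join(parts + ["_search"])
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE}/{path}{query}"


print(cat_url("nodes"))
print(cat_url("shards"))
print(search_url())
print(search_url("index1", "post", q="title:hello"))
```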
Official glossary
Finally, here is a translation of the official glossary to consolidate understanding:
Analysis
Analysis is the process of converting full text into terms. Depending on the analyzer used, the three phrases FOO BAR, Foo-Bar, and foo,bar will probably all be broken down into the terms foo and bar. These terms are what is actually stored in the index. A full-text query (not a term query) for FoO:bAR will also be analyzed into the terms foo and bar, and will therefore match the terms stored in the index. It is this analysis process (at both index time and search time) that allows ES to perform full-text queries.
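The effect of analysis can be imitated with a toy analyzer. This is not how any real ES analyzer is implemented; it simply lowercases the input and splits on every character that is not a letter or a digit, which is enough to show differently written phrases collapsing into the same terms.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


for phrase in ("FOO BAR", "Foo-Bar", "foo,bar", "FoO:bAR"):
    print(repr(phrase), "->", analyze(phrase))
```

Because the same function runs on the indexed text and on the query text, all four inputs end up as the terms foo and bar and therefore match each other.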
Cluster
One or more nodes that share the same cluster name make up a cluster. Each cluster automatically elects a single master node, and if the master node fails, the cluster automatically elects a new one to replace it.
Document
A document is a JSON document stored in ES, comparable to a row in a relational database table. Each document is stored in an index and has a type and an ID. A document is a JSON object (a hash, hashmap, or associative array in some languages) that contains zero or more fields (key-value pairs). The original JSON text is stored in the _source field when the document is indexed, and this field is returned by default when the document is fetched or found by a search.
ID
The ID identifies a document, and the index/type/id of a document must be unique. If no ID is specified, one is generated automatically.
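The two ways of obtaining an ID correspond to two request shapes: a PUT to a path that ends in the ID you chose, or a POST to the type with no ID, in which case ES generates one. A minimal sketch, with a hypothetical helper name and hypothetical index and type names:

```python
def index_endpoint(index, doc_type, doc_id=None):
    """Return the (HTTP method, path) used to index a document.

    With an explicit ID the document is addressed directly; without one,
    the request goes to the type and ES assigns the ID.
    """
    if doc_id is None:
        return "POST", f"/{index}/{doc_type}"
    return "PUT", f"/{index}/{doc_type}/{doc_id}"


print(index_endpoint("blog", "post", "42"))
print(index_endpoint("blog", "post"))
```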
Field
A document contains a list of fields, or key-value pairs. The value of a field can be a simple (scalar) value (such as a string, integer, or date) or a nested structure such as an array or an object. A field is similar to a column in a relational database table. The mapping for each field has a field type (not to be confused with the document type), which describes the type of data the field can hold, such as integer, string, or object. The mapping also lets us define how a field's value is analyzed.
Index
An index is like a database in a relational database, and it has a mapping that defines multiple types. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards.
Mapping
A mapping is similar to a schema definition in a relational database. Each index has a mapping, which defines each type within the index along with a number of index-wide settings. A mapping can be defined explicitly, or it is generated automatically when a document is indexed.
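An explicitly defined mapping is itself a JSON object. The sketch below shows what a 1.x-style mapping body might look like as a Python dictionary. The type name "post" and all field names are hypothetical, and the helper that reads a field type back out is only there to show the nesting.

```python
import json

# Hypothetical 1.x-style mapping for a type named "post".
mapping_body = {
    "mappings": {
        "post": {
            "properties": {
                "title": {"type": "string"},                       # analyzed full text
                "tag": {"type": "string", "index": "not_analyzed"},  # stored as an exact value
                "views": {"type": "integer"},
                "published": {"type": "date"},
            }
        }
    }
}


def field_type(body, doc_type, field):
    """Look up the declared field type for one field of one document type."""
    return body["mappings"][doc_type]["properties"][field]["type"]


print(json.dumps(mapping_body, indent=2))
print(field_type(mapping_body, "post", "views"))
```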
Node
A node is a running instance of ES that belongs to a cluster. For testing, multiple nodes can be started on the same server, but in production there is typically one node per server. At startup, a node uses unicast (or multicast) to discover a cluster whose name matches its configured cluster name, and tries to join that cluster.
Primary shard
Each document is stored in a single primary shard. When we index a document, it is indexed first on the primary shard and then on every replica of that shard. By default, an index has 5 primary shards. We can specify fewer or more primary shards to scale the number of documents the index can handle. Note that the number of primary shards cannot be changed once the index is created.
Replica shard
Each primary shard can have zero or more replica shards. A replica shard is a copy of the primary shard and serves two main purposes:
- Failover: when a primary shard fails, a replica shard is promoted to primary shard
- Improved performance: get and search requests can be handled by either the primary shard or a replica shard
By default, each primary shard has one replica shard, and the number of replica shards can be adjusted dynamically. A replica shard is never started on the same node as its primary shard.
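The difference between the two shard counts shows up in where they are set: the number of primary shards is part of the body sent when the index is created, while the number of replicas can be changed later through the index settings endpoint. The helper names below are hypothetical. The setting names number_of_shards and number_of_replicas are the real ones.

```python
def create_index_body(primary_shards=5, replicas=1):
    """Settings sent once, when the index is created."""
    return {
        "settings": {
            "number_of_shards": primary_shards,   # fixed after creation
            "number_of_replicas": replicas,       # can be changed later
        }
    }


def update_replicas_request(index, replicas):
    """(method, path, body) for changing the replica count of an existing index."""
    return "PUT", f"/{index}/_settings", {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primary_shards, replicas):
    """Every primary plus all of its replicas."""
    return primary_shards * (1 + replicas)


print(create_index_body())
print(update_replicas_request("index1", 2))
print(total_shard_copies(5, 1))
```

With the defaults (5 primary shards, 1 replica each) the cluster has to place 10 shard copies for the index.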
Routing
When we index a document, it is stored on a single primary shard, and that shard is chosen by hashing the routing value. By default, the routing value comes from the document ID or, if the document specifies a parent document, from the parent document's ID (this ensures that child and parent documents are stored on the same shard). The value can be overridden by specifying a routing value at index time or by configuring a routing field in the mapping.
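The routing rule can be sketched as "hash the routing value, then take it modulo the number of primary shards". The code below uses zlib.crc32 purely as a stand-in hash; it is not the hash function ES uses, so the shard numbers it prints will not match a real cluster. The function names are hypothetical.

```python
import zlib


def pick_shard(routing_value, primary_shards):
    """Stand-in for shard selection: hash(routing) modulo the primary shard count."""
    digest = zlib.crc32(routing_value.encode("utf-8"))  # NOT the hash ES uses
    return digest % primary_shards


def shard_for(doc_id, primary_shards, parent_id=None, routing=None):
    """Choose the routing value: explicit routing, else parent ID, else document ID."""
    value = routing or parent_id or doc_id
    return pick_shard(value, primary_shards)


parent = shard_for("parent-1", 5)
child = shard_for("child-7", 5, parent_id="parent-1")
print(parent, child)  # the child lands on the same shard as its parent
```

The modulo also shows why the number of primary shards is fixed: changing the divisor would send existing routing values to different shards.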
Shard
A shard is a single Lucene instance, the low-level "work unit" that ES manages. An index is a logical namespace that points to primary and replica shards. Apart from defining the number of primary and replica shards an index should have, application code never needs to refer to shards directly; it deals only with indexes. ES distributes shards across all nodes in the cluster and automatically moves shards between nodes when a node fails or a new node joins.
Source field
By default, the JSON document that was indexed is stored in the _source field and returned by all get and search requests. This lets us access the original data directly from the results, without a second request to retrieve it by ID. Note: the exact JSON string that was indexed is returned in full, even if it is not valid JSON. The contents of this field say nothing about how the data was indexed.
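To illustrate reading _source from search results, the sketch below parses a hand-written sample shaped like a search response and pulls out each hit's ID together with its original document. The sample values are made up for this example and the helper name is hypothetical.

```python
import json

# Hand-written sample shaped like a search response; the values are made up.
response_text = """
{
  "hits": {
    "total": 1,
    "hits": [
      {"_index": "blog", "_type": "post", "_id": "1", "_score": 1.0,
       "_source": {"title": "Hello ES", "views": 0}}
    ]
  }
}
"""


def sources(response):
    """Return (id, original document) pairs from a parsed search response."""
    return [(hit["_id"], hit["_source"]) for hit in response["hits"]["hits"]]


print(sources(json.loads(response_text)))
```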
Term
A term is an exact value that is indexed in ES. The terms foo, Foo, and FOO are not equivalent. Terms (that is, exact values) can be searched for using term queries.
Text
Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text is analyzed into terms, and the terms are what is actually stored in the index. For full-text search to work, text fields are analyzed into terms at index time and query keywords are analyzed into terms at search time; the search is then carried out by comparing the two sets of terms.
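The difference between searching for a term and searching for text can be shown with a toy inverted index. This is a simplified model, not the Lucene data structure: the analyzer just lowercases and splits on non-alphanumeric characters, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(docs):
    """Inverted index: term -> set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Exact lookup: the query value is NOT analyzed."""
    return sorted(index.get(term, set()))


def match_query(index, text):
    """Full-text lookup: analyze the query, then return documents containing any term."""
    ids = set()
    for term in analyze(text):
        ids |= index.get(term, set())
    return sorted(ids)


docs = {1: "Quick Brown Fox", 2: "The quick dog"}
index = build_index(docs)
print(term_query(index, "Quick"))      # no match: the stored term is lowercase
print(term_query(index, "quick"))
print(match_query(index, "BROWN dog"))
```

The term query for "Quick" finds nothing because only the lowercase term was stored, while the full-text query succeeds because the query text goes through the same analysis as the documents.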
Type
A type is similar to a table in a relational database. Each type has a list of fields that can be specified for documents of that type. The mapping defines how each field in the document is analyzed.
Reference
Elasticsearch Guide
Elasticsearch Glossary of terms
Elasticsearch of full-text search