Elasticsearch Version: 5.4
Elasticsearch QuickStart, Part 1: Getting Started with Elasticsearch
Elasticsearch QuickStart, Part 2: Elasticsearch and Kibana Installation
Elasticsearch QuickStart, Part 3: Elasticsearch Index and Document Operations
Elasticsearch QuickStart, Part 4: Elasticsearch Document Queries
Elasticsearch is a highly scalable, open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.
Here are a few typical use cases for Elasticsearch:
Suppose you run an online store and want customers to be able to search the products you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory, and to provide search and autocomplete suggestions.
Suppose you want to collect log or transaction data and analyze and mine it for trends, statistics, summaries, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed the data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to extract the information you are interested in.
Suppose you run a price alerting platform that lets price-savvy customers specify a rule such as "I am interested in buying a specific electronic gadget, and I want to be notified if any seller's price for it falls below $X within the next month". In this case, you can push the sellers' prices into Elasticsearch, use its reverse search (Percolator) capability to match price changes against the customers' queries, and notify each customer once a match is found.
Suppose you have analytics (business intelligence) needs and want to quickly investigate, analyze, visualize, and ask ad hoc questions of a large amount of data (think millions or billions of records). In this case, you can use Elasticsearch to store the data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that visualize the aspects of the data that matter to you. In addition, you can use the Elasticsearch aggregations feature to run complex business intelligence queries against your data.
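The reverse search idea from the price alert scenario can be illustrated without Elasticsearch at all: instead of storing documents and running a query against them, the customer rules are stored and each incoming price is run against the rules. The sketch below is plain Python and does not use the Percolator API; PriceAlert, matching_alerts, and the sample data are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceAlert:
    """A stored customer rule: notify when `product` drops below `max_price`."""
    customer: str
    product: str
    max_price: float


def matching_alerts(alerts, product, price):
    """Return the customers whose stored rule matches an incoming price."""
    return [a.customer for a in alerts
            if a.product == product and price < a.max_price]


# Hypothetical stored rules (the "queries" in a reverse search).
alerts = [
    PriceAlert("alice", "gadget", 100.0),
    PriceAlert("bob", "gadget", 80.0),
    PriceAlert("carol", "camera", 300.0),
]

# A seller submits a new price; the stored rules are matched against it.
notified = matching_alerts(alerts, "gadget", 90.0)
print(notified)  # ['alice']
```

Only the rule with a threshold above the submitted price matches, so only that customer would be notified.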
The rest of this tutorial guides you through getting Elasticsearch up and running, taking a first look at it, and performing basic operations such as indexing, searching, and modifying data. By the end, you should have a good idea of what Elasticsearch is and how it works, and hopefully you will be inspired to use it to build sophisticated search applications or to mine useful insights from your data.
Basic Concepts
A few concepts are at the heart of Elasticsearch. Understanding them from the outset makes everything that follows much easier to learn.
Near Real-Time (NRT)
Elasticsearch is a near real-time search platform. This means there is only a slight delay (typically one second) between the time a document is indexed and the time it becomes searchable.
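As a rough mental model of that delay, the toy class below buffers newly indexed documents and only exposes them to search after a refresh. It is a simplified sketch, not how Elasticsearch is implemented internally; ToyIndex and its methods are hypothetical names, and the refresh is triggered by hand instead of by a timer.

```python
class ToyIndex:
    """Toy model of near real-time search: writes are buffered until a refresh."""

    def __init__(self):
        self._buffer = []       # indexed, but not yet searchable
        self._searchable = []   # visible to search

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # In Elasticsearch a refresh runs periodically (about once per second
        # by default); in this sketch it is called explicitly.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [d for d in self._searchable if word in d.split()]


idx = ToyIndex()
idx.index("quick brown fox")
before = idx.search("fox")   # nothing yet: the document is still buffered
idx.refresh()
after = idx.search("fox")    # the document is visible after the refresh
print(before, after)
```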
Cluster
A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide indexing and search capabilities across all nodes. A cluster is identified by a unique name, which defaults to "elasticsearch". The name matters because a node can belong to only one cluster, and it joins that cluster by name.
Do not reuse the same cluster name in different environments, or nodes may end up joining the wrong cluster. For example, you could use the cluster names logging-dev, logging-stage, and logging-prod for the development, staging, and production environments, respectively.
Note that a cluster consisting of only a single node is perfectly valid. You can also run multiple independent clusters, each with its own unique cluster name.
Node
A node is a single server that is part of a cluster, stores data, and participates in the cluster's indexing and search capabilities. Like a cluster, a node is identified by a name, which by default is a random UUID (Universally Unique IDentifier) assigned to the node at startup. If you do not want the default, you can set any node name you like. The name is important for administration, because it lets you tell which server on your network corresponds to which node in the cluster.
A node joins a specific cluster through its configured cluster name. By default, every node is set up to join a cluster named "elasticsearch", which means that if you start several nodes on your network and they can discover each other, they will all automatically form and join a single cluster named "elasticsearch".
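Cluster and node names are set in each node's elasticsearch.yml through the cluster.name and node.name settings. The helper below is a small sketch that renders those two lines for a given environment; render_settings, the "logging-" prefix, and the node name "node-1" are illustrative assumptions, not part of Elasticsearch.

```python
def render_settings(environment, node_name):
    """Render the elasticsearch.yml lines that name a node and its cluster."""
    # One cluster name per environment keeps dev nodes out of the prod cluster.
    cluster_name = "logging-" + environment
    return "cluster.name: {}\nnode.name: {}\n".format(cluster_name, node_name)


print(render_settings("dev", "node-1"))
```

Two nodes rendered with the same environment get the same cluster.name and would therefore try to join the same cluster, provided they can reach each other on the network.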
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have one index for customer data, another for a product catalog, and yet another for order data. An index is identified by a name, which must be all lowercase, and this name is used to refer to the index when indexing, searching, updating, and deleting its documents. In a single cluster, you can define as many indexes as you want.
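The lowercase rule is easy to check before sending a request. The function below is a minimal sketch that tests only the rule stated above; is_valid_index_name is a hypothetical helper, and Elasticsearch applies further naming restrictions that are not covered here.

```python
def is_valid_index_name(name):
    """Check only the rule stated above: non-empty and all lowercase."""
    return bool(name) and name == name.lower()


for candidate in ["customer", "Customer", "product-catalog"]:
    print(candidate, is_valid_index_name(candidate))
```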
Type
Within an index, you can define one or more types. A type is a logical category or partition of the index whose semantics are entirely up to you. Typically, a type is defined for documents that share a common set of fields. For example, suppose you run a blogging platform and store all of its data in a single index. In that index, you could define one type for user data, another for blog data, and a third for comment data.
Document
A document is the basic unit of information that can be indexed. For example, you can have one document for a single customer, another for a single product, and yet another for a single order. Documents are expressed in JSON. You can store as many documents as you want in an index/type. Note that although a document physically resides in an index, it must also be indexed into (assigned to) a type within that index.
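To make the index/type/document relationship concrete, the sketch below builds the pieces of an indexing request in Elasticsearch 5.x, where a document is addressed as /index/type/id and its content is sent as JSON. No request is actually sent; document_request is a hypothetical helper, and the index name "customer", the type name "external", and the field values are sample assumptions.

```python
import json


def document_request(index, doc_type, doc_id, source):
    """Build the HTTP method, path, and JSON body for indexing one document."""
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    body = json.dumps(source, sort_keys=True)
    return "PUT", path, body


method, path, body = document_request(
    "customer", "external", 1, {"name": "John Doe"}
)
print(method, path)  # PUT /customer/external/1
print(body)          # {"name": "John Doe"}
```

The path shows that the document lives in the index "customer" and is assigned to the type "external" within it.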
Shards and Replicas
An index can potentially store more data than the hardware of a single node can hold. For example, an index of one billion documents taking up 1 TB of disk space may not fit on the disk of a single node, and even if it fits, a single node may be too slow to serve search requests on its own.
To solve this problem, Elasticsearch lets you subdivide an index into multiple pieces called shards. When you create an index, you simply define the number of shards you want. Each shard is itself a fully functional, independent "index" that can be hosted on any node in the cluster.
Sharding is important for two main reasons:
It allows you to split and scale your content volume horizontally.
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), which increases performance and throughput.
How shards are distributed, and how their documents are aggregated back into search responses, is managed entirely by Elasticsearch and is transparent to the user.
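Although this is handled for you, it can help to see the general idea behind distributing documents: a hash of a routing value (by default the document ID) modulo the number of primary shards selects the shard. The sketch below illustrates that hash-modulo scheme only; pick_shard is a hypothetical name, and zlib.crc32 is a stand-in hash chosen for this example, not the hash function Elasticsearch uses.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Map a document ID to a shard number with hash-modulo routing."""
    # zlib.crc32 is a stand-in hash used for this sketch only.
    digest = zlib.crc32(str(doc_id).encode("utf-8"))
    return digest % number_of_primary_shards


# The same ID always maps to the same shard for a fixed shard count.
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in ["1", "2", "3", "4"]}
print(placement)
```

Because the result depends on the shard count, the same ID would generally map to a different shard if that count changed, which is one way to see why the number of primary shards is fixed once an index exists.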
In a network or cloud environment, where failures can be expected at any time, it is very useful and strongly recommended to have a failover mechanism in case a shard or node goes offline or disappears. To this end, Elasticsearch lets you make one or more copies of an index's shards, called replica shards, or replicas for short.
Replicas are important for two main reasons:
They provide high availability if a shard or node fails. For this reason, a replica shard is never allocated on the same node as the original/primary shard it was copied from.
They let you scale out search volume and throughput, because searches can be executed on all replicas in parallel.
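The rule that a replica never shares a node with its primary can be sketched with a simple round-robin placement. This is a toy allocator under stated assumptions (one replica per shard, at least two nodes); allocate and the node names are hypothetical, and the real allocation logic in Elasticsearch is considerably more involved.

```python
def allocate(number_of_shards, nodes):
    """Place each primary and its single replica on two different nodes."""
    if len(nodes) < 2:
        raise ValueError("a replica needs a second node")
    layout = {}
    for shard in range(number_of_shards):
        primary = nodes[shard % len(nodes)]
        # The next node in the ring is always a different node.
        replica = nodes[(shard + 1) % len(nodes)]
        layout[shard] = (primary, replica)
    return layout


layout = allocate(5, ["node-1", "node-2"])
for shard, (primary, replica) in layout.items():
    print(shard, primary, replica)
```

If either node fails in this layout, every shard still has one copy on the surviving node.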
In summary, each index can be split into multiple shards. An index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the original shards that were copied from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index when the index is created. After the index is created, you can change the number of replicas dynamically at any time, but you cannot change the number of shards afterwards.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (one complete replica), for a total of 10 shards per index.
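The arithmetic behind that total, and the settings that control it, can be sketched as follows. number_of_shards and number_of_replicas are the index settings supplied when creating an index; total_shards and index_settings are hypothetical helper names, and the block only builds the request body without sending it.

```python
import json


def total_shards(primaries, replicas):
    """Each replica is a full copy of every primary shard."""
    return primaries * (1 + replicas)


def index_settings(primaries, replicas):
    """Build a create-index request body with explicit shard settings."""
    return {"settings": {"number_of_shards": primaries,
                         "number_of_replicas": replicas}}


print(total_shards(5, 1))  # 10
print(json.dumps(index_settings(5, 1), sort_keys=True))
```

With 5 primaries and 1 replica the total is 5 * (1 + 1) = 10 shards; with no replicas it would be 5.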
Each Elasticsearch shard is a Lucene index. A single Lucene index can hold only a limited number of documents: as of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
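The limit follows directly from Java's 32-bit signed integer maximum. The short check below reproduces the number; the constant and function names are hypothetical and exist only for this illustration.

```python
JAVA_INTEGER_MAX_VALUE = 2**31 - 1                  # 2,147,483,647
MAX_DOCS_PER_SHARD = JAVA_INTEGER_MAX_VALUE - 128   # limit per Lucene index


def fits_in_one_shard(document_count):
    """Return True if the count stays within the per-shard document limit."""
    return document_count <= MAX_DOCS_PER_SHARD


print(MAX_DOCS_PER_SHARD)  # 2147483519
```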
Summary
1. Why not use a relational database for search? Because search implemented on top of a database performs very poorly and cannot search by individual words (terms).
2. What are full-text search, the inverted index, and Lucene? Others have already summarized this well; please refer to the "hands-on teaching you full text search" Apache Lucene article.
3. Characteristics of Elasticsearch
It can run as a distributed cluster and process massive amounts of data in near real time;
It is easy to use out of the box, as long as the amount of data is small and the operations are not too complicated;
It offers features that relational databases lack, such as full-text search, synonym handling, relevance ranking, complex data analysis, and near real-time processing of massive data;
It is based on Lucene, hides its complexity, and provides an easy-to-use RESTful API and a Java API.
4. Core concepts of Elasticsearch
- Cluster: A cluster contains multiple nodes. Which cluster a node belongs to is determined by configuration (the default cluster name is "elasticsearch").
- Node: By default, a node automatically joins the cluster named "elasticsearch". Each running Elasticsearch service is one node, so a machine that starts two ES services hosts two nodes.
- Index: Roughly equivalent to a database in MySQL. An index contains a set of documents with a similar structure.
- Type: Roughly equivalent to a table in MySQL. A type is a logical classification of the data within an index.
- Document: Roughly equivalent to a row in a MySQL table. A document is the smallest unit of data in ES.
- Shard: A single machine cannot store an arbitrarily large amount of data, so ES can split the data of one index into multiple shards distributed across multiple servers.
- Replica: A copy of a shard that prevents data loss when a node goes down and takes its shards with it. Because a replica cannot live on the same node as its primary, the smallest high-availability configuration is 2 servers.
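The inverted index mentioned in point 2 of the summary is what makes word-level search fast: instead of scanning every row for a substring, each word maps directly to the documents that contain it. The sketch below is a deliberately simplified illustration (whitespace tokenization, lowercase only), not how Lucene implements it; build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
    3: "MySQL is a relational database",
}
index = build_inverted_index(docs)

# Looking up a word is a single dictionary access, not a scan of all documents.
print(sorted(index["search"]))  # [1, 2]
```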