Elasticsearch is an open-source system, built on Java and Lucene, that combines search-engine and NoSQL-database features and has risen rapidly over the last two years. Having studied it recently, I feel that its architecture and the way its open-source ecosystem has been built offer a lot to learn from, so I have organized my notes into this article to share. The code and architecture analysis in this article is based on the latest stable release line, Elasticsearch 2.x.
As the name suggests, Elasticsearch is an elastic search engine. "Elastic" implies distribution: a single machine can only be scaled up so far, so an elastic scale-out mechanism is added, and that is the meaning of "Elastic" here. The search and storage functionality is provided mainly by Lucene, which acts as the storage engine; Elasticsearch wraps it with indexing, query, and distribution-related interfaces.
Key concepts in Elasticsearch
Cluster: a group of nodes that share the same cluster name.
Node: a single Elasticsearch instance in a cluster.
Index: equivalent to the database concept in a relational database; a cluster can contain multiple indexes. This is a logical concept.
Primary shard: an index can be split into multiple shards, distributed across different nodes in the cluster; each primary shard holds a subset of the index's data and corresponds to one Lucene index.
Replica shard: each primary shard can have zero or more replicas.
Type: equivalent to the table concept in a database; mappings are defined per type, and an index can contain multiple types.
Mapping: equivalent to the schema in a database, used to constrain the types of fields; Elasticsearch can also create a mapping automatically from the data.
Document: equivalent to a row in a database.
Field: equivalent to a column in a database.
Allocation: the process of assigning a shard to a node, for either a primary shard or a replica; for a replica it also includes copying data from the primary shard.
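To make the concepts concrete, here is a sketch of how shards, replicas, types, mappings, and fields appear together in a 2.x index-creation request body, written as a plain Python dict; the index, type, and field names are hypothetical.

```python
# Hypothetical body for "PUT /my_index", illustrating the concepts above;
# the names "user", "name", and "age" are made up for this sketch.
create_index_body = {
    "settings": {
        "number_of_shards": 3,     # primary shards; each shard is one Lucene index
        "number_of_replicas": 1,   # replica shards per primary shard
    },
    "mappings": {                  # a mapping constrains field types, per type
        "user": {                  # type: analogous to a table
            "properties": {        # fields: analogous to columns
                "name": {"type": "string"},
                "age": {"type": "integer"},
            }
        }
    },
}
```

Each document indexed into this index's `user` type is then a row-like JSON object whose fields are checked against the mapping.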
Search engine features
In addition to Lucene's own search functionality, Elasticsearch adds several extensions on top of it.
1. Scripting support. Elasticsearch extends Lucene's scoring mechanism so that complex custom scoring algorithms are easy to implement. By default, only sandboxed scripting languages (such as Lucene expressions and mustache) are enabled; Groovy scripting must be enabled explicitly. Groovy's security mechanism controls permissions through a class whitelist set via java.security.AccessControlContext, while the 1.x versions used a whitelist filter of their own; a flaw in that restriction policy led to a remote code execution vulnerability.
2. The _all field. By default, Elasticsearch generates an _all field that concatenates the values of all other fields. This allows you to search without specifying a field and makes cross-field retrieval easy.
3. Suggesters. Through an extended indexing mechanism, Elasticsearch can implement auto-complete suggestions like Google's, as well as spelling-correction suggestions for search terms.
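As a sketch of the first and third extensions, here are hypothetical 2.x request bodies: a function_score query whose score is adjusted by a sandboxed Lucene-expression script, and a completion-suggester request. Field names such as `popularity` and `title_suggest` are assumptions for this example.

```python
# 1. Custom scoring: a function_score query with a script_score clause,
#    using the sandboxed Lucene expression language (enabled by default).
custom_score_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "elasticsearch"}},
            "script_score": {
                "script": {
                    "lang": "expression",                        # sandboxed language
                    "inline": "_score * doc['popularity'].value",  # boost by a numeric field
                }
            },
        }
    }
}

# 3. Auto-completion: a 2.x "_suggest" request against a field of type
#    "completion" (the field must be declared as such in the mapping).
suggest_request = {
    "title-suggest": {
        "text": "elas",                            # the user's partial input
        "completion": {"field": "title_suggest"},  # the completion field to query
    }
}
```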
NoSQL Database
Elasticsearch can be used as a database, relying primarily on the following features:
The original document is stored in the index by default and can be retrieved; this relies mainly on Lucene's stored-fields functionality.
The translog provides real-time read capability and full data durability (data is not lost even if the process exits abnormally). Because Lucene buffers documents in the IndexWriter, data in that buffer would be lost if the process crashed, so Elasticsearch uses the translog to guarantee that data is not lost. When a document is read directly by ID, Elasticsearch first tries to read it from the translog before falling back to the index; even data that has not yet been flushed from the buffer to the index can therefore be read in real time. By default, Elasticsearch fsyncs the translog on every write request, with an additional scheduled check (every 5 seconds by default). If the business scenario requires higher write throughput, the translog-related settings can be tuned.
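For example, a settings-update body along these lines (a sketch; the values are illustrative, not recommendations) trades per-request durability for throughput by moving the fsync to a background interval:

```python
# Hypothetical "PUT /my_index/_settings" body using the 2.x translog options.
translog_settings = {
    "index": {
        "translog": {
            "durability": "async",   # fsync on a timer instead of per request
            "sync_interval": "5s",   # how often the background fsync runs
        }
    }
}
```

With async durability, writes acknowledged within the last sync interval can be lost on a crash, so this trade-off is only appropriate when the data can be replayed from elsewhere.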
Powerful aggregations: Kibana, in the Elasticsearch ecosystem, relies mainly on aggregations to implement data analysis and visualization.
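A sketch of the kind of aggregation request Kibana issues: bucket documents per hour, then count the top values of a status field inside each bucket (the field names are assumptions for this example):

```python
# Hypothetical "_search" body: aggregation results only, no hits returned.
agg_request = {
    "size": 0,                         # skip returning matching documents
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "hour"},
            "aggs": {                  # sub-aggregation inside each hour bucket
                "top_status": {"terms": {"field": "status", "size": 5}}
            },
        }
    },
}
```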
Typical application scenario one: cloud analytics business
Solution:
Set the number of shards for each index individually based on its size, and make full use of types to merge small indexes.
Store all fields except analyzed (tokenized) fields as doc values.
Deploy master nodes, data nodes, and client nodes separately.
Set the fielddata memory footprint and other memory usage limits conservatively.
Set an expiry time for fielddata.
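The doc-values point above can be sketched as a type mapping (field names hypothetical) in which only the full-text field stays analyzed and everything else is stored on disk as doc values, which in 2.x is the default for not_analyzed fields:

```python
# Hypothetical type mapping: doc values for everything except the analyzed
# full-text field, keeping aggregations off the fielddata heap.
report_mapping = {
    "report": {
        "properties": {
            "message": {"type": "string"},   # analyzed for search; no doc values
            "status": {
                "type": "string",
                "index": "not_analyzed",
                "doc_values": True,          # aggregate from disk, not from heap
            },
            "bytes": {"type": "long", "doc_values": True},
        }
    }
}
```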
Typical application scenario two: logging business
Solution:
Use dynamic mapping to handle unknown fields automatically.
Distribute bulk imports across all nodes.
Store all fields as doc values to reduce memory consumption.
Use templates to create day-level and hour-level indexes automatically.
Group nodes into SSD and SATA tiers, and migrate cold data automatically on a schedule.
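The template point can be sketched as a 2.x index template (names and values are illustrative) that matches day-level index names and applies settings and mappings automatically as each day's index is created:

```python
# Hypothetical "PUT /_template/logs" body for day-level indexes such as
# "logs-2016.01.01"; every new index matching the pattern inherits it.
log_template = {
    "template": "logs-*",                   # applies to any matching index name
    "settings": {"number_of_shards": 5},
    "mappings": {
        "_default_": {                      # defaults for every type in the index
            "_all": {"enabled": False},     # logs are queried per field; save space
            "dynamic_templates": [{
                "strings_not_analyzed": {   # dynamic mapping rule for new string fields
                    "match_mapping_type": "string",
                    "mapping": {"index": "not_analyzed", "doc_values": True},
                }
            }],
        }
    },
}
```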
That concludes this brief introduction to Elasticsearch.