[Database] Rambling on Elasticsearch: a few things to know about ES performance tuning (repost)
Elasticsearch is currently at the frontier of Big Data engines. A common combination is ES + Logstash + Kibana as a mature log-analysis stack, in which Logstash is the ETL tool and Kibana is the data analysis and visualization platform. What makes ES impressive is its powerful search capability and its disaster-recovery strategy. ES also exposes a number of interfaces so developers can build their own plug-ins; combining ES with a Chinese word-segmentation plug-in, for example, greatly boosts its search and analysis power. Elasticsearch indexes and searches with the open-source full-text search library Apache Lucene, so any discussion of the architecture inevitably has to deal with some Lucene details.
About Lucene:
Apache Lucene organizes all the information written to it into an inverted index, a data structure that maps terms to the documents that contain them. It works differently from a traditional relational database: the inverted index is term-oriented rather than document-oriented. A Lucene index also stores a lot of other information, such as term vectors. Each Lucene index is composed of multiple segments; each segment is created only once but queried many times, and once created it is never modified. Segments are combined during the merge phase, whose timing is decided by Lucene's internal mechanism; after merging, the number of segments decreases but the individual segments grow larger. Segment merging is very I/O-intensive, and during the merge information that is no longer needed is cleaned out. In Lucene, the process of turning data into an inverted index, converting a complete string into terms that can be searched, is called analysis. Text analysis is performed by an analyzer, which is composed of a tokenizer, filters, and character mappers, whose roles are largely self-explanatory. In addition, Lucene has its own query language to help us search, read, and write.
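To make the idea concrete, here is a tiny, self-contained Python sketch (not Lucene code, just an illustration of the two steps described above): a toy analysis chain, a whitespace tokenizer plus a lowercase filter and a stop-word filter, feeding an inverted index that maps each term to the documents containing it.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of"}

def analyze(text):
    """Toy analysis chain: tokenize on whitespace, lowercase, drop stop words."""
    return [tok.lower() for tok in text.split() if tok.lower() not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "The quick brown fox", 2: "A quick brown dog", 3: "The lazy dog"}
index = build_inverted_index(docs)
print(index["quick"])  # {1, 2}
print(index["dog"])    # {2, 3}
```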
[Note] The "index" in ES refers to a field in the URI used for querying/addressing, e.g. [host]:[port (9200)]/[index]/[type]/[id]?[option], whereas an index in Lucene corresponds more closely to the concept of a shard in ES.
Back to Elasticsearch. The ES architecture follows a design philosophy with the following characteristics:
1. Reasonable defaults: simply modify the YAML configuration file on a node and it can be configured quickly, much like the simplified configuration in Spring 4.
2. Distributed working mode: ES's powerful Zen discovery mechanism supports not only multicast but also unicast, and embodies the nice idea that every node has a view of the whole cluster.
3. Peer architecture: shards are automatically replicated between nodes, and a shard and its replicas are kept as "far apart" as possible to avoid a single point of failure; the master node is almost completely equivalent to a data node.
4. Easy cluster expansion: the work that development or operations must do to add new nodes to the cluster is greatly simplified.
5. No restrictions on the data structures in an index: ES supports multiple data types within a single index.
6. Near real-time: search and versioning stay in sync. Because ES is a distributed application, a major challenge is consistency, for both index metadata and document data, yet ES's performance remains excellent (see the sketch after this list).
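As a concrete illustration of points 5 and 6, here is a minimal sketch assuming the official elasticsearch-py client and a locally reachable cluster (the original post contains no code, and the exact client API differs between versions): documents with different fields go into the same index without a predefined schema, and every write returns a version number.

```python
from elasticsearch import Elasticsearch

# Host is a placeholder; the client only needs one reachable node of the cluster.
es = Elasticsearch(["http://localhost:9200"])

# No mapping declared up front: two documents with different fields in the same index.
es.index(index="events", id=1, body={"msg": "user login", "user": "alice"})
es.index(index="events", id=2, body={"latency_ms": 42, "endpoint": "/search"})

# Each write bumps the document version, which ES uses for optimistic concurrency.
doc = es.get(index="events", id=1)
print(doc["_version"], doc["_source"])
```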
(i) Sharding policy
Select an appropriate number of shards and replicas. ES shards come in two kinds: primary shards and replicas. By default, ES creates five shards for each index, even in a single-node environment. This is called over-allocation; it does not seem necessary at that point and it adds complexity to distributing documents across shards and to processing queries, but fortunately the excellent performance of ES masks the cost. If an index consists of a single shard, ES cannot split the index into multiple parts once its size exceeds the capacity of a single node, so you must specify the number of shards you need when you create the index. All you can do later is create a new index and give it more shards in its initial settings. Conversely, over-allocation increases the complexity of merging the per-shard query results in Lucene and therefore the query latency, so we arrive at the following conclusion:
We should use the fewest shards!
The following relationship holds between the number of primary shards, the maximum number of replicas, and the number of nodes (a worked example and a short sketch follow below):
Number of nodes <= number of primary shards * (number of replicas + 1)
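For example, with 3 primary shards and 1 replica there are 3 * (1 + 1) = 6 shard copies in total, so up to 6 nodes can each host at least one copy; a seventh node would hold nothing for this index. A minimal sketch of fixing these numbers at creation time, again assuming the elasticsearch-py client (index name and address are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# 3 primaries * (1 replica + 1) = 6 shard copies: enough to keep up to 6 nodes busy.
es.indices.create(
    index="tweets",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
```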
Control shard allocation behavior. The points above must be considered when each index is created, but is there really no way to improve performance from the sharding angle once an index already exists? Of course there is. The first thing you can do is adjust the type of the shard allocator, specifically the cluster.routing.allocation.type property in elasticsearch.yml, which takes one of two values: even_shard and balanced (the default). even_shard tries to ensure that every node holds the same number of shards; balanced allocates according to weights and, compared with the older allocator, exposes some parameters and introduces the ability to tune the allocation process.
Shard allocation is reconsidered every time the data distribution of the cluster changes, most typically when a new data node is added to the cluster. The adjustment of shards is not triggered by a single threshold: ES has 11 built-in deciders that together determine whether to trigger a shard reallocation, which will not be repeated here. In addition, these allocation strategies can be updated at run time, and more shard-related configuration properties can be found with a quick search.
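As a hedged sketch of updating allocation behaviour at run time: dynamic cluster.routing.allocation.* properties can be changed through the cluster settings API (the specific property names below are examples only and differ between ES versions; the allocator type itself, cluster.routing.allocation.type, is read from elasticsearch.yml).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# Example dynamic allocation settings, applied without a restart; check the
# documentation for your version before using these exact keys.
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    }
})
```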
(ii) Routing optimization
Routing in ES, like routing in an IP network, is essentially a tag. When creating a document you can attach a routing value to it via a field. ES's internal mechanism guarantees that documents with the same routing value are assigned to the same shard, whether primary or replica. So during a query, once you specify the routing value you care about, ES can go straight to the machine holding the corresponding shard and skip much of the complex distributed coordination, which improves ES performance. At the same time, suppose machine 1 holds the documents with routing value A and machine 2 holds those with routing value B; then when I query with target routing value A, even if machine 2 is down, the results from machine 1 are barely affected, so routing also offers a degree of resilience for queries under failure conditions. Routing is, in essence, a bucketing operation. Of course, more than one routing value can be specified in a query, and the mechanism is similar.
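A minimal routing sketch, again assuming elasticsearch-py (index name, routing values, and fields are invented for illustration): documents indexed with the same routing value land on the same shard, and passing that value at query time restricts the search to that shard.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# Documents sharing routing value "user_a" are guaranteed to live on the same shard.
es.index(index="orders", id=1, routing="user_a", body={"item": "book", "user": "a"})
es.index(index="orders", id=2, routing="user_b", body={"item": "pen", "user": "b"})

# Supplying the routing value lets ES query only the shard that can hold matches.
hits = es.search(index="orders", routing="user_a",
                 body={"query": {"term": {"user": "a"}}})
print(hits["hits"]["total"])
```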
(iii) GC tuning on ES
Elasticsearch is essentially a Java program, so configuring the JVM garbage collector is itself a worthwhile task. We use the JVM's -Xms and -Xmx parameters to specify the memory size, essentially sizing the JVM's heap; when the JVM runs out of heap space it throws a fatal OutOfMemoryError, which means either there is genuinely not enough memory or there is a memory leak. To deal with GC problems, first locate the source of the problem; there are generally three approaches:
1. Turn on GC logging in Elasticsearch
2. Use the jstat command
3. Generate a memory dump
The first approach: the ES configuration file elasticsearch.yml has related properties that can be set; the purpose of every property will of course not be covered exhaustively here.
The second: the jstat command helps us look at the usage of each region of the JVM heap and how much time GC is taking.
The third: the last resort is to dump the JVM's heap into a file, which is essentially a snapshot of the JVM heap (a small sketch of both tools follows).
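For the second and third approaches, a small sketch that simply shells out to the standard JDK tools jstat and jmap (the process id and output file are placeholders; the flags shown are the commonly used ones):

```python
import subprocess

ES_PID = "12345"  # placeholder: the pid of the Elasticsearch JVM

# Approach 2: print heap-region utilisation and GC timings once a second, five times.
subprocess.run(["jstat", "-gcutil", ES_PID, "1000", "5"], check=True)

# Approach 3: dump the whole heap to a file, a snapshot to analyse offline (e.g. in MAT).
subprocess.run(["jmap", "-dump:format=b,file=es-heap.hprof", ES_PID], check=True)
```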
For more on GC tuning of the JVM itself, please refer to: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
In addition, GC can also be adjusted by modifying the startup parameters of the ES node, but this is essentially equivalent to the method above.
(iv) Avoid memory swapping
This one is simple: the operating system's virtual-memory paging mechanism can be a performance killer, for example when memory fills up and data is written out to the swap partition on Linux.
This can be done by setting bootstrap.mlockall to true in the elasticsearch.yml file, although administrator privileges are needed to modify the operating system's related configuration files.
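A hedged way to check that the setting actually took effect (the exact field path in the nodes-info response varies by ES version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

# On the older versions this article targets, each node's process info reports
# whether the heap was successfully locked into memory.
for node in es.nodes.info()["nodes"].values():
    print(node["name"], node.get("process", {}).get("mlockall"))
```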
(v) Controlling index merging
As mentioned above, the shards and replicas in ES are essentially Lucene indexes, and a Lucene index is built from multiple index segments (at least one). The vast majority of index files are written only once and read many times; under the control of Lucene's internal mechanism, when certain conditions are met, multiple index segments are merged into one larger segment and the old segments are discarded and deleted. This is called segment merging.
The reason Lucene performs segment merges is simple: the finer-grained the index segments, the lower the query performance and the more memory they consume. Frequent document changes produce large numbers of small index segments, which can lead to problems such as too many open file handles, forcing you, for example, to raise the operating system's limit on the maximum number of open files. In general, as index segments are merged into fewer, larger segments, the segment count drops and ES performance improves. For developers, all we can do is choose an appropriate merge policy; the merge itself is entirely Lucene's job, but since Lucene exposes more and more configuration interfaces, newer versions of ES offer three merge policies, tiered, log_byte_size and log_doc, and two schedulers for merging Lucene index segments: concurrent and serial. Their specific differences are not repeated here; this is just meant to point you in the right direction.
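A hedged sketch of choosing a merge policy and scheduler at index-creation time, for the older ES versions this article targets (these index.merge.* settings were later removed or renamed, so treat the keys and values as illustrative only):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder address

es.indices.create(
    index="logs-write-heavy",
    body={"settings": {
        "index.merge.policy.type": "tiered",          # or log_byte_size / log_doc
        "index.merge.policy.segments_per_tier": 10,   # example tuning knob
        "index.merge.scheduler.type": "concurrent",   # or serial; exact values vary by version
    }},
)
```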