Elasticsearch 2 (9): The Story Beneath Elasticsearch (Illustrated)
This article gives first a top-down, then a bottom-up introduction to how Elasticsearch works underneath, and tries to answer the following questions:
Why doesn't my search *foo-bar* match foo-bar?
Why can adding more documents make the index smaller (more compressed)?
Why does Elasticsearch use so much memory?
Version
Elasticsearch version: elasticsearch-2.2.0
The illustrations depict an Elasticsearch cluster in the cloud
The boxes in the cluster
Each white box in the cloud represents a node.
Between the nodes
Within one or more nodes, multiple small green squares combine to form an Elasticsearch index.
The small squares in the index
Within an index, the small green squares distributed across multiple nodes are called shards.
Shard = Lucene index
An Elasticsearch shard is essentially a Lucene index.
Lucene is a full-text search library (many other kinds of search libraries exist), and Elasticsearch is built on top of Lucene. The rest of the story is really about how Elasticsearch works on top of Lucene.
Illustrated: Lucene's mini-index, the segment
Inside Lucene there are many small segments, which we can think of as mini-indexes within Lucene.
Inside a segment
A segment holds many data structures:
- Inverted Index
- Stored fields
- Document Values
- Cache
The most important: the inverted index
The inverted index consists of two main parts:
- An ordered term dictionary (containing each term and its frequency).
- For each term, a postings list (the list of documents that contain that term).
When we search, we first tokenize the query, then look each term up in the dictionary to find the documents relevant to the search.
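As a minimal sketch (illustrative only; Lucene's real structures are far more compact and sophisticated), an inverted index can be modeled as a map from each term to a sorted postings list, and search as an intersection of the postings of the query's terms:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive tokenization
            index[term].add(doc_id)
    # Keep the dictionary ordered and the postings sorted by doc ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

def search(index, query):
    """Tokenize the query, then intersect the postings of each term."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the fury of the storm",
    2: "winter is coming",
    3: "the fury road",
}
index = build_inverted_index(docs)
```

Here `search(index, "the fury")` returns `[1, 3]`: only the documents that contain both terms.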
Query "The Fury"
Autocompletion (prefix search)
If you want to find terms that begin with the letter "c", a binary search over the sorted dictionary quickly locates words such as "choice" and "coming".
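A sketch of this idea using Python's standard `bisect` module on a toy sorted term dictionary (the word list is made up for illustration):

```python
import bisect

# A sorted term dictionary, like Lucene's ordered dictionary.
terms = sorted(["ape", "choice", "coming", "cow", "storm"])

def prefix_match(terms, prefix):
    """Binary-search the sorted dictionary for the range of terms
    starting with the given prefix."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]
```

`prefix_match(terms, "c")` returns `["choice", "coming", "cow"]` in two binary searches, without scanning the whole dictionary.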
Expensive lookups
If you want to find all terms that contain "our", the system has to scan the entire inverted index, which is very expensive.
In this case, optimization comes down to one question: how do we generate the right terms?
Reframing the problem
For issues like the one above, there are several possible solutions:
suffix → xiffus
If we want to search by suffix, we can index the reversed term.
(60.6384, 6.5017) → U4U8GYYKK
For geo locations, convert the coordinates to a geohash.
123 → {1-hundreds, 12-tens, 123}
For plain numbers, generate terms at multiple precisions.
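Two of these transformations can be sketched in a few lines (the geohash example above is reproduced from the text, not recomputed here):

```python
def suffix_term(word):
    """Index the reversed word, so a suffix search becomes a cheap
    prefix search over reversed terms: "suffix" -> "xiffus"."""
    return word[::-1]

def numeric_terms(n):
    """Index a number at several precisions, so range-like lookups can
    match coarse terms: 123 -> {"1", "12", "123"}."""
    s = str(n)
    return {s[:i] for i in range(1, len(s) + 1)}
```

With these terms in the index, "all words ending in -fix" or "all numbers in the 120s" no longer require a full scan.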
Solving spelling errors
A Python library can build a tree-shaped state machine encoding the possible misspellings of a word, which addresses the spelling-error problem.
Stored fields (field lookup)
The inverted index cannot help when we want to retrieve the actual contents of a field, such as a document's title, so Lucene provides another data structure, stored fields, to solve this. Stored fields are essentially simple key-value pairs. By default, Elasticsearch stores the entire JSON source of each document there.
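A sketch of stored fields as a per-document key-value map; the `_source` key mimics Elasticsearch's default of keeping the whole JSON document, and the field names are made up for illustration:

```python
import json

# Stored fields: doc ID -> {field name: stored value}.
stored_fields = {}

def store(doc_id, doc):
    """Keep the whole JSON source plus an individually stored field."""
    stored_fields[doc_id] = {
        "_source": json.dumps(doc),   # the full original document
        "title": doc["title"],        # a field stored on its own
    }

store(1, {"title": "The Fury", "body": "some text"})
```

Fetching a field is then a direct lookup by document ID, with no inverted-index traversal involved.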
Document values: for sorting and aggregation
Even so, the structures above still don't serve operations like sorting, aggregations, and facets well, because we would end up reading a lot of information we don't need.
Another data structure solves this problem: document values. It is essentially columnar storage, which optimizes how data of the same type is stored.
For efficiency, Elasticsearch can load all of the document values under an index into memory, which greatly improves access speed but also consumes a lot of memory.
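A sketch of the columnar idea: one dense column per field, in document order, so sorting and a simple value-count aggregation read only that column (the field name and values are invented):

```python
from collections import Counter

# Doc values: a dense column of the "views" field, one value per doc,
# aligned with the doc-ID order.
doc_ids      = [1,  2,  3,  4]
views_column = [10, 50, 20, 50]

# Sort documents by the column (descending) without touching sources:
order = sorted(range(len(doc_ids)), key=lambda i: views_column[i], reverse=True)
docs_by_views = [doc_ids[i] for i in order]

# Aggregate (a value-count facet) straight off the column:
views_facet = Counter(views_column)
```

The full documents are never loaded; both operations scan a single flat array, which is what makes doc values cheap for sorting and aggregation.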
To summarize: these data structures (the inverted index, stored fields, and document values) along with their caches all live inside the segment.
When a search happens
During a search, Lucene queries all segments, then merges each segment's results into the final result set that is returned to the client.
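A toy model of that flow, assuming each segment returns scored hits and the shard merges them into a global top-N (the scores and doc IDs are invented):

```python
import heapq

def search_segment(segment, term):
    """Each 'segment' here is a dict: term -> [(score, doc_id), ...]."""
    return segment.get(term, [])

def search_shard(segments, term, top_n=3):
    """Query every segment, then merge into the globally best hits."""
    hits = []
    for seg in segments:
        hits.extend(search_segment(seg, term))
    return heapq.nlargest(top_n, hits)

segments = [
    {"fury": [(0.9, 1), (0.4, 2)]},
    {"fury": [(0.7, 7)]},
]
```

`search_shard(segments, "fury")` yields `[(0.9, 1), (0.7, 7), (0.4, 2)]`: per-segment results merged into one ranked list.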
A few properties of Lucene make this process important:
Segments are immutable
Delete? When a delete happens, Lucene only marks the document as deleted; the document itself stays where it is, unchanged.
Update? An update is therefore essentially a delete followed by re-indexing the document.
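The immutability story sketched in Python: a delete only records a tombstone, and an update is a delete plus a re-index into a new segment:

```python
# An immutable segment: never modified after it is written.
old_segment = {1: "old text", 2: "other"}
tombstones = set()        # deletion marks, kept outside the segment
new_segment = {}          # freshly indexed documents land here

def delete(doc_id):
    tombstones.add(doc_id)        # mark only; old_segment is untouched

def update(doc_id, text):
    delete(doc_id)                # delete the old copy...
    new_segment[doc_id] = text    # ...then re-index into a new segment

def is_live(doc_id):
    return doc_id not in tombstones

update(1, "new text")
```

After the update, the old bytes are still physically present in `old_segment`; only a later merge would actually drop them.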
Compression everywhere
Lucene is very good at compressing data; basically every textbook compression method can be found in Lucene.
Cache everything
Lucene also caches aggressively, which greatly improves its query performance.
The story of a cache
When Elasticsearch indexes a document, the corresponding caches are created for it, and the index is refreshed periodically (once per second by default), after which the document becomes searchable.
As time passes, we accumulate many segments,
so Elasticsearch merges them, and in the process the merged-away segments are eventually deleted.
This is why adding documents can actually make the index take less space: it triggers a merge, which can lead to better compression.
Here's an example:
Two segments are about to merge.
The two segments are eventually deleted and replaced by the new, merged segment.
At this point the new segment is "cold" in the cache, while most segments are unchanged and stay "warm".
Scenes like this happen constantly inside a Lucene index.
Searching within a shard
The way Elasticsearch searches a shard is similar to the way Lucene searches its segments.
Unlike searching within Lucene segments, shards may be spread across different nodes, so when searching and returning results, all the information travels over the network.
Note that:
searching 1 index with 2 shards = 2 separate shard searches
Handling log files
When we want to search logs from a particular date, splitting the log data into time-based indices by timestamp greatly improves search efficiency.
It also makes deleting old data easy: just drop the old index.
In the figure above, each index has two shards.
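A sketch of time-based indices, assuming a hypothetical one-index-per-day naming scheme:

```python
import datetime

def index_name_for(ts):
    """Hypothetical naming scheme: one index per day."""
    return "logs-" + ts.strftime("%Y.%m.%d")

indices = {}  # index name -> list of log lines

def index_log(ts, line):
    indices.setdefault(index_name_for(ts), []).append(line)

index_log(datetime.datetime(2016, 3, 1), "GET /")
index_log(datetime.datetime(2016, 3, 2), "POST /login")

# Deleting old data = dropping an entire index, not scanning documents:
indices.pop("logs-2016.03.01", None)
```

A dated search only needs to hit the index for that day, and retention is a cheap whole-index drop rather than a per-document delete.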
How to scale
A shard is never split further, but a shard can be moved to a different node.
So if the pressure on the cluster keeps growing, we can consider adding new nodes; but because shards cannot be split, increasing the shard count would require re-indexing all the data, which we want to avoid. That is why we need to plan ahead and strike a balance between having too few and too many shards.
Node assignment and shard optimization
- Place the indices of more important data on nodes with better-performing machines
- Make sure each shard has a replica
Routing
Every node keeps a copy of the routing table, so when a request reaches any node, Elasticsearch can forward it to the node holding the shard it needs for further processing.
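A sketch of the routing idea: hash the routing key (the document ID by default) and take it modulo the shard count. `crc32` here is only a deterministic stand-in for Elasticsearch's real murmur3-based routing hash:

```python
import zlib

def pick_shard(routing_key, num_shards):
    """Deterministically map a routing key to a shard number.
    (crc32 stands in for Elasticsearch's murmur3-based hash.)"""
    return zlib.crc32(str(routing_key).encode()) % num_shards
```

Because every node computes the same function over the same routing table, any node can forward a request to the correct shard. This is also why the shard count is fixed at index creation: changing `num_shards` would change where every existing document belongs.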
A real request
Query
The query is of type filtered (ES 2.x) and contains a multi_match query.
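Such a request body could look roughly like the following, written here as a Python dict; the filter field and search terms are made up for illustration:

```python
# A filtered query (ES 2.x style) wrapping a multi_match query.
request_body = {
    "query": {
        "filtered": {
            # The filter part: cacheable, no scoring.
            "filter": {"term": {"status": "published"}},
            # The query part: scored full-text matching over two fields.
            "query": {
                "multi_match": {
                    "query": "elasticsearch",
                    "fields": ["title", "body"],
                }
            },
        }
    }
}
```

The filter narrows the candidate set cheaply, and only the multi_match part computes relevance scores, which previews the filter-vs-query advice later in the article.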
Aggregation
Aggregate on the author field, fetching the top-10 authors' information for the top-10 hits.
Request distribution
The request may be dispatched to any node in the cluster.
The coordinating node
That node then becomes the coordinator of the current request and decides:
- which nodes the request should be routed to, based on the index information
- which replicas are available
- and so on
Routing
Before the actual search
Elasticsearch converts the query into a Lucene query,
then executes it across all of the segments.
The filter conditions themselves are also cached,
but queries are not cached, so if the same query is executed repeatedly, the application needs to do its own caching.
Therefore:
- use filters whenever you can
- use queries only when scoring is needed
Returning the results
When the search is finished, the results are returned back up along the same path the request traveled down.
References
Reference sources:
- SlideShare: Elasticsearch from the Bottom Up
- YouTube: Elasticsearch from the Bottom Up
- Wikipedia: Document-term matrix
- Wikipedia: Search engine indexing
- Skip list
- Stanford Edu: Faster postings list intersection via skip pointers
- StackOverflow: How does a search index work when querying many words?
- StackOverflow: How does Lucene calculate intersection of documents so fast?
- Lucene and its magical indexes
- misspellings 2.0c: a tool to detect misspellings
Translated from: The Story Beneath Elasticsearch (Illustrated)