In this series of articles, we look at Elasticsearch from a new perspective. We start at the lowest layers of abstraction and work our way up to what the user sees, studying Elasticsearch's internal data structures and behavior along the way.
Contents: Introduction, Inverted Indexes and Index Terms, Creating an Index, Index Segments, Elasticsearch Indexes, Transactions, Summary
Introduction
The purpose of this article is to better understand how search engines such as Elasticsearch and Lucene work. To drive a car, you only need to start it and step on the pedal, but an experienced driver also has some idea of how the car works. The same is true for search engines. Elasticsearch provides an easy-to-use API that covers most needs, but to get the most out of Elasticsearch it helps to understand the underlying algorithms and data structures.
We start with the most basic structure, the inverted index. Inverted indexes are a very useful data structure and are simple to understand. Lucene is a highly optimized implementation of an inverted index, but this article does not go deep into Lucene's implementation details. Instead, we first look at how an inverted index is built and used, because that is what shapes how we search and index documents.
With the inverted index as the lowest layer of abstraction, we will discuss:
How a simple search is performed.
Which kinds of searches can (and cannot) be performed efficiently, and why: with an inverted index, we try to turn problems into string prefix matching problems.
Why text processing is important.
How an index is built in segments, and how that affects searching and updating.
What makes up a Lucene index.
How an Elasticsearch index is composed of Lucene indexes.
With that in place, we will understand what happens when searching and indexing on a single Elasticsearch node. The second article in this series describes how Elasticsearch handles distribution.
Inverted Indexes and Index Terms
We have three simple documents: "Winter is coming.", "Ours is the Fury." and "The choice is yours." After some simple text processing (lowercasing, removing punctuation and splitting into words), we can build the inverted index shown in the figure above.
The inverted index maps each term to the documents that contain it (and, where recorded, to the positions at which it occurs in those documents). Because the terms in the dictionary are sorted, we can quickly find a term and then its occurrences in the postings structure. This is the opposite of a forward index, which lists the terms associated with a given document.
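To make the structure concrete, here is a minimal Python sketch that builds an inverted index for the three example documents. The function names `tokenize` and `build_inverted_index` are our own for illustration, and the text processing is deliberately naive; it is not how Lucene implements analysis or stores postings.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text, drop punctuation, and split it into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the terms mimics the sorted term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = ["Winter is coming.", "Ours is the Fury.", "The choice is yours."]
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Each dictionary key plays the role of a term, and each value plays the role of that term's postings list.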
A simple multi-term search works as follows: first look up every search term in the dictionary to find its postings, then take the intersection (for AND searches) or the union (for OR searches) of those document sets to get the final result. More complex searches follow the same basic process.
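The lookup-then-combine process can be sketched in a few lines of Python. The hard-coded `index` below is the inverted index for the three example documents, and `search` is a hypothetical helper of ours; a real engine works on sorted, compressed postings lists rather than Python sets.

```python
# Inverted index for the three example documents (document IDs 0, 1, 2).
index = {
    "choice": [2], "coming": [0], "fury": [1], "is": [0, 1, 2],
    "ours": [1], "the": [1, 2], "winter": [0], "yours": [2],
}


def search(index, terms, mode="and"):
    """Look up each term's postings, then intersect (AND) or union (OR) them."""
    postings = [set(index.get(term, [])) for term in terms]
    if not postings:
        return []
    combine = set.intersection if mode == "and" else set.union
    return sorted(combine(*postings))


print(search(index, ["is", "the"]))             # AND search
print(search(index, ["winter", "fury"], "or"))  # OR search
```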
So an index term is the unit of search, and the terms we generate determine which searches can (or cannot) be performed efficiently. With the dictionary above, for example, we can quickly find all terms that begin with "c". But we cannot quickly find all terms that contain "ours": to do so, we would have to traverse every term and discover that "yours" also contains the substring. Unless the index is small, this is expensive. In terms of complexity, looking up terms by prefix is O(log n), while finding terms by an arbitrary substring is O(n).
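The difference between the two lookups can be illustrated with a sorted Python list standing in for the term dictionary. This is only a sketch: `terms_with_prefix` and `terms_containing` are names we made up, and Lucene's actual term dictionary is a far more compact structure.

```python
from bisect import bisect_left

terms = sorted(["choice", "coming", "fury", "is", "ours", "the", "winter", "yours"])


def terms_with_prefix(sorted_terms, prefix):
    """Binary-search to the first candidate, then read the contiguous run of matches."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def terms_containing(sorted_terms, substring):
    """No ordering helps here: every term has to be inspected."""
    return [term for term in sorted_terms if substring in term]


print(terms_with_prefix(terms, "c"))
print(terms_containing(terms, "ours"))
```

The prefix lookup needs one binary search plus a scan over the matches only, whereas the substring lookup always touches all n terms.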
In other words, we can find terms very efficiently when we are given a string prefix. When all we have is an inverted index, we want every lookup to look like a string prefix problem. Here are a few examples of such transformations. Some are simple, and the last one borders on magic.
To find all terms ending with "tastic", we can index the reversed term (for example "fantastic" → "citsatnaf") and then search for the prefix "citsat".
Finding substrings usually involves splitting terms into smaller terms called n-grams. For example, "yours" is split into "^yo", "you", "our", "urs", "rs$", so when we search for "our" and "urs" we also find the occurrences of "yours".
For languages with compound words, such as Norwegian and German, we need to decompose the terms. For example, "Donaudampfschiff" is decomposed into {"Donau", "Dampf", "Schiff"}, so that it can be found when searching for "Schiff".
Geographic coordinates are converted to geo hashes. For example, (60.6384, 6.5017) is converted to "u4u8gyykk". The longer the string, the higher the precision.
Phonetic matching is very useful for people's names. An algorithm such as Metaphone converts "Smith" into {"SM0", "XMT"}.
When processing numeric and timestamp data, Lucene automatically generates several terms of varying precision that form a prefix structure, so range searches are very efficient. Put simply, the number 123 is stored as "1" hundreds, "12" tens and "123". Searching for everything in [100, 199] then becomes finding everything that matches the "1" hundreds term. This is different from searching for terms that start with "1", because such a search would also include "1234".
To answer "Did you mean?" searches and find similar spellings, an edit-distance automaton is built and used to traverse the term dictionary. This is a very complicated operation, but Lucene takes care of all of it.
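Two of these transformations, term reversal and n-grams, are small enough to sketch directly in Python. The helper names `reversed_term` and `ngrams` are ours, and the `^` and `$` boundary markers follow the "yours" example above; real analyzers are configurable and more elaborate.

```python
def reversed_term(term):
    """Index the reversed term so that a suffix search becomes a prefix search."""
    return term[::-1]


def ngrams(term, n=3):
    """Split a term into overlapping n-grams, with ^ and $ marking its boundaries."""
    padded = f"^{term}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Everything ending in "tastic" starts with "citsat" once reversed.
print(reversed_term("fantastic").startswith(reversed_term("tastic")))

# A search for "ours" becomes a search for the n-grams "our" and "urs".
print(ngrams("yours"))
```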
Text processing deserves in-depth treatment in later articles. Here we have highlighted why it matters so much when generating index terms: it is what makes searches efficient.
Creating an Index
When creating an index, there are a few things to weigh up in advance: search speed, index compactness, indexing speed, and the time it takes for changes to become visible to searches after a document is indexed or updated.
Search speed and index compactness are related: the smaller the index being searched, the less data has to be processed and the more of it fits in memory. Compression, however, costs time when documents are indexed.
There are many compression techniques for reducing index size. For example, to store postings (which can become very large), Lucene uses tricks such as delta encoding ([42, 100, 666] is stored as [42, 58, 566]) and variable-length byte encoding (so small numbers can be stored in a single byte).
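Both tricks can be sketched in Python. The function names are our own, and the byte layout shown (seven data bits per byte, low-order group first, high bit set when more bytes follow) is one common variable-byte scheme chosen for illustration; we do not claim it matches Lucene's files byte for byte.

```python
def delta_encode(doc_ids):
    """Store each sorted document ID as the gap to the previous one."""
    previous = 0
    deltas = []
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original IDs by summing the gaps."""
    total = 0
    doc_ids = []
    for delta in deltas:
        total += delta
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(number):
    """Encode a non-negative integer using as few bytes as its size requires."""
    out = bytearray()
    while number >= 0x80:
        out.append((number & 0x7F) | 0x80)  # high bit: more bytes follow
        number >>= 7
    out.append(number)
    return bytes(out)


deltas = delta_encode([42, 100, 666])
print(deltas)
print([len(vbyte_encode(d)) for d in deltas])
```

Because the gaps are smaller than the original IDs, more of them fit in a single byte, which is why the two techniques work well together.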
Keeping the data structures small and compact means sacrificing the ability to update them efficiently. In fact, Lucene's index files are immutable, so they are never updated at all. This is unlike a B-tree, for example, which can be updated in place and is designed with updates in mind.
Deletion also works differently from what you might expect. When a document is deleted from an index, it is only marked as deleted in a bitmap. The index structures themselves are not updated.
Consequently, updating an indexed document means deleting it and then re-inserting the updated version. Updating a document is therefore more expensive than adding one. Because there is no in-place update, Lucene is not a good place to store values that change frequently.
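The delete-then-reinsert behavior can be modeled with a toy append-only structure. `ToyIndex` and its attributes are hypothetical names of ours, and a Python list of booleans stands in for the deletion bitmap; this is a model of the idea, not of Lucene's classes.

```python
class ToyIndex:
    """Append-only document store with a deletion bitmap."""

    def __init__(self):
        self.docs = []       # documents are only ever appended
        self.deleted = []    # one flag per slot, standing in for the bitmap
        self.live_slot = {}  # external ID -> slot holding the current version

    def add(self, doc_id, text):
        self.docs.append((doc_id, text))
        self.deleted.append(False)
        self.live_slot[doc_id] = len(self.docs) - 1

    def delete(self, doc_id):
        # Nothing is removed: the slot is only flagged as deleted.
        slot = self.live_slot.pop(doc_id, None)
        if slot is not None:
            self.deleted[slot] = True

    def update(self, doc_id, text):
        # An update is a delete followed by a fresh insert.
        self.delete(doc_id)
        self.add(doc_id, text)

    def live_docs(self):
        return [doc for doc, dead in zip(self.docs, self.deleted) if not dead]


toy = ToyIndex()
toy.add("a", "version 1")
toy.update("a", "version 2")
print(len(toy.docs), toy.deleted, toy.live_docs())
```

After the update, both versions still occupy space. The old one is merely flagged, which is why frequent updates are costly.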
When new documents are added (possibly through an update), the changes to the index are first buffered in memory. Eventually, the index files are flushed to disk in their entirety. Note that "flush" is used here in the Lucene sense. An Elasticsearch flush involves a Lucene commit and more, which is covered in the section on the transaction log.
When to flush depends on several factors: how quickly changes must become visible, the memory available for buffering, I/O saturation, and so on. In general, for indexing speed, the larger the buffer the better, as long as your I/O system can keep up with the flushes. We describe this in more detail in the next section.
The files that are written make up an index segment.
Index Segments
A Lucene index consists of one or more immutable index segments, and each segment can be thought of as a "mini index". When searching, Lucene searches every segment, filters out deleted documents, and merges the results from all segments. As the number of segments grows, this becomes more and more tedious. To keep the number of segments manageable, Lucene occasionally merges segments into new segments according to a merge policy. Lucene developer Mike McCandless has a good post explaining and visualizing segment merging. Documents that are flagged as deleted are actually discarded when segments are merged. This is why adding more documents can sometimes result in a smaller index: the addition can trigger a merge that slims the index down.
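A rough Python model of searching and merging segments follows. Each segment here is a plain dictionary with `postings` and `deleted` entries, both names of our own choosing, and we assume document IDs are unique across segments to keep the sketch short (Lucene's document IDs are actually relative to each segment).

```python
def search_segments(segments, term):
    """Search every segment, skip deleted documents, and merge the hits."""
    hits = []
    for segment in segments:
        for doc_id in segment["postings"].get(term, []):
            if doc_id not in segment["deleted"]:
                hits.append(doc_id)
    return sorted(hits)


def merge_segments(segments):
    """Build one new segment, discarding documents that were marked as deleted."""
    merged = {"postings": {}, "deleted": set()}
    for segment in segments:
        for term, doc_ids in segment["postings"].items():
            live = [d for d in doc_ids if d not in segment["deleted"]]
            if live:
                merged["postings"].setdefault(term, []).extend(live)
    for doc_ids in merged["postings"].values():
        doc_ids.sort()
    return merged


segment_a = {"postings": {"is": [0, 1], "winter": [0]}, "deleted": {0}}
segment_b = {"postings": {"is": [2], "the": [2]}, "deleted": set()}

print(search_segments([segment_a, segment_b], "is"))
merged = merge_segments([segment_a, segment_b])
print(merged["postings"])
```

The merged segment returns the same search results as the two originals, but document 0 and the term "winter", which only it contained, no longer take up space.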
Elasticsearch and Lucene generally do a good job of deciding when to merge segments. Elasticsearch's merge policy can be tuned through the merge settings. You can also use the optimize API to force a merge.
Before segments are written to disk, changes are buffered in memory. Previously (before Lucene 2.3), every added document had its own tiny segment in memory, and these were merged only when written to disk. Now there is a DocumentsWriter, which can build larger in-memory segments from several documents. In Lucene 4, each thread has its own DocumentsWriter, which improves indexing performance by allowing segments to be written concurrently. (Previously, indexing had to wait until the write to disk had finished.)
When a new segment is created (by a new write or by a merge), some caches are necessarily invalidated, which can hurt search performance. Caches such as the field cache and the filter cache are tied to segments, so each segment has several caches of its own. Elasticsearch has a warmer API, so that the necessary caches can be prepared before a new segment is made available for search.
The most common cause of flushes in Elasticsearch is probably the continuous index refreshing, which by default happens once per second. As new segments are flushed, they become available for searching. Although a flush is not as expensive as a commit (because it does not wait for the write to be acknowledged), it still creates a new segment, invalidates some caches, and may trigger a merge.
When indexing throughput matters, for example when bulk indexing, spending a lot of time flushing and merging small segments is wasteful. In such cases it is a good idea to temporarily increase the refresh_interval setting, or to disable automatic refreshing altogether. You can always refresh manually once indexing is done.
Elasticsearch Indexes
"All problems in computer science can be solved by another level of indirection." – David J. Wheeler
An Elasticsearch index consists of one or more shards, each of which can have zero or more replicas. These shards are individual Lucene indexes. That is, every Elasticsearch index is made up of several Lucene indexes, and every Lucene index is made up of several index segments. When you search an Elasticsearch index, the search is executed on all shards, and therefore on all segments, and the results are merged. The same is true when searching multiple Elasticsearch indexes. In fact, searching two Elasticsearch indexes with one shard each is almost the same as searching one index with two shards. In both cases, two Lucene indexes are searched.
From this point onwards, "index" on its own refers to an Elasticsearch index.
Shards are the basic unit of scaling in Elasticsearch. When a document is added to an index, it is routed to a shard. By default, this is done in a round-robin fashion based on the hash of the document ID. In the second article of this series, we will see more about how shards work. Note that the number of shards is fixed when the index is created and cannot be changed afterwards. A talk by Shay Banon on Elasticsearch explains well why a shard is a complete Lucene index, what the advantages of this approach are, and which trade-offs it makes compared with other approaches.
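Hash-based routing can be sketched as "hash the document ID, then take the remainder modulo the number of shards". The Python below uses `zlib.crc32` purely as a stand-in for a stable hash function, and `shard_for` is a hypothetical name; neither is taken from Elasticsearch itself.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# The same ID always lands on the same shard, so a document can be found again.
placement = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}

# With a different shard count, many documents would map to a different shard.
moved = sum(1 for doc_id in doc_ids if shard_for(doc_id, 6) != placement[doc_id])
print(moved, "of", len(doc_ids), "documents would change shard")
```

The second half hints at why the number of shards is fixed at creation time: changing the divisor changes where existing documents are expected to live.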
Elasticsearch searches are flexible in terms of which indexes and shards they target. Index name patterns, index aliases, and search routing make many different data flow strategies possible. We will not go into them here, but we recommend Zachary Tong's article "Customizing Your Document Routing" and Shay Banon's talk "Big Data, Search and Analytics". A couple of examples may give some idea. Time-based data such as logs is often organized with one index per day (or week, or month), so that we can efficiently limit searches to a time range and easily delete old data. Although deleting documents from an index is cumbersome, deleting an entire index is very cheap. When searches must be restricted to a particular user, it is useful to route all of that user's documents to the same shard, which reduces the number of shards that have to be searched.
Transactions
Although Lucene has a concept of transactions, Elasticsearch does not. All operations in Elasticsearch are added to the same timeline, which is not necessarily consistent across nodes, because flushing depends on timing on each node.
Managing the isolation and visibility of segments in different shards on different nodes of a distributed system is very hard. Rather than attempting this, Elasticsearch prioritizes being fast.
Elasticsearch has a transaction log to which documents to be indexed are appended. Appending to a log is much cheaper than building index segments, so Elasticsearch can persist a document as well as write it to the in-memory buffer. You can also specify a consistency level when indexing a document. For example, you can require that every replica has indexed the document before the index operation returns.
Summary
To sum up, these are the important properties to be aware of when Lucene creates, updates, and searches an index on a single node:
How we process the text we index determines how we can search. Proper text analysis is important.
An index is first built in memory and then flushed to disk in segments.
Index segments are immutable. Deleting a document only marks it as deleted.
An index consists of many segments. A search is executed on every segment, and the results are merged.
Segments are merged from time to time.
Field caches and filter caches are per segment.
Elasticsearch does not have transactions.
In the next article in this series, we will look at how searching and indexing are carried out across a cluster.
Reference Links:
[1]https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
[2]http://lucene.apache.org/core/4_4_0/core/overview-summary.html
[3]http://blog.trifork.com/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
[4]http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
[5]https://www.elastic.co/blog/customizing-your-document-routing