I recently took part in designing a real-time statistical query system over a large data volume (billions of records), with Elasticsearch as the underlying data store. I spent some time learning the basic theory behind Elasticsearch and have organized my notes here, hoping they are helpful to anyone who wants to get to know Elasticsearch. If you find anything incorrect or doubtful, please point it out so we can discuss, learn, and improve together.
Introduction
Elasticsearch is a distributed, scalable, real-time search and analytics engine.
Elasticsearch is a search engine built on top of the full-text search library Apache Lucene(TM). Of course, Elasticsearch is not just Lucene; beyond full-text search it also provides:
- Distributed real-time document storage, with every field indexed and searchable.
- A distributed search engine for real-time analytics.
- The ability to scale out to hundreds of servers and handle petabytes of structured or unstructured data.
Basic concepts
First, how Elasticsearch stores data: Elasticsearch is a document-oriented database, where one record is one document, serialized as JSON. For example, here is a piece of user data:
{"name": sex": "Male", "age ": 25," birthdate ": "1990/05/01", "about": interests": [ "Sports",
It is natural to think of a database such as MySQL, where you would create a user table with a bunch of fields. In Elasticsearch, this piece of data is a document, the document belongs to a type (here, user), and types live inside an index. Here is a rough mapping between Elasticsearch and relational database terms:
Relational database ⇒ databases ⇒ tables ⇒ rows ⇒ columns
Elasticsearch ⇒ indices ⇒ types ⇒ documents ⇒ fields
An Elasticsearch cluster can contain multiple indices (databases), each index can contain multiple types (tables), each type holds many documents (rows), and each document has many fields (columns).
You can interact with Elasticsearch either through its Java API or through its RESTful HTTP API. For example, to insert a record we can simply send an HTTP request:
PUT /megacorp/employee/1
{ "name": "John", "sex": "Male", "age": 25, "about": "I love to go rock climbing", "interests": ["sports", "music"] }
Updates and queries are similar operations; for details see Elasticsearch: The Definitive Guide.
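For instance, the same insert plus a couple of queries can be done in a few lines of Python; this is a rough sketch that assumes an Elasticsearch instance on localhost:9200 and the pre-7.x index/type layout used in this example, with the requests library doing the HTTP work:

```python
# Sketch: talk to a local Elasticsearch over its REST API (host/port assumed).
import requests

base = "http://localhost:9200"

# Insert the document shown above.
doc = {"name": "John", "sex": "Male", "age": 25,
       "about": "I love to go rock climbing", "interests": ["sports", "music"]}
print(requests.put(f"{base}/megacorp/employee/1", json=doc).json())

# Query it back: by ID, and by a full-text match on the "about" field.
print(requests.get(f"{base}/megacorp/employee/1").json())
query = {"query": {"match": {"about": "rock climbing"}}}
print(requests.get(f"{base}/megacorp/employee/_search", json=query).json())
```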
Index
The key thing Elasticsearch provides is powerful indexing. InfoQ's article "The Secret of Time Series Databases (2): Indexing" covers this very well; here I walk through the same ground with my own understanding, which will hopefully also help you get more out of that article.
The essence of Elasticsearch indexing:
Everything is designed to improve the performance of your search
The flip side: to improve search performance, something else inevitably has to be sacrificed, such as insert/update performance; otherwise the other databases would be out of a job. :)
We saw earlier that inserting a record into Elasticsearch means sending a JSON object with a number of fields, such as name, sex, age, about and interests in the example above. While storing the data, Elasticsearch also quietly builds an inverted index for each of those fields, because search is Elasticsearch's core function.
How does Elasticsearch index so fast?
The InfoQ article says that Elasticsearch uses a structure called an inverted index, which is faster than the B-tree indexes of relational databases.
What is a B-tree index?
As our teachers taught us back in college: searching a binary tree is O(log n), and inserting a new node does not require moving all the other nodes, so storing an index as a tree balances insert and query performance.
On top of that, taking the read characteristics of disks into account (sequential reads versus random reads), traditional relational databases use data structures such as B-trees/B+trees:
To improve query efficiency and reduce the number of disk seeks, multiple values are packed into an array stored in one contiguous block, so a single seek reads several entries at once; this also reduces the height of the tree.
What is an inverted index?
Continuing with the example above, suppose we have the following records (for simplicity, the about and interests fields are removed):
| ID | Name | Age | Sex |
| --- | --- | --- | --- |
| 1 | Kate | 24 | Female |
| 2 | John | 24 | Male |
| 3 | Bill | 29 | Male |
The IDs are document IDs assigned by Elasticsearch itself. The indexes Elasticsearch builds then look like this:
Name:

| Term | Posting List |
| --- | --- |
| Kate | 1 |
| John | 2 |
| Bill | 3 |
Age:

| Term | Posting List |
| --- | --- |
| 24 | [1,2] |
| 29 | 3 |
Sex:

| Term | Posting List |
| --- | --- |
| Female | 1 |
| Male | [2,3] |
Posting List
Elasticsearch builds an inverted index for each field: Kate, John, 24, Female, Male and so on are called terms, and each term points to a posting list. A posting list is an array of ints storing the IDs of all documents that contain that term.
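As a toy illustration (not how Lucene stores things on disk), the tables above could be built in memory as a plain dict from field to term to posting list:

```python
# Minimal sketch: build an inverted index for the three example documents.
from collections import defaultdict

docs = {
    1: {"name": "Kate", "age": 24, "sex": "Female"},
    2: {"name": "John", "age": 24, "sex": "Male"},
    3: {"name": "Bill", "age": 29, "sex": "Male"},
}

inverted = defaultdict(lambda: defaultdict(list))   # field -> term -> posting list
for doc_id, doc in docs.items():
    for field, value in doc.items():
        inverted[field][value].append(doc_id)       # doc IDs arrive already sorted here

print(inverted["age"][24])      # [1, 2]
print(inverted["sex"]["Male"])  # [2, 3]
```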
Don't think it ends here though; the interesting part has only just begun...
With posting lists alone it already seems we can find things quickly. For example, to find the students with age=24, keen Xiaoming immediately raises his hand: "I know! It's the students with IDs 1 and 2." But what if there are tens of millions of records? And what if we want to look someone up by name?
Term Dictionary
To find a term quickly, Elasticsearch sorts all the terms and uses binary search, which is O(log n), just like looking a word up in a dictionary. This sorted list of terms is the term dictionary. At this point it looks a lot like what a traditional database does with a B-tree, so why do we claim it is faster?
Term Index
A B-tree improves query performance by reducing the number of disk seeks; Elasticsearch uses the same idea but goes further: look the term up directly in memory without touching the disk at all. But if there are too many terms, the term dictionary becomes too large to fit in memory, hence the term index. Like the index pages of a dictionary that tell you on which page the words starting with "a" begin, the term index can be understood as a tree:
This tree does not contain all the terms; it contains only some of their prefixes. The term index lets you quickly locate an offset in the term dictionary, from which you then scan sequentially.
So the term index does not need to store every term, only some prefixes and their mapping to blocks of the term dictionary. Combined with FST (finite state transducer) compression, the term index becomes small enough to cache entirely in memory. A lookup goes from the term index to the position of the corresponding term dictionary block, and only then to disk to find the actual term, which greatly reduces the number of random disk reads.
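A minimal sketch of that idea, with made-up terms and block size: a small sorted index of block-leading terms lives "in memory", and only the one matching block of the "on-disk" term dictionary is then scanned:

```python
# Sketch only, not Lucene's actual layout: term dictionary blocks + term index.
import bisect

terms = sorted(["bill", "john", "kate", "male", "female", "sports", "music"])
BLOCK_SIZE = 2

# "On disk": the sorted term dictionary, split into blocks.
blocks = [terms[i:i + BLOCK_SIZE] for i in range(0, len(terms), BLOCK_SIZE)]
# "In memory": the term index -- only the first term of each block (a prefix would do).
term_index = [block[0] for block in blocks]

def lookup(term):
    # Binary search the small in-memory index to find the candidate block...
    block_no = bisect.bisect_right(term_index, term) - 1
    if block_no < 0:
        return None
    # ...then scan just that one block (one disk read in the real thing).
    return term if term in blocks[block_no] else None

print(lookup("kate"))   # -> kate
print(lookup("zelda"))  # -> None
```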
At this point inquisitive Xiaoming raises his hand again: "So what on earth is this FST thing?"
Clearly Xiaoming, like me, was one of those students who didn't pay attention in class; the data structures teacher certainly explained what an FST is. But I've forgotten too, so here is a quick make-up lesson:
FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output.
Suppose we now want to map mop, moth, pop, star, stop and top (term prefixes held in the term index) to the ordinals 0, 1, 2, 3, 4, 5 (the positions of blocks in the term dictionary). The simplest way is to define a Map<String, Integer> and look each word up directly, but is there a way that uses less memory? The answer: FST. (The theory is linked here, though I suspect 99% of readers won't read it carefully.)
- A circle (⭕) represents a state.
- An arrow (→) represents a state transition; the letter/number above it is the input character and its weight.
- The words are broken into single characters and laid out along the circles and arrows of the graph; weights of 0 are not shown. Where a branch occurs after a state, the arc is labelled with a weight, and the weights along a word's entire path add up to that word's ordinal.
FSTs store all the terms in byte form, which effectively reduces the storage needed, making the term index small enough to fit in memory; the trade-off is that lookups need more CPU.
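To make the path-weight idea concrete, here is a hand-built toy transducer for the six example terms. The states, arcs and weights below are my own illustration, not the output of a real FST builder (Lucene constructs these automatically and also shares suffixes):

```python
# Toy transducer: the ordinal of a term is the sum of the arc weights along its path.
ARCS = {
    ("start", "m"): ("m1", 0), ("m1", "o"): ("m2", 0),
    ("m2", "p"): ("end", 0),                           # mop  -> 0
    ("m2", "t"): ("m3", 1), ("m3", "h"): ("end", 0),   # moth -> 0+1 = 1
    ("start", "p"): ("p1", 2), ("p1", "o"): ("p2", 0),
    ("p2", "p"): ("end", 0),                           # pop  -> 2
    ("start", "s"): ("s1", 3), ("s1", "t"): ("s2", 0),
    ("s2", "a"): ("s3", 0), ("s3", "r"): ("end", 0),   # star -> 3
    ("s2", "o"): ("s4", 1), ("s4", "p"): ("end", 0),   # stop -> 3+1 = 4
    ("start", "t"): ("t1", 5), ("t1", "o"): ("t2", 0),
    ("t2", "p"): ("end", 0),                           # top  -> 5
}

def ordinal(term):
    state, total = "start", 0
    for ch in term:
        state, weight = ARCS[(state, ch)]
        total += weight
    return total

for t in ["mop", "moth", "pop", "star", "stop", "top"]:
    print(t, ordinal(t))   # 0 1 2 3 4 5
```

Note how shared prefixes (mo-, st-) are stored only once; that sharing is where the memory savings come from.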
It gets even better from here; if you're getting tired, go grab a cup of coffee...
Compression techniques
Besides compressing the term index with FSTs as described above, Elasticsearch also has compression tricks for posting lists. Xiaoming, having finished his coffee, raises his hand again: "Doesn't a posting list only store document IDs? Does that still need compressing?"
Well, go back to the first example. Suppose Elasticsearch also needs to index the students' sex (at this point the traditional relational database is already crying in the bathroom...). If there are tens of millions of students and the world only has two genders, each posting list will contain at least millions of document IDs. How does Elasticsearch compress these document IDs efficiently?
Frame of Reference
Delta encoding: turn large numbers into small ones, then store them in as few bytes as possible.
First of all, Elasticsearch requires posting lists to be sorted (again, to improve search performance; one more wilful requirement that simply has to be satisfied). A welcome side effect is that sorted lists are easy to compress, as the illustration below shows:
As long as your maths wasn't taught by your P.E. teacher, this compression scheme should be easy to follow.
The principle: store only the deltas between consecutive values, so the original large numbers become small ones; then work out how many bits each block actually needs and pack the values tightly, finally storing them as bytes, rather than carelessly spending a full int (4 bytes) on a value that is just 2.
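A toy version of that delta-plus-bit-packing idea (the posting list is made up; Lucene's real codec works on fixed-size blocks and is considerably more involved):

```python
# Sketch of Frame of Reference: deltas of a sorted list, packed with just enough bits.
def delta_encode(postings):
    # Posting lists are kept sorted, so the deltas are small positive numbers.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def pack(deltas):
    # Use just enough bits for the largest delta in the block.
    bits = max(d.bit_length() for d in deltas)
    packed = 0
    for d in deltas:
        packed = (packed << bits) | d
    n_bytes = (bits * len(deltas) + 7) // 8
    return bits, packed.to_bytes(n_bytes, "big")

postings = [73, 300, 302, 332, 343, 372]   # assumed example list
deltas = delta_encode(postings)            # [73, 227, 2, 30, 11, 29]
bits, blob = pack(deltas)
print(deltas, bits, len(blob))             # 8 bits per value -> 6 bytes instead of 24
```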
Roaring bitmaps
To talk about Roaring bitmaps, we have to start with bitmaps. A bitmap is a data structure. Suppose we have this posting list:
[1,3,4,7,10]
The corresponding bitmap is:
[1,0,1,1,0,0,1,0,0,1]
Very intuitive: 0/1 marks whether a value is present; for example the value 10 corresponds to the 10th bit, which is set to 1. This way one byte can represent 8 document IDs. Older versions of Lucene (before 5.0) compressed posting lists this way, but it is still not efficient enough: with 100 million documents you need 12.5 MB of storage, and that is for just one indexed field (we usually have many). So someone came up with a more efficient data structure: Roaring bitmaps.
The drawback of a plain bitmap is that its storage grows linearly with the number of documents; Roaring bitmaps break this spell by exploiting some properties of powers of two:
The posting list is split into blocks of 65536: the first block covers document IDs 0~65535, the second 65536~131071, and so on. Each ID is then represented as a <quotient, remainder> pair (dividing by 65536), so the values stored within each block all fall in the range 0~65535. The rest is easy: since the values within a block can never grow unboundedly, we can pick the most space-efficient way to store them.
At this point careful Xiaoming raises his hand yet again: "Why the 65535 boundary?"
In the programmer's world, besides 1024, 65535 is another classic value, because it equals 2^16 - 1: the largest number that fits in 2 bytes, one short. Notice the rule "If a block has more than 4096 values, encode as a bit set, and otherwise as a simple array using 2 bytes per value": large blocks are stored frugally as a bitset, while small blocks don't bother; 2 bytes per value is fine, just keep them in a short[].
So why use 4096 as the threshold between large and small blocks?
My personal take: the programmer's world is binary. 4096 values × 2 bytes = 8192 bytes, which is exactly the size of a full bitset for one block (65536 bits = 8 KB). Beyond 4096 values the bitset is no longer bigger than the short array, and a small block of that size can still be read out in one sequential disk access, so the bitset becomes the better choice; below 4096 the plain array wins.
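A rough sketch of the Roaring idea (not the real library): split each ID into its high and low 16 bits, then store each block's low parts as a small array or, past the 4096 threshold, as an 8 KB bitset:

```python
# Toy Roaring-style encoder: <quotient, remainder> split plus container choice.
from collections import defaultdict

ARRAY_LIMIT = 4096  # the threshold quoted above

def roaring_encode(postings):
    blocks = defaultdict(list)
    for doc_id in sorted(postings):
        high, low = divmod(doc_id, 65536)    # <quotient, remainder>
        blocks[high].append(low)
    containers = {}
    for high, lows in blocks.items():
        if len(lows) > ARRAY_LIMIT:
            bitset = bytearray(65536 // 8)   # 8 KB dense container
            for low in lows:
                bitset[low // 8] |= 1 << (low % 8)
            containers[high] = ("bitmap", bytes(bitset))
        else:
            containers[high] = ("array", lows)  # 2 bytes per value in the real thing
    return containers

print(roaring_encode([1, 3, 100000, 100001]))
# {0: ('array', [1, 3]), 1: ('array', [34464, 34465])}
```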
Querying multiple fields together
So far we have talked about indexing a single field for quite a while. When a query has to combine several indexed fields, how do the inverted indexes still deliver fast results? Two options:
- Use the skip list data structure to walk the posting lists and intersect ("and") them quickly, or
- Use the bitsets mentioned above and AND them together bitwise.
First, the skip list data structure:
Take a sorted linked list as level 0 and promote some of its elements to level 1 and level 2; the higher the level, the fewer elements are promoted. To search, start at the highest level and work downwards. For example, to find 55 we first reach 31 on level 2, then 47 on level 1, and finally 55 on level 0: three hops in total. The search efficiency is comparable to a binary tree, bought with a certain amount of space redundancy.
Suppose we have the following three posting lists that need to be intersected for a combined query:
With skip lists, for each ID in the shortest posting list we check whether it also exists in the other two posting lists; whatever survives is the intersection.
With bitsets it is even more intuitive: AND them together bitwise and the result is the intersection.
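Both approaches in miniature, using three made-up, already-sorted posting lists (plain membership tests stand in for the skip-list probes):

```python
# Sketch: intersect posting lists by probing, and by bitwise AND of bitsets.
a = [2, 3, 5, 7, 11]
b = [1, 3, 5, 7, 9, 11]
c = [3, 7, 11, 13]

# 1) Walk the shortest list and probe the others (a skip list makes the probes cheap).
shortest, *others = sorted([a, b, c], key=len)
by_probe = [x for x in shortest if all(x in o for o in others)]

# 2) Turn each list into a bitset and AND them together.
def to_bits(postings):
    bits = 0
    for doc_id in postings:
        bits |= 1 << doc_id
    return bits

anded = to_bits(a) & to_bits(b) & to_bits(c)
by_bits = [i for i in range(anded.bit_length()) if anded >> i & 1]

print(by_probe, by_bits)  # [3, 7, 11] [3, 7, 11]
```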
Summary and thoughts
Elasticsearch's indexing philosophy:
Move the data that lives on disk into memory as far as possible, reduce the number of random disk reads (and exploit sequential reads where it can), and combine this with various ingenious compression algorithms; in short, use memory with an almost ruthless frugality.
So when you use Elasticsearch for indexing, keep in mind:
- Fields that do not need to be indexed must be declared explicitly, because by default every field is indexed automatically.
- Likewise, for string fields, whether they need analysis must be defined explicitly, because by default they are analyzed.
- Choosing regular, ordered document IDs matters; IDs that are too random (such as Java UUIDs) hurt query performance.
On the last point, I believe several factors are at play:
One factor (perhaps not the most important): the compression algorithms described above work by compressing the large number of IDs in each posting list; if the IDs are sequential, share a common prefix, or follow some other pattern, the compression ratio will be higher.
Another factor, probably the one that affects query performance most: the final step of taking the IDs from the posting list and fetching the document data from disk. Because Elasticsearch stores data in segments, how efficiently an ID from such a huge term range can be located within the right segment directly determines the performance of that last step. If the IDs follow a pattern, segments that cannot contain a given ID can be skipped quickly, avoiding unnecessary disk reads. For more detail, see the article on how to choose an efficient global ID scheme (the comments there are great too).
I'll share more as the actual development and tuning work progresses; stay tuned!
Elasticsearch Study Notes