Both MySQL and Lucene can index data and query it through an index. One is a relational database; the other is the core library on which search engines (Solr, Elasticsearch) are built. How do their indexes differ? An earlier post, "SOLR and MySQL query performance comparison", only compared query performance and did not explain the underlying principles. This article gives a brief analysis of the differences between the two indexes.
MySQL Index Implementation
In MySQL, indexes are a storage-engine-level concept, and different storage engines implement them differently. This article mainly discusses how the MyISAM and InnoDB storage engines implement indexes.
MyISAM Index Implementation
The MyISAM engine uses a B+tree as its index structure, and the data field of each leaf node holds the address of a data record. The figure below is a schematic diagram of the MyISAM index:
Figure 1 shows the primary index (primary key) of a MyISAM table. You can see that a MyISAM index file stores only the addresses of the data records. In MyISAM, the primary index and the secondary indexes (secondary keys) are structurally identical, except that the primary index requires unique keys while secondary index keys may repeat. The leaf nodes of the B+tree together contain all the keys, arranged in ascending order.
The index of a MyISAM table is stored separately from its data: the index is saved in the "table name.MYI" file, and the data is saved in the "table name.MYD" file.
The MyISAM index is also called "non-clustered", a name that distinguishes it from InnoDB's clustered index.
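The separation of index and data described above can be illustrated with a minimal sketch. It is not MyISAM's actual code: a sorted Python list stands in for the B+tree leaf level, a list position stands in for the record address, and the names `data_file`, `build_index` and `lookup` as well as the sample rows are hypothetical.

```python
import bisect

# Stand-in for the .MYD data file: rows kept in insertion order.
# The list position plays the role of the record address.
data_file = [
    (30, "Eric"),
    (15, "Bob"),
    (20, "Jim"),
    (18, "Alice"),
]

def build_index(rows, column):
    """Return sorted (key, address) pairs, a stand-in for B+tree leaf entries."""
    return sorted((row[column], addr) for addr, row in enumerate(rows))

def lookup(index, rows, key):
    """Search the index for key, then follow each address into the data file."""
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [rows[addr] for _, addr in index[lo:hi]]

primary_index = build_index(data_file, 0)    # unique keys
secondary_index = build_index(data_file, 1)  # same structure, keys may repeat

print(lookup(primary_index, data_file, 20))
print(lookup(secondary_index, data_file, "Alice"))
```

Both indexes have the same shape and both point straight at a record address, which is the sense in which the primary and secondary indexes are structurally identical.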
InnoDB Index Implementation
Although InnoDB also uses a B+tree as its index structure, its implementation is very different from MyISAM's.
The first major difference is that the InnoDB data file is itself the index file. As described above, the MyISAM index file and data file are separate, and the index file holds only the addresses of the data records. In InnoDB, the table data file is itself an index structure organized as a B+tree, and the data field of each leaf node holds a complete data record. The key of this index is the table's primary key, so the InnoDB table data file is itself the primary index.
Figure 2 shows the InnoDB primary index (which is also the data file); you can see that the leaf nodes contain the complete data records. This kind of index is called a clustered index. Because the InnoDB data file itself is clustered by primary key, InnoDB requires every table to have a primary key (MyISAM does not). If none is explicitly specified, MySQL automatically selects a column that uniquely identifies each data record as the primary key. If no such column exists, MySQL automatically generates a hidden field for the InnoDB table to serve as the primary key; this field is 6 bytes long and of a long integer type.
The second difference from the MyISAM index is that the data field of an InnoDB secondary index stores the primary key value of the corresponding record instead of its address. In other words, every InnoDB secondary index uses the primary key as its data field. For example, Figure 3 is a secondary index defined on COL3:
Here the ASCII codes of the English characters are used as the comparison criterion. This clustered index implementation makes searching by primary key very efficient, but a secondary index search has to traverse two indexes: first the secondary index to obtain the primary key, and then the primary index to fetch the record by that primary key.
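The two-step lookup can be sketched as follows. This is only an illustration under simplifying assumptions: a dict stands in for the clustered B+tree, a sorted list stands in for the secondary index, and the names `clustered`, `secondary`, `find_by_name` and the sample rows are hypothetical.

```python
import bisect

# Clustered index stand-in: primary key -> complete row.
# In InnoDB this structure is the data file itself.
clustered = {
    15: (15, 34, "Bob"),
    18: (18, 77, "Alice"),
    20: (20, 5, "Jim"),
    30: (30, 91, "Eric"),
}

# Secondary index on the third column: sorted (value, primary_key) pairs.
# It stores primary keys, not record addresses.
secondary = sorted((row[2], pk) for pk, row in clustered.items())

def find_by_name(name):
    """Search the secondary index first, then the clustered index."""
    keys = [k for k, _ in secondary]
    lo = bisect.bisect_left(keys, name)
    hi = bisect.bisect_right(keys, name)
    primary_keys = [pk for _, pk in secondary[lo:hi]]  # lookup 1
    return [clustered[pk] for pk in primary_keys]      # lookup 2

print(find_by_name("Alice"))
```

A search by primary key needs only the second step, which is why it is cheaper than a search through a secondary index.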
Understanding how different storage engines implement indexes is very helpful for using and optimizing indexes correctly. For example, knowing InnoDB's index implementation makes it easy to see why overly long fields are not recommended as primary keys: every secondary index stores the primary key, so a long primary key makes every secondary index large. Likewise, using a non-monotonic field as the primary key is a poor idea in InnoDB: the data file itself is a B+tree, and inserting new records with non-monotonic keys forces the data file to split and adjust frequently in order to maintain the B+tree properties, which is inefficient. An auto-increment field is a good choice of primary key.
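The cost of non-monotonic keys can be made concrete with a toy model. It is a deliberate simplification, not InnoDB's page management: pages are short sorted lists, a full page is split in half, and an insert past the right edge of the last page simply starts a new page. `PAGE_CAPACITY` and `insert_all` are hypothetical names.

```python
import bisect
import random

PAGE_CAPACITY = 4  # hypothetical; real pages hold far more records

def insert_all(keys):
    """Insert distinct keys into ordered pages; return (pages, records_moved)."""
    pages = [[]]
    moved = 0
    for key in keys:
        # Find the last page whose first key is <= key (page 0 otherwise).
        firsts = [p[0] for p in pages if p]
        idx = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[idx]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Append at the right edge: start a new page, nothing is moved.
            pages.append([key])
        else:
            # Split the full page; its upper half is copied to a new page.
            mid = PAGE_CAPACITY // 2
            upper = page[mid:]
            del page[mid:]
            pages.insert(idx + 1, upper)
            moved += len(upper)
            bisect.insort(upper if key >= upper[0] else page, key)
    return pages, moved

sequential_keys = list(range(1, 101))
random_keys = random.Random(42).sample(sequential_keys, len(sequential_keys))

_, moved_sequential = insert_all(sequential_keys)
_, moved_random = insert_all(random_keys)
print(moved_sequential, moved_random)
```

In this model, ascending keys never move an existing record, while shuffled keys trigger splits that copy records between pages.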
There are many articles on how MySQL implements its indexes, and the discussion above also draws on "Data structure and algorithmic principles behind MySQL index". Now let us look at how Lucene's index works.
Lucene Index Implementation
Lucene's index is not organized as a B+tree but as an inverted index, which consists of the term index, the term dictionary, and the posting list.
Where there is an inverted index, there is also a forward index. The forward index maps each document to its fields:
DocID | Name | Sex | Age
1 | Jack | Man | 18
2 | Lucy | Woman | 17
3 | Peter | Man | 17
The inverted index maps each field value to the documents that contain it:
Sex field:
Man | [1,3]
Woman | [2]
Age field:
17 | [2,3]
18 | [1]
Values such as Jack and Lucy, or 17 and 18, are called terms, and [1,3] is a posting list. A posting list is an array of ints that stores the IDs of all documents matching a term. So what are the term index and the term dictionary?
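The relationship between the forward index, terms and posting lists can be sketched in a few lines. This is a minimal illustration built from the three sample documents, not how Lucene stores its data; `forward_index` and `build_inverted_index` are hypothetical names.

```python
from collections import defaultdict

# Forward index: document ID -> its fields.
forward_index = {
    1: {"Name": "Jack", "Sex": "Man", "Age": 18},
    2: {"Name": "Lucy", "Sex": "Woman", "Age": 17},
    3: {"Name": "Peter", "Sex": "Man", "Age": 17},
}

def build_inverted_index(docs):
    """Return field -> term -> posting list (sorted document IDs)."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, term in docs[doc_id].items():
            inverted[field][term].append(doc_id)
    return inverted

inverted_index = build_inverted_index(forward_index)
print(inverted_index["Sex"]["Man"])  # [1, 3]
print(inverted_index["Age"][17])     # [2, 3]
```

Visiting the documents in ascending ID order keeps every posting list sorted, which makes later merging of posting lists straightforward.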
Suppose the Name field has many terms, for example: Carla, Sara, Elin, Ada, Patty, Kate, Selena
If they are kept in this order, finding a particular term is slow: the terms are unsorted, so all of them must be scanned to find it. After sorting, they become: Ada, Carla, Elin, Kate, Patty, Sara, Selena
Sorted terms can be searched with binary search, which finds the target term much faster than a full traversal. The way these terms are organized is the term dictionary, that is, a dictionary of terms. With the term dictionary, the target term can be found with a relatively small number of comparisons and disk reads. Random disk reads are still very expensive, however, so to read the disk as little as possible, some data must be cached in memory. But the whole term dictionary is too big to fit entirely in memory, which is where the term index comes in. The term index is somewhat like the chapter table of a large dictionary. For example:
Terms beginning with A ............ page xxx
Terms beginning with C ............ page xxx
Terms beginning with E ............ page xxx
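The chapter-table idea can be sketched with the sample names above. This is a simplified illustration of the analogy only: the term dictionary is a sorted in-memory list rather than blocks on disk, and the term index is keyed by the first letter. `find_binary`, `term_index` and `find_via_index` are hypothetical names, and terms are assumed to be non-empty strings.

```python
import bisect

# Term dictionary: all terms of the Name field, sorted.
term_dictionary = sorted(
    ["Carla", "Sara", "Elin", "Ada", "Patty", "Kate", "Selena"]
)

def find_binary(term):
    """Binary search over the whole sorted dictionary; -1 if absent."""
    i = bisect.bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return i
    return -1

# Term index as a chapter table: first letter -> offset of the first
# term starting with that letter. It is small enough to keep in memory.
term_index = {}
for offset, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], offset)

def find_via_index(term):
    """Jump to the chapter for the first letter, then scan forward."""
    offset = term_index.get(term[0])
    if offset is None:
        return -1
    for i in range(offset, len(term_dictionary)):
        candidate = term_dictionary[i]
        if candidate[0] != term[0]:
            break  # left the chapter without finding the term
        if candidate == term:
            return i
    return -1

print(find_binary("Kate"), find_via_index("Selena"))
```

Only the small chapter table has to stay in memory; the scan touches just the part of the dictionary that the table points to.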
If every term consisted of English characters, the term index might really be a table of the 26 English letters. In practice, however, terms are not necessarily English characters; a term can be any byte array. Nor are terms spread evenly across the 26 letters: there may be no term at all beginning with x, while terms beginning with s are especially numerous. The actual term index is a trie:
The example is a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". The tree does not contain every term; it contains some prefixes of the terms. The term index lets you quickly locate an offset in the term dictionary and then scan sequentially from that position. Combined with some compression techniques (to learn more, search for Lucene Finite State Transducers), the term index can be a small fraction of the size of all the terms, which makes it possible to cache the entire term index in memory. The overall effect looks like this:
From the term index to the term dictionary and then to the posting list, the process of finding results by a keyword in a field is now fairly clear. Combining the posting lists of several keywords with AND (intersection) or OR (union) is also simple.
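Combining posting lists can be sketched with a plain two-pointer merge over sorted lists of document IDs. This shows the general technique, not Lucene's own implementation; `intersect` and `union` are hypothetical names, and the two lists are the posting lists of Sex=Man and Age=17 from the example above.

```python
def intersect(a, b):
    """AND: document IDs present in both sorted posting lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def union(a, b):
    """OR: document IDs present in either sorted posting list."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result

sex_man = [1, 3]
age_17 = [2, 3]
print(intersect(sex_man, age_17))  # [3]
print(union(sex_man, age_17))      # [1, 2, 3]
```

Because both inputs are sorted, each merge reads every list once and the result is again a sorted posting list that can be combined with further terms.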
Comparing this with MySQL's B+tree indexing, we find:
1) Lucene's term index and term dictionary together play the role of MySQL's B+tree: they provide an index on the keyword (key). Lucene's inverted index can be searched faster than MySQL's B+tree.
2) The term index is held in memory as an FST (finite state transducer), which is very memory-efficient. Lucene can therefore locate a keyword very quickly, whereas a MySQL B+tree lookup needs disk reads for its comparisons.
3) The term dictionary is stored on disk in blocks, and each block uses shared-prefix compression: for example, if every word in a block starts with "ab", the "ab" can be omitted. The term dictionary therefore uses less disk space than a B+tree.
4) Lucene uses different indexing methods for different data types. The analysis above applies to string fields; for int there is the TrieIntField type, and for latitude and longitude Geohash encoding can be used.
5) In MySQL, separate indexes on two fields generally cannot be used together, so a composite index must be created for queries that combine the fields; Lucene can answer any AND or OR combination using its per-field indexes.
Reference:
Data structure and algorithmic principles behind MySQL index: http://blog.codinglabs.org/articles/theory-of-mysql-index.html
http://stackoverflow.com/questions/4628571/solr-date-field-tdate-vs-date
http://lucene.apache.org/core/
Comparative analysis of MySQL and Lucene indexes