Comparison and Analysis of MySQL and Lucene Indexes

Last Update:2016-10-20 Source: Internet

Author: User


Both MySQL and Lucene can index data and answer queries through those indexes. One is a relational database; the other is the core class library on which search engines such as Solr and ElasticSearch are built. What is the difference between the two kinds of index? I previously wrote an article, "Comparison of query performance between Solr and MySQL", which simply compares query performance but does not explain the internal principles. This article briefly analyzes the differences between the two indexes.

MySQL index implementation

In MySQL, indexes are a storage-engine-level concept, and different storage engines implement them differently. This article discusses the index implementations of the MyISAM and InnoDB storage engines.

MyISAM index implementation

The MyISAM engine uses a B+Tree as its index structure. The data field of each leaf node stores the address of a data record. The following figure illustrates the principle of the MyISAM index:

Figure 1 shows the primary index (primary key) of a MyISAM table. The MyISAM index file stores only the addresses of data records. In MyISAM, the primary index and the secondary indexes (secondary keys) are structurally identical; the only difference is that primary index keys must be unique, whereas secondary index keys may repeat. The leaf nodes of the B+Tree together contain all keys in ascending order.

The indexes and data of a MyISAM table are stored separately: the indexes in the "<table name>.MYI" file and the data in the "<table name>.MYD" file.

MyISAM's indexing method is also called "non-clustered", to distinguish it from InnoDB's clustered index.
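The separation between index and data can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the lists `data_file` and `index_file` and the function `myisam_lookup` are hypothetical stand-ins, a sorted list replaces the B+Tree, and a list position replaces a real file address.

```python
import bisect

# Stand-in for the .MYD data file: records stored in insertion order.
data_file = [
    (15, "Bob"),    # address 0
    (3, "Alice"),   # address 1
    (27, "Carol"),  # address 2
]

# Stand-in for the .MYI index file: sorted (key, address) pairs,
# playing the role of the B+Tree leaf level.
index_file = sorted((row[0], addr) for addr, row in enumerate(data_file))
index_keys = [key for key, _ in index_file]


def myisam_lookup(key):
    """Find the key in the index, then follow the address into the data file."""
    pos = bisect.bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None
    _, address = index_file[pos]
    return data_file[address]


print(myisam_lookup(27))  # the index yields address 2, which holds Carol's record
```

The index never holds the record itself, only where to find it, which is the sense in which the index is "non-clustered".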

InnoDB index implementation

Although InnoDB also uses a B+Tree as its index structure, its implementation differs from MyISAM's.

The first major difference is that the InnoDB data file is itself an index file. As mentioned above, the MyISAM index file is separate from the data file and stores only the addresses of data records. In InnoDB, the table data file is itself an index structure organized as a B+Tree, and the data field of each leaf node stores a complete data record. The key of this index is the primary key of the table, so the InnoDB table data file is the primary index.

Figure 2 shows an InnoDB primary index (which is also the data file). The leaf nodes contain complete data records. This kind of index is called a clustered index. Because InnoDB data files are clustered by primary key, InnoDB requires every table to have a primary key (MyISAM does not). If none is explicitly specified, MySQL automatically selects a column that uniquely identifies each record as the primary key (in practice, the first unique index whose columns are all NOT NULL). If no such column exists, MySQL generates an implicit field for the InnoDB table to serve as the primary key; this field is 6 bytes long and its type is long integer.

The second difference from MyISAM is that the data field of an InnoDB secondary index stores the primary key value of the corresponding record rather than its address. In other words, every InnoDB secondary index uses the primary key as its data field. For example, Figure 3 shows a secondary index defined on Col3:

Here the ASCII codes of English characters are used as the comparison criterion. A clustered index makes searches by primary key very efficient, but a search through a secondary index requires two index lookups: first search the secondary index to obtain the primary key, then use the primary key to search the primary index for the record.
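The two-step lookup can be sketched as follows. This is a minimal illustration, not InnoDB's implementation: plain dictionaries stand in for the B+Trees, and the table contents and the names `clustered_index`, `secondary_index_name`, `find_by_primary_key`, and `find_by_name` are all hypothetical.

```python
# Clustered (primary) index: primary key -> complete row.
# In InnoDB this is the table data file itself.
clustered_index = {
    15: {"id": 15, "age": 34, "name": "Bob"},
    18: {"id": 18, "age": 77, "name": "Alice"},
    20: {"id": 20, "age": 5, "name": "Jim"},
    30: {"id": 30, "age": 91, "name": "Alice"},
}

# Secondary index on name: name -> primary keys (not record addresses).
secondary_index_name = {
    "Alice": [18, 30],
    "Bob": [15],
    "Jim": [20],
}


def find_by_primary_key(pk):
    """One lookup: the clustered index leaf already holds the whole row."""
    return clustered_index.get(pk)


def find_by_name(name):
    """Two lookups: secondary index -> primary keys -> clustered index."""
    primary_keys = secondary_index_name.get(name, [])   # lookup 1
    return [clustered_index[pk] for pk in primary_keys]  # lookup 2


print(find_by_primary_key(20))
print(find_by_name("Alice"))
```

Because the secondary index stores primary key values, every secondary lookup pays for a second search in the clustered index, and every secondary index entry grows with the size of the primary key.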

Understanding how different storage engines implement indexes is very helpful for using and optimizing indexes correctly. For example, once you know how InnoDB implements its indexes, it is easy to see why very long fields are not recommended as primary keys: every secondary index stores the primary key, so a long primary key makes every secondary index large. It is likewise a poor idea to use a non-monotonic field as the primary key in InnoDB. Because the InnoDB data file is itself a B+Tree, inserting records with non-monotonic primary keys forces frequent page splits and reorganization of the data file to preserve the B+Tree properties, which is very inefficient. An auto-increment field is a good choice of primary key.
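The cost of non-monotonic keys can be illustrated with a toy model of leaf pages. The model below is an assumption made for illustration and is not InnoDB's actual algorithm: pages hold at most `PAGE_CAPACITY` keys, a key larger than everything stored so far opens a fresh page when the last page is full, and any other insert into a full page splits that page in half. The name `insert_all` and the two key sequences are hypothetical.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical tiny page size, for illustration only


def insert_all(keys):
    """Insert keys into sorted leaf pages; return (page_count, records_moved)."""
    pages = [[]]  # each page is a sorted list of keys
    moved = 0     # records relocated by page splits
    for key in keys:
        # Locate the page whose key range should hold `key`.
        firsts = [page[0] for page in pages[1:]]
        idx = bisect.bisect_right(firsts, key)
        page = pages[idx]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end: open a fresh page, move nothing.
            pages.append([key])
            continue
        # The page is full and the key falls inside it: split in half.
        mid = PAGE_CAPACITY // 2
        right = page[mid:]
        del page[mid:]
        moved += len(right)
        pages.insert(idx + 1, right)
        bisect.insort(page if key < right[0] else right, key)
    return len(pages), moved


sequential_keys = list(range(1, 101))                       # auto-increment style
scattered_keys = [(i * 37) % 101 for i in range(1, 101)]    # same keys, scattered order

print("sequential:", insert_all(sequential_keys))
print("scattered: ", insert_all(scattered_keys))
```

In this model the sequential keys fill pages completely and never move a record, while the scattered order triggers splits that relocate records and leave pages partly empty.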

There are many articles about the implementation of MySQL indexes; the discussion above also draws on "Data structure and algorithm principles behind MySQL indexes". Next, let's look at the indexing principles of Lucene.

Lucene index implementation

Lucene indexes are organized not as B+Trees but as inverted indexes. A Lucene inverted index consists of a term index, a term dictionary, and posting lists.

Where there is an inverted index, there is also a forward index. The forward index is the direct mapping from a document to its fields:

DocID   Name    Sex     Age
1       Jack    Male    17
2       Lucy    Female  18
3       Peter   Male    18

The inverted index is the reverse mapping, from each field value to the documents that contain it:

Sex field:

Male    [1, 3]
Female  [2]

Age field:

17      [1]
18      [2, 3]
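Turning a forward index into an inverted index can be sketched as below. The document IDs and ages are inferred from the lists in the text, and the names `forward_index` and `build_inverted_index` are hypothetical; Lucene's real on-disk structures are far more elaborate than nested dictionaries.

```python
from collections import defaultdict

# Forward index: document ID -> fields of that document.
forward_index = {
    1: {"name": "Jack", "sex": "Male", "age": 17},
    2: {"name": "Lucy", "sex": "Female", "age": 18},
    3: {"name": "Peter", "sex": "Male", "age": 18},
}


def build_inverted_index(docs):
    """Map each field to {value: sorted list of IDs of documents holding it}."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):  # ascending IDs keep every list sorted
        for field, value in docs[doc_id].items():
            inverted[field][value].append(doc_id)
    return inverted


inverted_index = build_inverted_index(forward_index)
print(inverted_index["sex"]["Male"])  # [1, 3]
print(inverted_index["age"][18])      # [2, 3]
```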

Values such as Jack and Lucy, or 17 and 18, are called terms, and [1, 3] is a posting list. A posting list is an array of integers that stores the IDs of all documents matching a specific term. So what are the term index and the term dictionary?

Suppose the name field has many terms, such as: Carla, Sara, Elin, Ada, Patty, Kate, Selena

If the terms are kept in this order, finding a specific term is very slow: because they are not sorted, every term must be scanned. After sorting, they become: Ada, Carla, Elin, Kate, Patty, Sara, Selena

Now binary search can find the target term much faster than a full traversal. This sorted organization of terms is the term dictionary. With a term dictionary, the target can be found with relatively few comparisons and disk reads. However, random disk reads are still very expensive, so it is desirable to cache some data in memory and read the disk as little as possible. The entire term dictionary is too large to fit in memory, which is why the term index exists. The term index is somewhat like the index pages of a large dictionary. For example:

Terms starting with A ................ page xxx

Terms starting with C ................ page xxx

Terms starting with E ................ page xxx

If all terms consisted of English letters, the term index could simply be a table with 26 entries, one per letter. In practice, however, terms are not necessarily English; a term can be any byte array. Moreover, terms are not evenly distributed over the 26 letters: there may be no terms starting with x and a great many starting with s. The actual term index is therefore a trie:

The example is a trie that contains "A", "to", "tea", "ted", "ten", "I", "in", and "inn". The term index does not contain all the terms; it contains some prefixes of terms. The term index quickly locates an offset in the term dictionary, and the search then continues from that position. With compression techniques (for more information, search for Lucene Finite State Transducers), the term index can be only about a tenth of the size of all terms, which makes it possible to cache the entire term index in memory.
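The interplay between term index and term dictionary can be sketched as follows. This is a deliberately simplified stand-in: a dictionary keyed by first letter replaces the trie/FST, a Python list replaces the on-disk term dictionary, and the names `term_dictionary`, `term_index`, and `find_term` are hypothetical.

```python
import bisect

# Term dictionary: all terms, sorted (on disk in a real system).
term_dictionary = ["Ada", "Carla", "Elin", "Kate", "Patty", "Sara", "Selena"]

# Term index: first letter -> offset of the first term with that letter.
# Small enough to keep in memory; it holds prefixes, not whole terms.
term_index = {}
for offset, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], offset)


def find_term(term):
    """Return the position of `term` in the term dictionary, or -1."""
    if not term:
        return -1
    start = term_index.get(term[0])
    if start is None:
        return -1  # no term shares this prefix, so no disk read is needed
    # Search the dictionary only from the offset the term index gave us.
    pos = bisect.bisect_left(term_dictionary, term, lo=start)
    if pos < len(term_dictionary) and term_dictionary[pos] == term:
        return pos
    return -1


print(find_term("Selena"))  # 6
print(find_term("Xavier"))  # -1
```

The in-memory index answers "where should I start looking" (or "nowhere"), so the expensive dictionary access is limited to a small region.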

The overall effect is as follows:

Going from the term index to the term dictionary and then to the posting list makes it clear how a query on a field's keyword produces its result. Queries with multiple keywords are also easy: AND and OR are computed as the intersection and union of the corresponding posting lists.
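Combining posting lists can be sketched with a standard two-pointer merge over sorted lists. The function names `intersect` and `union` are hypothetical, and plain Python lists stand in for Lucene's compressed posting lists.

```python
def intersect(a, b):
    """AND: document IDs present in both sorted posting lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: document IDs present in either sorted posting list, no duplicates."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result


male = [1, 3]    # posting list for sex = Male
age_18 = [2, 3]  # posting list for age = 18

print(intersect(male, age_18))  # [3]
print(union(male, age_18))      # [1, 2, 3]
```

Because both inputs are sorted, each operation makes a single pass over the two lists.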

Comparing this with the principles of B+Tree indexes in MySQL, we find that:

1) Lucene's term index and term dictionary together play the role of MySQL's B+Tree: they provide an index over keywords. Lucene's inverted index can be faster than MySQL's B+Tree retrieval.

2) The term index is held in memory as an FST (finite state transducer), which is very memory-efficient. Lucene can therefore locate a keyword very quickly, whereas a MySQL B+Tree lookup may need disk reads to perform its comparisons.

3) The term dictionary is stored on disk in blocks, and each block is compressed by factoring out the common prefix. For example, if every word in a block starts with "Ab", the "Ab" can be omitted. This lets the term dictionary use less disk space than a B+Tree.

4) Lucene uses different indexing methods for different data types. The analysis above applies to string fields; for integers there is TrieIntField, and for longitude and latitude GeoHash encoding can be used.

5) In MySQL, separate indexes created on two fields generally cannot be combined efficiently (the index merge optimization covers only some cases), so a composite index should be created for queries that filter on both fields. Lucene can combine indexes in any AND/OR combination during retrieval.
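The common-prefix compression mentioned in point 3 can be sketched as follows. This is only an illustration of the idea: the names `compress_block` and `decompress_block` and the sample terms are hypothetical, and Lucene's actual block format is considerably more involved.

```python
import os


def compress_block(terms):
    """Store the shared prefix once, followed by each term's remaining suffix."""
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix(terms)
    return prefix, [term[len(prefix):] for term in terms]


def decompress_block(prefix, suffixes):
    """Rebuild the original terms from the prefix and the suffixes."""
    return [prefix + suffix for suffix in suffixes]


block = ["Abandon", "Ability", "About"]
prefix, suffixes = compress_block(block)

print(prefix)    # Ab
print(suffixes)  # ['andon', 'ility', 'out']
```

Since terms in a sorted block tend to share long prefixes, storing the prefix once per block removes much of the repetition.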

That concludes this comparative analysis of MySQL and Lucene indexes. I hope it is helpful to you.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
