Comparison and Analysis of MySQL and Lucene Indexes

Source: Internet
Author: User

Both MySQL and Lucene can index data and answer queries through those indexes. One is a relational database; the other is the core library on which search engines such as Solr and ElasticSearch are built. What is the difference between the two kinds of index? I previously wrote an article, "Comparison of query performance between Solr and MySQL", which simply compares query performance but does not explain the internal principles. This article briefly analyzes the differences between the two indexes.

MySQL index implementation

In MySQL, indexes are a storage-engine-level concept, and different storage engines implement them differently. This article mainly discusses the index implementations of the MyISAM and InnoDB storage engines.

MyISAM index implementation

The MyISAM engine uses a B+Tree as its index structure. The data field of each leaf node stores the address of a data record. Figure 1 illustrates the principle of the MyISAM index.

Figure 1 shows the primary index (primary key) of a MyISAM table. As it shows, a MyISAM index file stores only the addresses of data records. In MyISAM, the primary index and a secondary index (secondary key) have the same structure; the only difference is that primary index keys must be unique, while secondary index keys may repeat. The leaf nodes of the B+Tree together contain all keys in ascending order.

The indexes and data of a MyISAM table are stored separately: the indexes in the "table name.MYI" file and the data in the "table name.MYD" file.

The MyISAM indexing method is also called "non-clustered", a name that distinguishes it from InnoDB's clustered index.
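The separation of index and data can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a plain list stands in for the .MYD data file, a sorted list of (key, address) pairs stands in for the leaf level of the B+Tree in the .MYI file, and the rows and the `lookup` helper are hypothetical.

```python
import bisect

# Stand-in for the .MYD data file: rows kept in insertion order.
# The row contents are made up for illustration.
data_file = [
    (15, "Bob"),
    (3, "Alice"),
    (27, "Carol"),
]

# Stand-in for the .MYI index file: sorted (key, address) pairs, where the
# "address" is simply the row's position in data_file.
primary_index = sorted((row[0], addr) for addr, row in enumerate(data_file))
index_keys = [key for key, _ in primary_index]


def lookup(key):
    """Find the key in the index, then follow the stored address to the row."""
    pos = bisect.bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None
    _, addr = primary_index[pos]
    return data_file[addr]


print(lookup(3))  # (3, 'Alice')
```

The index never holds the row itself, only where to find it, which is the property the term "non-clustered" refers to.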

InnoDB Index implementation

InnoDB also uses a B+Tree as its index structure, but its implementation differs from MyISAM's.

The first major difference is that the InnoDB data file is itself an index file. As mentioned above, a MyISAM index file is separate from the data file and stores only the addresses of data records. In InnoDB, the table data file is itself an index structure organized as a B+Tree, and the data field of each leaf node stores a complete data record. The key of this index is the table's primary key, so the InnoDB table data file is itself the primary index.

Figure 2 shows an InnoDB primary index, which is also the data file. The leaf nodes contain complete data records; an index of this kind is called a clustered index. Because InnoDB data files are clustered by primary key, every InnoDB table must have a primary key (a MyISAM table need not). If none is specified explicitly, MySQL automatically selects a column that uniquely identifies each record (a unique, non-null index) as the primary key. If no such column exists, MySQL automatically generates a hidden field for the InnoDB table to serve as the primary key; this field is 6 bytes long and holds an integer row ID.

The second difference from the MyISAM index is that the data field of an InnoDB secondary index stores the primary key value of the corresponding record rather than its address. In other words, every InnoDB secondary index uses the primary key as its data field. For example, Figure 3 shows a secondary index defined on Col3:

Here the ASCII codes of English characters are used as the comparison criterion. The clustered index makes lookups by primary key very efficient, but a lookup through a secondary index requires two index searches: first search the secondary index to obtain the primary key, then use that primary key to find the record in the primary index.
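The two-step lookup can be sketched as follows. This is a simplified illustration, not InnoDB's actual data structures: Python dicts stand in for the B+Trees, and the table contents and the `find_by_name` helper are made up.

```python
# Clustered (primary) index: primary key -> complete row.
# A dict stands in for the B+Tree; the rows are hypothetical.
clustered_index = {
    15: {"id": 15, "age": 34, "name": "Bob"},
    18: {"id": 18, "age": 77, "name": "Alice"},
    20: {"id": 20, "age": 5, "name": "Jim"},
}

# Secondary index on "name": column value -> primary key values.
# Note that it stores primary keys, not row addresses.
secondary_index_name = {}
for pk, row in clustered_index.items():
    secondary_index_name.setdefault(row["name"], []).append(pk)


def find_by_name(name):
    """Search 1: secondary index -> primary keys. Search 2: primary index -> rows."""
    primary_keys = secondary_index_name.get(name, [])
    return [clustered_index[pk] for pk in primary_keys]


print(find_by_name("Alice"))  # [{'id': 18, 'age': 77, 'name': 'Alice'}]
```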

Understanding how different storage engines implement indexes is very helpful for using and optimizing indexes correctly. For example, once you know how InnoDB indexes are implemented, it is easy to see why very long fields are not recommended as primary keys: every secondary index stores the primary key, so a long primary key makes every secondary index larger. Likewise, using a non-monotonic field as the primary key in InnoDB is a poor idea. Because the InnoDB data file is itself a B+Tree, inserting records with non-monotonic primary keys forces frequent page splits and reorganization of the data file to maintain the B+Tree properties, which is very inefficient. An auto-increment field is a good choice of primary key.
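The cost of non-monotonic keys can be illustrated with a toy model of sorted leaf pages. Everything here is an assumption made for illustration: the page capacity of 4, the half-and-half split policy, and the `rows_moved_by_inserts` function are not InnoDB's real page handling. The sketch only counts how many rows have to be moved to a new page when a full page is split.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical and deliberately tiny


def rows_moved_by_inserts(keys):
    """Insert distinct keys into sorted leaf pages; count rows moved by splits."""
    pages = [[]]  # each page is a sorted list of keys
    moved = 0
    for key in keys:
        # Locate the page whose key range should hold the new key.
        first_keys = [page[0] for page in pages[1:]]
        i = bisect.bisect_right(first_keys, key)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the file: open a fresh page, nothing moves.
            pages.append([key])
        else:
            # Insert into the middle of a full page: split it in half.
            bisect.insort(page, key)
            half = len(page) // 2
            pages[i:i + 1] = [page[:half], page[half:]]
            moved += len(page) - half  # rows relocated to the new page
    return moved


sequential = list(range(1, 101))                     # auto-increment style keys
scattered = [(i * 37) % 101 for i in range(1, 101)]  # same keys, non-monotonic order

print(rows_moved_by_inserts(sequential))  # 0
print(rows_moved_by_inserts(scattered) > 0)  # True
```

In this model, monotonically increasing keys only ever append to the last page, while scattered keys repeatedly land in full pages and force rows to be relocated.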

There are many articles about how MySQL indexes are implemented; the discussion above also draws on "Data structure and algorithm principles behind MySQL indexes". Next, let's look at how Lucene indexes work.

Lucene index implementation

Lucene indexes are not organized as B+Trees but as inverted indexes. A Lucene inverted index consists of a term index, a term dictionary, and posting lists.

Where there is an inverted index, there is also a forward index. The forward index is the direct mapping from each document to its fields:

DocID    Name     Sex       Age
1        Jack     Male      18
2        Lucy     Female    17
3        Peter    Male      17

The inverted index is the mapping from each value of a field to the documents that contain that value:

Sex field:

Male      [1, 3]
Female    [2]

Age field:

18    [1]
17    [2, 3]
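As a minimal sketch, the inverted index for the example documents can be derived from the forward index like this. The `build_inverted_index` function and the nested-dict layout are illustrative choices, not Lucene's actual storage format.

```python
from collections import defaultdict

# Forward index: document ID -> field values (the example documents above).
documents = {
    1: {"Name": "Jack", "Sex": "Male", "Age": 18},
    2: {"Name": "Lucy", "Sex": "Female", "Age": 17},
    3: {"Name": "Peter", "Sex": "Male", "Age": 17},
}


def build_inverted_index(docs):
    """Return field -> value -> list of document IDs in ascending order."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            inverted[field][value].append(doc_id)
    return inverted


index = build_inverted_index(documents)
print(index["Sex"]["Male"])  # [1, 3]
print(index["Age"][17])      # [2, 3]
```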

Values such as Jack and Lucy, or 17 and 18, are called terms, and [1, 3] is a posting list. A posting list is an array of integers that stores the IDs of all documents matching a particular term. So what are the term index and the term dictionary?

Suppose the name field contains many terms, such as: Carla, Sara, Elin, Ada, Patty, Kate, Selena

If the terms are kept in this order, finding a particular term is slow: because the terms are unsorted, every one of them may have to be examined. After sorting, the list becomes: Ada, Carla, Elin, Kate, Patty, Sara, Selena
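The difference between the two orderings can be sketched with Python's standard bisect module. The helper names `linear_find` and `binary_find` are hypothetical; the sorted list is a stand-in for an ordered set of terms, not Lucene's on-disk format.

```python
import bisect

terms = ["Carla", "Sara", "Elin", "Ada", "Patty", "Kate", "Selena"]
sorted_terms = sorted(terms)


def linear_find(term_list, target):
    """Unsorted terms: every entry may have to be checked."""
    for position, term in enumerate(term_list):
        if term == target:
            return position
    return -1


def binary_find(sorted_list, target):
    """Sorted terms: halve the search range at each step."""
    pos = bisect.bisect_left(sorted_list, target)
    if pos < len(sorted_list) and sorted_list[pos] == target:
        return pos
    return -1


print(sorted_terms)                       # ['Ada', 'Carla', 'Elin', 'Kate', 'Patty', 'Sara', 'Selena']
print(binary_find(sorted_terms, "Kate"))  # 3
```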

With the terms sorted, binary search can find the target term much faster than a full traversal. This sorted organization of terms is the term dictionary. With a term dictionary, the target can be found with relatively few comparisons and disk reads. However, random reads from disk are still very expensive, so as much data as possible should be cached in memory to keep disk reads to a minimum. The term dictionary as a whole is too large to fit entirely in memory, and that is why the term index exists. The term index is a bit like the index pages at the front of a large dictionary. For example:

Terms starting with A ................ page xxx

Terms starting with C ................ page xxx

Terms starting with E ................ page xxx

If all terms consisted of English letters, the term index could simply be a table of 26 entries, one per letter. In practice, however, terms are not necessarily made of English letters; a term can be any byte array. Moreover, the 26 letters would not each have the same number of terms: there may be no terms at all starting with x, but very many starting with s. The actual term index is a trie:

The example is a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". The term index does not contain all the terms; it contains only some of their prefixes. The term index lets you quickly locate an offset in the term dictionary and then search sequentially from that position. With compression techniques (for more information, search for Lucene Finite State Transducers), the term index can be only about a tenth of the size of all the terms, which makes it possible to cache the entire term index in memory.
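The idea of jumping from a prefix to an offset and then scanning can be sketched with a plain trie built from nested dicts. This is a simplification under stated assumptions: it uses no FST compression, indexes only a fixed-length prefix (`PREFIX_LEN`, a hypothetical parameter), and the `build_term_index` and `lookup` helpers are made up for illustration.

```python
# Sorted term dictionary (in Lucene this would live on disk in blocks).
term_dictionary = ["Ada", "Carla", "Elin", "Kate", "Patty", "Sara", "Selena"]

PREFIX_LEN = 1  # hypothetical: index only the first character of each term


def build_term_index(terms, prefix_len=PREFIX_LEN):
    """Build a trie of term prefixes. The node at the end of each prefix stores
    the offset of the first dictionary term that starts with that prefix."""
    root = {}
    for offset, term in enumerate(terms):
        node = root
        for ch in term[:prefix_len]:
            node = node.setdefault(ch, {})
        node.setdefault("$offset", offset)  # keep the first (smallest) offset
    return root


def lookup(term_index, terms, target, prefix_len=PREFIX_LEN):
    """Walk the trie to get a starting offset, then scan the dictionary."""
    node = term_index
    for ch in target[:prefix_len]:
        if ch not in node:
            return -1
        node = node[ch]
    start = node.get("$offset")
    if start is None:
        return -1
    for pos in range(start, len(terms)):
        if terms[pos] == target:
            return pos
        if terms[pos] > target:  # passed the place where it would be
            break
    return -1


term_index = build_term_index(term_dictionary)
print(lookup(term_index, term_dictionary, "Selena"))  # 6
print(lookup(term_index, term_dictionary, "Bob"))     # -1
```

Only the small trie needs to stay in memory; the scan touches just the part of the dictionary that starts at the returned offset.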

The overall effect is as follows:

With the path from term index to term dictionary to posting list, the process of finding results for a keyword in a given field becomes clear. Queries that combine several keywords with AND or OR are also straightforward: take the intersection or union of the corresponding posting lists.
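A minimal sketch of combining two posting lists, assuming each list is sorted by document ID. The `intersect` and `union` functions are illustrative linear merges, not Lucene's implementation.

```python
def intersect(a, b):
    """AND: document IDs present in both sorted posting lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: document IDs present in either sorted posting list, without duplicates."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result


male = [1, 3]    # posting list for Sex = Male
age_17 = [2, 3]  # posting list for Age = 17

print(intersect(male, age_17))  # [3]
print(union(male, age_17))      # [1, 2, 3]
```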

Comparing this with the principles of MySQL's B+Tree indexes, we can see that:

1) Lucene's term index and term dictionary together play the role of MySQL's B+Tree: they provide an index over keywords. Retrieval through Lucene's inverted index can be faster than retrieval through MySQL's B+Tree.

2) The term index is held in memory as an FST (finite state transducer), which is very memory-efficient. Lucene can therefore locate a keyword very quickly, whereas a MySQL B+Tree lookup may need to read from disk to perform its comparisons.

3) The term dictionary is stored on disk in blocks, and each block is compressed by factoring out the common prefix. For example, if every term in a block starts with "Ab", the "Ab" can be omitted. This lets the term dictionary use less disk space than a B+Tree.

4) Lucene uses different indexing methods for different data types. The analysis above applies to string fields; for integers there is TrieIntField, and for longitude and latitude, GeoHash encoding can be used.

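5) In MySQL, indexes created independently on two fields generally cannot be combined efficiently in one query (the optimizer's index merge covers only some cases), so a composite index usually has to be created for queries that filter on both fields. Lucene can combine the indexes of any fields with AND or OR during retrieval.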
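The common-prefix compression mentioned in point 3 can be sketched as follows. This only illustrates the idea of storing a shared prefix once per block; Lucene's real block format is more elaborate, and the `compress_block` and `decompress_block` helpers and the sample terms are made up.

```python
import os


def compress_block(terms):
    """Store the prefix shared by every term once, plus each term's suffix."""
    prefix = os.path.commonprefix(terms)
    return prefix, [term[len(prefix):] for term in terms]


def decompress_block(prefix, suffixes):
    """Rebuild the original terms from the shared prefix and the suffixes."""
    return [prefix + suffix for suffix in suffixes]


block = ["about", "above", "abroad"]
prefix, suffixes = compress_block(block)

print(prefix)    # ab
print(suffixes)  # ['out', 'ove', 'road']
```

Here 16 characters of terms are stored as 2 characters of prefix plus 10 characters of suffixes, and the saving grows with the length of the shared prefix and the number of terms in the block.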
