Mysql super large data/table management skills

Source: Internet
Author: User

If you are not interested in the long article, you can also look at the results directly. Maybe you are interested in the results. In practical applications, after storage and optimization, the query response speed can be controlled within 1 to 20 milliseconds for over million data records. It seems to be a good result, but there is no end to optimizing this path. When our system is used by more than a few hundred people and thousands of people at the same time, it will still be far from enough.

Directory:

Partitioned Storage
Optimize Query
Improve Partition
Fuzzy search
Continuous Improvement Plan

Body:

Partitioned Storage
For ultra-large data, partition storage is a good choice, or this is a required option. In this example, the data record sources are different. You can divide the data based on the source. However, this is not enough because the data in each source partition may exceed 10 million. The storage and query of data is too large. MySQL5.x and later have better support for data partitions and subpartitions. Therefore, data is stored using partitions and subpartitions.

The basic data structure definition is as follows:

Copy codeThe Code is as follows:
Create table 'tmp _ sampledata '(
'Id' bigint (20) unsigned not null AUTO_INCREMENT,
'Username' varchar (32) default null,
'Passwd' varchar (32) default null,
'Email 'varchar (64) default null,
'Nickname' varchar (32) default null,
'Siteid' varchar (32) default null,
'Src' smallint (6) not null default '0 ′,
Primary key ('id', 'src ')
) ENGINE = MyISAM AUTO_INCREMENT = 95660181 default charset = gbk
/*! 50500 partition by list columns (src)
Subpartition by hash (id)
SUBPARTITIONS 5
(PARTITION pose values in (1) ENGINE = MyISAM,
PARTITION p2736 values in (2) ENGINE = MyISAM,
PARTITION p738536 values in (3) ENGINE = MyISAM,
PARTITION interval limit 48 values in (4) ENGINE = MyISAM,
PARTITION p842692 values in (5) ENGINE = MyISAM,
PARTITION p7575 values in (6) ENGINE = MyISAM,
PARTITION p0000386 values in (7) ENGINE = MyISAM,
PARTITION p62678 values in (8) ENGINE = MyISAM )*/

For data tables with partitions and subpartitions, data columns used in partition conditions (including subpartition conditions) should be defined in primary key or unique key. For detailed partition definition formats, see the MySQL documentation. The above structure is the storage method of the first draft (which will be modified later ). The load data infile method is used to load tens of millions of records in 30 minutes. It seems fast (bulk_insert_buffer_size = 8 m ).
Basic Query Optimization
After the data is loaded, we test a query:

Copy codeThe Code is as follows:
Mysql> explain select * from tmp_sampledata where id = 9562468 \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tmp_sampledata
Type: ref
Possible_keys: PRIMARY
Key: PRIMARY
Key_len: 8
Ref: const
Rows: 8
Extra:
1 row in set (0.00 sec)

There is no doubt that the primary key is used for querying by id, and the query speed will be very fast. However, such an approach is meaningless. Because it is impossible for end users to know the id of any data. If you want to query by username:

Copy codeThe Code is as follows:
Mysql> explain select * from tmp_sampledata where username = 'yourusername' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tmp_sampledata
Type: ALL
Possible_keys: NULL
Key: NULL
Key_len: NULL
Ref: NULL
Rows: 74352359
Extra: Using where
1 row in set (0.00 sec)

Mysql> explain select * from tmp_sampledata where src between 1 and 7 and username = 'yourusername' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tmp_sampledata
Type: ALL
Possible_keys: NULL
Key: NULL
Key_len: NULL
Ref: NULL
Rows: 74352359
Extra: Using where
1 row in set (0.00 sec)

Then this query cannot be used. No one can wait for a full table search for hundreds of millions of tables! This is to consider whether to create an index for username, which will certainly increase the query speed:

Create index idx_username on tmp_sampledata (username );

This index has been created for a long time and seems to have exceeded the data loading time.

Copy codeThe Code is as follows:
Mysql> explain select * from tmp_sampledata2 where username = 'yourusername' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tmp_sampledata2
Type: ref
Possible_keys: idx_username
Key: idx_username
Key_len: 66
Ref: const
Rows: 80
Extra: Using where
1 row in set (0.00 sec)

As expected, this query uses an index and the query speed is within the acceptable range.
However, this brings about another problem: Creating an index requires extra space !! When we create indexes for both username and email, the space usage is greatly improved! This is also not what we expect (helpless choice ?).

In addition to using indexes and ensuring that they can be used in queries, the keyword field of a partition is an important optimization factor, for example, the following example:

Copy codeThe Code is as follows:
Mysql> explain select id from tsampledata where username = 'abcdef' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tsampledata
Type: ref
Possible_keys: idx_sampledata_username
Key: idx_sampledata_username
Key_len: 66
Ref: const
Rows: 80
Extra: Using where
1 row in set (0.00 sec)

Mysql> explain select id from tsampledata where username = 'abcdef' and src in (,) \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tsampledata
Type: ref
Possible_keys: idx_sampledata_username
Key: idx_sampledata_username
Key_len: 66
Ref: const
Rows: 40
Extra: Using where
1 row in set (0.01 sec)

Mysql> explain select id from tsampledata where username = 'abcdef' and src in (2) \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tsampledata
Type: ref
Possible_keys: idx_sampledata_username
Key: idx_sampledata_username
Key_len: 66
Ref: const
Rows: 10
Extra: Using where
1 row in set (0.00 sec)

Mysql> explain select id from tsampledata where username = 'abcdef' and src in (2, 3) \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: tsampledata
Type: ref
Possible_keys: idx_sampledata_username
Key: idx_sampledata_username
Key_len: 66
Ref: const
Rows: 20
Extra: Using where
1 row in set (0.00 sec)

When a query statement is used to query partitions, the query costs vary greatly:

Where username = 'abcdef' rows: 80
Where username = 'abcdef' and src in (2, 3, 4, 5) rows: 40
Where username = 'abcdef' and src in (2) rows: 10
Where username = 'abcdef' and src in (2, 3) rows: 20

From the analysis, we can see that when you query the limit based on src (partition field of the Partition Table), the number of affected rows changes. Rows: 80 indicates that you need to search for eight partitions.
Improved data storage: Another partition format
In statistical applications, username and email are most commonly used for data query. In table Storage, username and email should be used for partitioning instead of id. Therefore, re-create a partition table and import data:

Copy codeThe Code is as follows:
Create table 'tmp _ sampledata '(
'Id' bigint (20) unsigned not null,
'Username' varchar (32) not null default ",
'Passwd' varchar (32) default null,
'Email 'varchar (64) not null default ",
'Nickname' varchar (32) default null,
'Siteid' varchar (32) default null,
'Src' smallint (6) not null default '0 ′,
Primary KEY ('src', 'username', 'email ', 'id ')
) ENGINE = MyISAM default charset = gbk
Partition by list columns (src)
Subpartition by key (username, email)
SUBPARTITIONS 10
(PARTITION pose values in (1) ENGINE = MyISAM,
PARTITION p2736 values in (2) ENGINE = MyISAM,
PARTITION p738536 values in (3) ENGINE = MyISAM,
PARTITION interval limit 48 values in (4) ENGINE = MyISAM,
PARTITION p842692 values in (5) ENGINE = MyISAM,
PARTITION p7575 values in (6) ENGINE = MyISAM,
PARTITION p0000386 values in (7) ENGINE = MyISAM,
PARTITION p62678 values in (8) ENGINE = MyISAM )?;

This definition is fine. As expected, it partitions data tables based on the primary key. However, this has a very serious performance problem: when loading data infile, you can also create indexes for the data. This greatly prolongs the data loading time, which is also intolerable. In the preceding example, if primary key or unique key is enabled during table creation, load data infile has been executed for more than 12 hours on my test system. And the following:

Copy codeThe Code is as follows:
Create table 'tmp _ sampledata '(
'Id' bigint (20) unsigned not null,
'Username' varchar (32) not null default ",
'Passwd' varchar (32) default null,
'Email 'varchar (64) not null default ",
'Nickname' varchar (32) default null,
'Siteid' varchar (32) default null,
'Src' smallint (6) not null default '0 ′
) ENGINE = MyISAM default charset = gbk
Partition by list columns (src)
Subpartition by key (username, email)
SUBPARTITIONS 10
(PARTITION pose values in (1) ENGINE = MyISAM,
PARTITION p2736 values in (2) ENGINE = MyISAM,
PARTITION p738536 values in (3) ENGINE = MyISAM,
PARTITION interval limit 48 values in (4) ENGINE = MyISAM,
PARTITION p842692 values in (5) ENGINE = MyISAM,
PARTITION p7575 values in (6) ENGINE = MyISAM,
PARTITION p0000386 values in (7) ENGINE = MyISAM,
PARTITION p62678 values in (8) ENGINE = MyISAM )?;

It takes only 5 minutes to load data:
Mysql> load data infile 'cvsfile.txt 'into table tmp_sampledata fields terminated by' \ t' escaped ";
Query OK, 74352359 rows affected, 65535 warnings (5 min 23.67 sec)
Records: 74352359 Deleted: 0 Skipped: 0 Warnings: 51267046

So, all the problems are back to 2.
Fuzzy search in test Query
For large data tables that have created indexes, generally targeted queries can meet the needs. However, some queries may not be efficient through indexing, such as querying mailboxes ending with 163.com:

Select... From... Where email like '% 163.com'

Even if the data has an index for the email, the index cannot be used for the above query. If we use oracle, we can create a reverse index, but mysql does not support reverse indexes. Therefore, if a similar query occurs, there are only two solutions:
Through data redundancy, the required fields are reversed and saved separately, and an index is created.
In this way, the above query can be completed through where email like 'moc. 361% ', but the cost (storage and update) is too high.
Full-text retrieval is implemented through fulltext. However, mysql does not support fulltext in partitioned tables (maybe wait for a later version .)
Perform word segmentation by yourself fulltext
No final solution

Create a partition table without any indexes or keys;
Import data;
Create an index;

It takes a long time to create an index. Here we make a small adjustment to increase the sorting space of the myisam index to 1 GB (8 MB by default ):

Mysql> set myisam_sort_buffer_size = 1048576000;
Query OK, 0 rows affected (0.00 sec)

Mysql> create index idx_username_src on tmp_sampledata (username, src );
Query OK, 74352359 rows affected (7 min 13.11 sec)
Records: 74352359 Duplicates: 0 Warnings: 0

Mysql> create index idx_email_src on tmp_sampledata (email, src );
Query OK, 74352359 rows affected (10 min 48.30 sec)
Records: 74352359 Duplicates: 0 Warnings: 0

Mysql> create index idx_src_username_email on tmp_sampledata (src, username, email );
Query OK, 74352359 rows affected (16 min 5.35 sec)
Records: 74352359 Duplicates: 0 Warnings: 0

In actual application, this table may be created without so many indexes, just to show the creation speed.
Results in actual application
The storage problem has been solved for now. Next, we have gone through a series of server parameter adjustments and query optimizations, I can only control the query response speed of over million data in 1 to 20 milliseconds. It sounds like a good result. However, when our system is used by more than a few hundred people at the same time, it is still far from enough. There may be opportunities to optimize the storage and query in the future. Let me look forward to it.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.