MySQL Super large data/table management skills _mysql

Source: Internet
Author: User
Tags create index

If you are not interested in long speeches, you can also see the results directly, perhaps you are interested in the results. In practical applications, storage, optimization can be done in more than 90 million of the query response speed control in 1 to 20 milliseconds. It seems to be a good result, but the road to optimization is not the end, when our system has more than hundreds of people, thousands of people use at the same time, will still show the powerless.

Directory:

Partitioned storage
Optimizing queries
Improve zoning
Fuzzy search
Programmes for continuous improvement

Body:

Partitioned storage
Partitioning storage is a good choice for oversized data, or it's a required option. For this example, the data logging originates from different sources, which can be divided first by source. But that's not enough, because each source's partition can be more than tens of millions of data. This is still too large for storing and querying data. Mysql5.x has been better supported by data partitioning and sub partitions. Therefore, the data is stored by partitioning + sub partitions.

Here is the basic data structure definition:

Copy Code code as follows:

CREATE TABLE ' Tmp_sampledata ' (
' ID ' bigint unsigned not NULL auto_increment,
' username ' varchar DEFAULT NULL,
' passwd ' varchar DEFAULT NULL,
' Email ' varchar DEFAULT NULL,
' Nickname ' varchar DEFAULT NULL,
' SiteID ' varchar DEFAULT NULL,
' src ' smallint (6) Not NULL DEFAULT ' 0′,
PRIMARY KEY (' id ', ' src ')
) Engine=myisam auto_increment=95660181 DEFAULT CHARSET=GBK
/*!50500 PARTITION by LIST COLUMNS (SRC)
Subpartition by HASH (ID)
Subpartitions 5
(PARTITION pose VALUES in (1) ENGINE = MyISAM,
PARTITION p2736 VALUES in (2) ENGINE = MyISAM,
PARTITION p736736 VALUES in (3) ENGINE = MyISAM,
PARTITION p3838648 VALUES in (4) ENGINE = MyISAM,
PARTITION p842692 VALUES in (5) ENGINE = MyISAM,
PARTITION p7575 VALUES in (6) ENGINE = MyISAM,
PARTITION p386386 VALUES in (7) ENGINE = MyISAM,
PARTITION p62678 VALUES in (8) ENGINE = MyISAM) * *

Data columns used in partition conditions (including child partition conditions) should be defined in the primary key or the unique key for data tables that have partitions and sub partitions. Detailed partition definition format, you can refer to the MySQL documentation. The above structure is the first version of the storage method (the following text will be modified). Loaded with load data infile, loading 80 million records in 30 minutes. It's still pretty fast (bulk_insert_buffer_size=8m).
Basic query optimization
After the data was loaded, we tested a query:

Copy Code code as follows:

Mysql> Explain select * from Tmp_sampledata where id=9562468\g
1. Row ***************************
Id:1
Select_type:simple
Table:tmp_sampledata
Type:ref
Possible_keys:primary
Key:primary
Key_len:8
Ref:const
Rows:8
Extra:
1 row in Set (0.00 sec)

There is no doubt that querying by ID uses a primary key, and the query is fast. But such an approach makes little sense. Because for the end user, it is impossible to know the ID of any data. If you need to follow the username query:

Copy Code code as follows:

Mysql> Explain select * from tmp_sampledata where username = ' YourUserName ' \g
1. Row ***************************
Id:1
Select_type:simple
Table:tmp_sampledata
Type:all
Possible_keys:null
Key:null
Key_len:null
Ref:null
rows:74352359
Extra:using where
1 row in Set (0.00 sec)

Mysql> Explain select * from Tmp_sampledata where src between 1 and 7 and username = ' YourUserName ' \g
1. Row ***************************
Id:1
Select_type:simple
Table:tmp_sampledata
Type:all
Possible_keys:null
Key:null
Key_len:null
Ref:null
rows:74352359
Extra:using where
1 row in Set (0.00 sec)

Then this query won't work. No one can wait for a full table search on billions of watches! This is the time to consider whether to create an index for username, which will definitely improve the query speed:

Create INDEX Idx_username on tmp_sampledata (username);

This is a long time to create an index, which seems to exceed the data load time, but at least it is built.

Copy Code code as follows:

        mysql> explain select * FROM TMP_SAMPLEDATA2 where username = ' yourusername ' \g
        ******************* 1. Row ***************************
        id:1
         select_type:simple
        table:tmp_sampledata2
        type:ref
        Possible_keys: Idx_username
        key:idx_username
         key_len:66
        ref:const
         rows:80
        extra:using where
   & nbsp;    1 row in Set (0.00 sec)

As expected, the query uses an index, and the query speed is within an acceptable range.
But this poses another problem: Creating an index requires space!! When we create indexes on username and email, the use of space is greatly improved! This is also not what we expect to see (helpless choice?) )。

In addition to using indexes and ensuring that they can be used in queries, the key field of a partition is an important optimization factor, such as the following example:

Copy Code code as follows:

Mysql> explain select id from tsampledata where username= ' abcdef ' \g
1. Row ***************************
Id:1
Select_type:simple
Table:tsampledata
Type:ref
Possible_keys:idx_sampledata_username
Key:idx_sampledata_username
key_len:66
Ref:const
Rows:80
Extra:using where
1 row in Set (0.00 sec)

        mysql> explain select id from tsampledata where username= ' abcdef ' and SRC in (2,3,4,5) \g
        *************************** 1. Row ****************
        id:1
        Select_type:simple
        table:tsampledata
         type:ref
        possible_keys:idx_sampledata_username
        key:idx_sampledata_username
         key_len:66
        ref:const
         rows:40
        extra:using where
         1 row in Set (0.01 sec)

        mysql> explain select id from tsampledata where username= ' abcdef ' and SRC in (2) \g
        *************************** 1 row **********************
        id:1
        select_ Type:simple
        table:tsampledata
         type:ref
        possible_keys:idx_sampledata_username
        Key:idx_sampledata_username
         key_len:66
        ref:const
         rows:10
        extra:using where
         1 row in Set (0.00 sec)

Mysql> explain select id from tsampledata where username= ' abcdef ' and src in (2,3) \g
1. Row ***************************
Id:1
Select_type:simple
Table:tsampledata
Type:ref
Possible_keys:idx_sampledata_username
Key:idx_sampledata_username
key_len:66
Ref:const
Rows:20
Extra:using where
1 row in Set (0.00 sec)

    the query cost varies significantly when the same query is made based on whether or not the query is limited to a partition:

        where username= ' abcdef '                                                        rows:80
        where Username= ' abcdef ' and SRC in (2,3,4,5)             rows:40
        where username= ' abcdef ' and Src in (2)                          Rows:10
        where username= ' abcdef ' and src in (2,3)                      rows:20

It is seen from the analysis that the number of affected (rows) is changing when the query is limited according to the SRC (partition field of the partitioned table). The rows:80 represents the need to search for 8 partitions.
Improving data storage: Another partition format
Since in the statistical application, most uses is through the username, the email carries on the data query, then should consider using the Username,email to partition when the table stores, but does not pass the ID. So recreate the partition table and import the data:

Copy Code code as follows:

CREATE TABLE ' Tmp_sampledata ' (
' ID ' bigint (m) unsigned not NULL,
' username ' varchar not NULL DEFAULT ',
' passwd ' varchar DEFAULT NULL,
' Email ' varchar not NULL DEFAULT ',
' Nickname ' varchar DEFAULT NULL,
' SiteID ' varchar DEFAULT NULL,
' src ' smallint (6) Not NULL DEFAULT ' 0′,
Primary KEY (' src ', ' username ', ' email ', ' id ')
) Engine=myisam DEFAULT CHARSET=GBK
PARTITION by LIST COLUMNS (SRC)
Subpartition by KEY (Username,email)
Subpartitions 10
(PARTITION pose VALUES in (1) ENGINE = MyISAM,
PARTITION p2736 VALUES in (2) ENGINE = MyISAM,
PARTITION p736736 VALUES in (3) ENGINE = MyISAM,
PARTITION p3838648 VALUES in (4) ENGINE = MyISAM,
PARTITION p842692 VALUES in (5) ENGINE = MyISAM,
PARTITION p7575 VALUES in (6) ENGINE = MyISAM,
PARTITION p386386 VALUES in (7) ENGINE = MyISAM,
PARTITION p62678 VALUES in (8) ENGINE = MyISAM)?;

There is nothing wrong with this definition, as expected, it will be based on the primary key to the data table partition. But there's a very, very serious performance problem: The data is indexed at the time of load data infile. This greatly prolongs the data loading time, and is equally intolerable. The above example, if the primary key or the unique key is enabled on my test system, the load data infile executes for more than 12 hours. And here's the following:

Copy Code code as follows:

CREATE TABLE ' Tmp_sampledata ' (
' ID ' bigint (m) unsigned not NULL,
' username ' varchar not NULL DEFAULT ',
' passwd ' varchar DEFAULT NULL,
' Email ' varchar not NULL DEFAULT ',
' Nickname ' varchar DEFAULT NULL,
' SiteID ' varchar DEFAULT NULL,
' src ' smallint (6) Not NULL DEFAULT ' 0′
) Engine=myisam DEFAULT CHARSET=GBK
PARTITION by LIST COLUMNS (SRC)
Subpartition by KEY (Username,email)
Subpartitions 10
(PARTITION pose VALUES in (1) ENGINE = MyISAM,
PARTITION p2736 VALUES in (2) ENGINE = MyISAM,
PARTITION p736736 VALUES in (3) ENGINE = MyISAM,
PARTITION p3838648 VALUES in (4) ENGINE = MyISAM,
PARTITION p842692 VALUES in (5) ENGINE = MyISAM,
PARTITION p7575 VALUES in (6) ENGINE = MyISAM,
PARTITION p386386 VALUES in (7) ENGINE = MyISAM,
PARTITION p62678 VALUES in (8) ENGINE = MyISAM)?;

The data load took only 5 minutes:
mysql> load Data infile ' cvsfile.txt ' into the table Tmp_sampledata fields terminated by ' \ t ' escaped by ';
Query OK, 74352359 rows affected, 65535 warnings (5 min 23.67 sec)
records:74352359 deleted:0 skipped:0 warnings:51267046

So, all the questions are back to 2. On
Fuzzy search in a test query
For a large data table that creates a good index, a generic, targeted query should be able to meet your needs. However, some queries may not be efficient through indexing, such as querying mailboxes at the end of 163.com:

Select ... from ... where email like '%163.com '

Even if the data is indexed for email, the query above does not use that index. If we are using Oracle, we can also establish a reverse index, but MySQL does not support reverse indexing. So if a similar query occurs, only two scenarios can:
With data redundancy, reverse the required fields and save them, and create an index
The above query can be done through where email like ' moc.361% ', but the cost (storage, update) is too high
Through the Full-text search fulltext to realize. However, MySQL also does not support Fulltext on the partition table (perhaps waiting for a later version.) )
I do participle fulltext
No Final Solution

Create a partitioned table without any indexes or keys;
Import data;
Create an index;

Because it takes a long time to create an index, a small adjustment is made here to improve the sorting space of the MyISAM index by 1G (default is 8m):

Mysql> set myisam_sort_buffer_size=1048576000;
Query OK, 0 rows Affected (0.00 sec)

Mysql> CREATE index IDX_USERNAME_SRC on Tmp_sampledata (USERNAME,SRC);
Query OK, 74352359 rows affected (7 min 13.11 sec)
records:74352359 duplicates:0 warnings:0

Mysql> CREATE index IDX_EMAIL_SRC on Tmp_sampledata (EMAIL,SRC);
Query OK, 74352359 rows affected (min 48.30 sec)
records:74352359 duplicates:0 warnings:0

Mysql> CREATE index Idx_src_username_email on Tmp_sampledata (Src,username,email);
Query OK, 74352359 rows affected (min 5.35 sec)
records:74352359 duplicates:0 warnings:0

In practice, this table may not need so many indexes, all built again, just to show the speed of creation.
The effect in practical application
The storage problem is temporarily resolved here, followed by a series of server parameter tuning and query optimization, I can only do in this more than 90 million data in the query response speed control in 1 to 20 milliseconds. That sounds a good score. But when our system has more than hundreds of people to use at the same time, it is still very inadequate. There may be opportunities to optimize the store and query in the future. Let me slowly look forward to it.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.