I,Partition Concept
PartitionAllows you to allocate multiple parts of a single table across file systems based on specified rules. Different parts of a table are stored as separate tables in different locations. MySQL supports Partition from 5.1.3.
Comparison between partitions and Manual table shards
Manual table sharding |
Partition |
Multiple Data Tables |
One data table |
Risks of duplicate data |
No data duplication risks |
Write multiple tables |
Write a table |
No unified Constraints |
Mandatory Constraints |
MySQL supports the RANGE, LIST, HASH, and KEY partition types, among which RANGE is the most commonly used:
- Range-this mode allows data to be divided into different ranges. For example, you can divide a table into Several partitions by year.
- Hash-this mode allows you to calculate the Hash Key of one or more columns of a table, and partition the data area corresponding to different values using this Hash code. For example, you can create a table that partitions the primary key of the table.
- Key (Key value)-an extension of the Hash mode above. The Hash Key here is generated by the MySQL system.
- List (pre-defined List)-This mode allows the system to split data by the pre-defined List value.
- Composite (Composite mode)-combination of the above modes
Ii. What can a partition do?
- Logical Data Segmentation
- Improve the speed of a single write and read Application
- Improve the speed of reading and querying by partition range
- Split data can have multiple physical file paths
- Efficiently saves historical data
- Check constraints on a table
- Different master/slave server partition policies, such as master partitioning by Hash and slave partitioning by range
Iii. Partition restrictions(Deadline5.1.44Version)
• Only integer columns in the data table can be partitioned, or data columns can be converted to integer columns through the partition function.
• The maximum number of partitions cannot exceed 1024
• If it contains a unique index or primary key, the partition column must be included in all unique indexes or primary keys.
• Foreign keys are not supported
• Full-text indexing (fulltext) is not supported)
- Partitioning by date is very suitable because many date functions can be used. However, there are not many suitable partition functions for strings.
4. When to use partitions
• Massive Data Tables
• You can use ARCHIVE + PARTITION to quickly query historical tables.
• The data table index is larger than the server's valid memory
• You do not need to use indexes for large tables, especially when the index is much larger than the server's effective memory. In this case, the partition efficiency will be more effective.
5. Partition Experiment
Lab 1:
Data published using US Bureau of Transportation Statistics (in CSV format). Currently, this includes 0.113 billion records and 7.5 GB Data 5.2 GB index. The time ranges from 1987 to 2007.
The server uses 4 GB memory, so the data and index size exceeds the memory size. The reason for setting it to 4 GB is that the data warehouse size is much larger than the possible memory size, which can be several TB. For general OLTP databases, indexes are cached in the memory and can be quickly searched. If the data exceeds the memory size, different methods are required.
Create a table with a primary key, because usually the table has a primary key. The index cannot be read into memory because the table's primary key is too large. This is generally not efficient, which means frequent access to the disk depends entirely on your disk and processor. At present, a common practice in designing a large data warehouse is not to use indexes. Therefore, it will also have performance with or without a primary key.
Test method:
Use three data engines: MyISAM, InnoDB, and Archive.
For each type of engine, create an unpartitioned table (except archive) with a primary key and two partitioned tables, one by month and one by year. Partition Table partitions are as follows:
Create table by_year (
D DATE
)
Partition by range (YEAR (d ))
(
PARTITION P1 values less than (2001 ),
PARTITION P2 values less than (2002 ),
PARTITION P3 values less than (2003 ),
PARTITION P4 values less than (MAXVALUE)
)
Create table by_month (
D DATE
)
Partition by range (TO_DAYS (d ))
(
PARTITION P1 values less than (to_days ('1970-02-01 '),-January
PARTITION P2 values less than (to_days ('2017-03-01 '),-February
PARTITION P3 values less than (to_days ('2017-04-01 '),-March
PARTITION P4 values less than (MAXVALUE)
)
Each instance is tested on a separate instance on the mysql server. Each instance has only one database and one table. Each engine starts the service, runs the query, records the results, and closes the service. The service instance is created through MySQL Sandbox.
Data is loaded as follows:
ID |
Leader |
Partition? |
Data |
Size |
Remarks |
Loading time (*) |
1 |
MyISAM |
None |
0.113 billion |
13 GB |
With PK |
37 min |
2 |
MyISAM |
By month |
0.113 billion |
8 GB |
Without PK |
19 min |
3 |
MyISAM |
By year |
0.113 billion |
8 GB |
Without PK |
18 min |
4 |
InnoDB |
None |
0.113 billion |
16 GB |
With PK |
63 min |
5 |
InnoDB |
By month |
0.113 billion |
10 GB |
Without PK |
59 min |
6 |
InnoDB |
By year |
0.113 billion |
10 GB |
Without PK |
57 min |
7 |
Archive |
None |
0.113 billion |
1.8 GB |
No keys |
20 min |
8 |
Archive |
By month |
0.113 billion |
1.8 GB |
No keys |
21 min |
9 |
Archive |
By year |
0.113 billion |
1.8 GB |
No keys |
20 min |
* On dual-Xeon servers
To compare the effect of partitions on large and small datasets, an additional nine instances are created, each of which contains data of less than 2 GB.
There are two types of query statements
Select count (*)
FROM table_name
WHERE date_column BETWEEN start_date and end_date
SELECT column_list
FROM table_name
WHERE column1 = x and column2 = y and column3 = z
For the first query, create statements with different date ranges. For each range, create an additional query with the same range date. The first query in each date range is a cold query, which means that it is the first hit, and the subsequent query in the same range is a warm query, which means that at least part of the query is cached. The query statement is on the Forge.
Result:
1Partition Table with primary key
The first test uses a composite primary key, just as the original data table uses. The primary key index file reaches 5.5 GB. It can be seen that the partition not only does not improve the performance, but also slows down the operation of the primary key. If the primary key index query is used, and the index cannot be read into the memory, the query performance is poor. It indicates that partitions are useful, but must be used properly.
+ --- + ------ +
| Status | myisam not partitioned | myisam monthly partition | myisam annual partition |
+ --- + ------ +
| Cold | 2.6574570285714 | 2.9169642 | 3.0373419714286 |
| Warm | 2.5720722571429 | 3.1249698285714 | 3.1294000571429 |
+ --- + ------ +
Introduction to ARCHIVE
+ --- + ------ +
| Status | archive not partitioned | archive monthly partition | archive annual partition |
+ --- + ------ +
| Cold | 249.849563 | 1.2436211111111 | 12.632532527778 |
| Warm | 235.814442 | 1.0889786388889 | 12.600520777778 |
+ --- + ------ +
Note that the response time of the ARCHIVE leading monthly partition is better than that of MyISAM.
2Partition Table without primary key
If the size of the primary key exceeds the available key buffer or even all the memory, all queries using the primary key will use the disk. The new method only uses partitions, instead of primary keys. Performance has been significantly improved.
The performance of Partitioned Tables by month is improved by 70%-90%.
+ --- + ------ +
| Status | myisam not partitioned | myisam monthly partition | myisam annual partition |
+ --- + ------ +
| Cold | 2.6864490285714 | 0.64206445714286 | 2.6343286285714 |
| Warm | 2.8157905714286 | 0.18774977142857 | 2.2084743714286 |
+ --- + ------ +
To make the difference more obvious, I used two large-scale queries, which can be used to eliminate partitions.
# Query 1-Annual Statistics
SELECT year (FlightDate) as y, count (*)
FROM flightstats
WHERE FlightDate BETWEEN "2001-01-01" and "2003-12-31 ″
Group by y
# Query 2-Monthly Statistics
SELECT date_format (FlightDate, "% Y-% m") as m, count (*)
FROM flightstats
WHERE FlightDate BETWEEN "2001-01-01" and "2003-12-31 ″
Group by m
The results show that the monthly partition table has a performance improvement of 30%-60%, and the annual partition table has a performance improvement of 15%-30%.
+ ---- +
| Query_id | no | month | year |
+ ---- +
| 1 | 97.779958 | 36.296519 | 82.327554 |
| 2 | 69.61055 | 47.644986 | 47.60223 |
+ ---- +
Processor Factors
When the above tests are performed on an Intel Dual Core 2.3 MHz CPU. For dual Xeon 2.66 MHz, It is faster to find new servers !.
Repeat the test above, surprising:
+ --- + ------- + ----- + ------ +
| Status | myisam not partitioned | myisam monthly partition | myisam annual partition |
+ --- + ------- + ----- + ------ +
| Cold | 0.051063428571429 | 0.6577062 | 1.6663527428571 |
| Warm | 0.063645485714286 | 0.1093724 | 1.2369152285714 |
+ --- + ------- + ----- + ------ +
Myisam tables with primary keys without partitions are faster than partition tables. The performance of a partition table is the same as that of the original table, but the performance of a non-partition table is improved, making partitions unnecessary. Since this server seems to take full advantage of the index, I have added an index to the partition column of the partition table.
# Original table
Create table flightstats (
AirlineID int not null,
UniqueCarrier char (3) not null,
Carrier char (3) not null,
FlightDate date not null,
FlightNum char (5) not null,
TailNum char (8) not null,
ArrDelay double not null,
ArrTime datetime not null,
DepDelay double not null,
DepTime datetime not null,
Origin char (3) not null,
Dest char (3) not null,
Distance int not null,
Cancelled char (1) default 'n ',
Primary key (FlightDate, AirlineID, Carrier, UniqueCarrier, FlightNum, Origin, DepTime, Dest)
)
# Partitioned table
Create table flightstats (
AirlineID int not null,
UniqueCarrier char (3) not null,
Carrier char (3) not null,
FlightDate date not null,
FlightNum char (5) not null,
TailNum char (8) not null,
ArrDelay double not null,
ArrTime datetime not null,
DepDelay double not null,
DepTime datetime not null,
Origin char (3) not null,
Dest char (3) not null,
Distance int not null,
Cancelled char (1) default 'n ',
KEY (FlightDate)
)
Partition by range...
The results were satisfactory, with 35% performance improvement.
+ --- + ------- +
| Status | myisam not partitioned | myisam monthly partition | myisam annual partition |
+ --- + ------- +
| Cold | 0.075289714285714 | 0.025491685714286 | 0.072398542857143 |
| Warm | 0.064401257142857 | 0.031563085714286 | 0.056638085714286 |
+ --- + ------- +
Conclusion:
1.Using table partitions does not guarantee performance improvement. It depends on the following factors:
- The column used for partitioning used by the partition;
- Partition Function, if the original field is not int type;
- Server speed;
- Memory size.
2.Run Benchmark Test and performance test before application to Production System
Depending on the purpose of your database, you may get a huge performance improvement or nothing. If you are not careful, it may even reduce the performance.
For example, a table that uses monthly partitions can get an excellent speed when querying the date range. However, if there is no date query, a full table scan is performed.
Partitioning is a key tool for improving the performance of massive data. What is massive data depends on the deployed hardware. The blind use of partitions cannot guarantee performance improvement, but with the help of the preliminary Benchmark Test and performance test, it can become a perfect solution.
3.ArchiveTable can be a good compromise
After the Archive table is partitioned, the performance can be greatly improved. Of course, it also depends on your purpose. If no partition is available, any query is a full table scan. If you have historical data that does not need to be changed and you need to perform time-based analysis and statistics, using Archive is an excellent choice. It uses 10-20% of the original space and provides better performance for clustered queries than MyISAM/InnoDB tables.
Although the performance of a Well-optimized partition MyISAM table may be better than the corresponding Archive table, it requires 10 times of space.
Lab 2:
1. Create two tables, one partitioned by the time field and the other non-partitioned.
Create table part_tab
(
C1 int default NULL,
C2 varchar (30) default NULL,
C3 date default NULL
) Engine = myisam
Partition by range (year (c3) (PARTITION p0 values less than (1995 ),
PARTITION p1 values less than (1996), PARTITION p2 values less than (1997 ),
PARTITION p3 values less than (1998), PARTITION p4 values less than (1999 ),
PARTITION p5 values less than (2000), PARTITION p6 values less than (2001 ),
PARTITION p7 values less than (2002), PARTITION p8 values less than (2003 ),
PARTITION p9 values less than (2004), PARTITION p10 values less than (2010 ),
PARTITION p11 values less than maxvalue );
Create table no_part_tab
(C1 int (11) default NULL,
C2 varchar (30) default NULL,
C3 date default NULL) engine = myisam;
2. Create a stored procedure to insert 8 million different data records into two tables.
Delimiter //
Create procedure load_part_tab ()
Begin
Declare v int default 0;
While v< 8000000
Do
Insert into part_tab
Values (v, 'testing partitions ', adddate ('1970-01-01', (rand (v) * 1995) mod 36520 ));
Set v = v + 1;
End while;
End
//
Then execute
Mysql> delimiter;
Mysql> call load_part_tab ();
Query OK, 1 row affected (8 min 17.75 sec)
Mysql> insert into no_part_tab select * from part_tab;
Query OK, 8000000 rows affected (51.59 sec)
Records: 8000000 Duplicates: 0 Warnings: 0
3. Start to query the data in the two tables in a simple range. The execution process Parsing is displayed:
Mysql> select count (*) from no_part_tab where c3> date '2017-01-01 'and c3 <date '2017-12-31 ′;
+ ---- +
| Count (*) |
+ ---- +
| 1, 795181 |
+ ---- +
1 row in set (38.30 sec)
Mysql> select count (*) from part_tab where c3> date '2017-01-01 'and c3 <date '2017-12-31 ′;
+ ---- +
| Count (*) |
+ ---- +
| 1, 795181 |
+ ---- +
1 row in set (3.88 sec)
Mysql> explain select count (*) from no_part_tab where c3> date '2017-01-01 'and c3 <date '2017-12-31' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: no_part_tab
Type: ALL
Possible_keys: NULL
Key: NULL
Key_len: NULL
Ref: NULL
Rows: 8000000
Extra: Using where
1 row in set (0.00 sec)
Mysql> explain partitions select count (*) from part_tab where
-> C3> date '2017-01-01 'and c3 <date '2017-12-31' \ G
* *************************** 1. row ***************************
Id: 1
Select_type: SIMPLE
Table: part_tab
Partitions: p1
Type: ALL
Possible_keys: NULL
Key: NULL
Key_len: NULL
Ref: NULL
Rows: 798458
Extra: Using where
1 row in set (0.00 sec)
From the above results, we can see that the use of table partitions reduces the response time by 90% than that of non-partitions. The Command Parsing Explain program can be seen that only the first partition is scanned during the query of the partitioned table, and the rest are skipped. Further test:
-Added the date range.
Mysql> select count (*) from no_part_tab where c3> date '-01-01' and c3 <date '2017-12-31 ′;
+ ---- +
| Count (*) |
+ ---- +
| 1, 2396524 |
+ ---- +
1 row in set (5.42 sec)
Mysql> select count (*) from part_tab where c3> date '-01-01' and c3 <date '2017-12-31 ′;
+ ---- +
| Count (*) |
+ ---- +
| 1, 2396524 |
+ ---- +
1 row in set (2.63 sec)
-Added the query for unindexed fields.
Mysql> select count (*) from part_tab where c3> date '-01-01' and c3 <date
'2017-12-31 'and c2 = 'hello ';
+ ---- +
| Count (*) |
+ ---- +
| 0 |
+ ---- +
1 row in set (0.75 sec)
Mysql> select count (*) from no_part_tab where c3> date '-01-01' and c3 <da
Te '2017-12-31 'and c2 = 'hello ';
+ ---- +
| Count (*) |
+ ---- +
| 0 |
+ ---- +
1 row in set (11.52 sec)
Conclusion:
- Partition and unpartitioned file space are roughly the same (data and index files)
- If no index field is created in the query statement, the partition time is far better than the time when no partition is created.
- If the fields in the query statement are indexed, the difference between partitions and unpartitioned partitions is reduced, and the partitions are slightly better than those without partitions.
- We recommend that you use the Partition Function for large data volumes.
- Remove unnecessary Fields
- According to the manual, adding myisam_max_sort_file_size will increase the partition performance.