Suppose a large table holds millions, tens of millions, or even hundreds of millions of records, and queries against it must return results within 10 seconds. What can you do? How do you optimize it?
In my current project there is a table with more than 10 million rows and more than 3 GB of data, and we need to run statistical queries against it. The table had never been optimized, so query performance was very poor and users were unhappy, and I joined the effort to optimize it.
For example, given the table structure below, we need to count the number of people born on a given day, the number of people in a given city, or the number of people born on a given day in a given city.
CREATE TABLE `population` (
  `population_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'population table id',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'name',
  `city` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT 'city',
  `birthday` date DEFAULT NULL COMMENT 'birthday',
  PRIMARY KEY (`population_id`)
);

-- Number of people born on a given day in a given city:
SELECT COUNT(*) FROM population WHERE city = 'guangzhou' AND birthday = '2017-11-02';

-- Number of people in a given city:
SELECT COUNT(*) FROM population WHERE city = 'guangzhou';

-- Number of people born on a given day:
SELECT COUNT(*) FROM population WHERE birthday = '2017-11-02';
Two optimization schemes are proposed:
(1). Optimize the index
Adding indexes greatly improves query efficiency: the time for a typical query drops from dozens of seconds to a few seconds.
Create the following two single-column indexes:
ALTER TABLE `population` ADD INDEX `fk_city` (`city`), ADD INDEX `fk_birthday` (`birthday`);
You can also create the following two composite indexes:
ALTER TABLE `population` ADD INDEX `fk_index1` (`city`, `birthday`), ADD INDEX `fk_index2` (`birthday`, `city`);
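Whichever scheme you pick, it is worth confirming that the optimizer actually uses the new indexes. A minimal check with EXPLAIN (the exact output varies with MySQL version and data distribution):

EXPLAIN SELECT COUNT(*) FROM population WHERE city = 'guangzhou' AND birthday = '2017-11-02';
-- The `key` column of the output should name one of the indexes above
-- (for example `fk_index1`) rather than NULL.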
(2). Use an intermediate table
Index optimization can greatly reduce query time, but once the data volume reaches a certain scale, in some cases several million indexed rows, queries become slow again, so indexing alone cannot fundamentally solve the problem. Our table keeps growing, on average by one to two million rows per month. Index optimization is therefore only a stopgap that works for modest data volumes; as the data grows rapidly, the performance gains from indexes quickly reach their limit, and another solution is needed.
Based on the characteristics of our business requirements, we create an intermediate table, population_statistics, store the statistics derived from the population table in it, and query population_statistics directly. Note: whenever rows are inserted into, deleted from, or updated in population, population_statistics must be updated at the same time, otherwise the two tables will become inconsistent! (A sketch of one way to do this follows the table definition below.)
CREATE TABLE `population_statistics` (
  `population_statistics_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'population statistics table id',
  `city` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'city',
  `birthday` date DEFAULT NULL COMMENT 'birthday',
  `total_count` int(32) DEFAULT NULL COMMENT 'population count',
  PRIMARY KEY (`population_statistics_id`),
  KEY `fk_city` (`city`),
  KEY `fk_birthday` (`birthday`)
);

-- Number of people born on a given day in a given city:
SELECT total_count FROM population_statistics WHERE city = 'guangzhou' AND birthday = '2017-11-02';

-- Number of people in a given city:
SELECT SUM(total_count) FROM population_statistics WHERE city = 'guangzhou';

-- Number of people born on a given day:
SELECT SUM(total_count) FROM population_statistics WHERE birthday = '2017-11-02';
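As a minimal sketch of keeping the two tables in sync (the unique key and trigger names are my additions, and a trigger is only one of several possible approaches; UPDATE and DELETE on population need analogous handling):

-- Assumed: a unique key on (city, birthday) so the upsert below can find its row.
ALTER TABLE `population_statistics` ADD UNIQUE KEY `uk_city_birthday` (`city`, `birthday`);

DELIMITER //
CREATE TRIGGER `trg_population_insert` AFTER INSERT ON `population`
FOR EACH ROW
BEGIN
  -- Create the (city, birthday) row on first sight, otherwise bump the count.
  INSERT INTO population_statistics (city, birthday, total_count)
  VALUES (NEW.city, NEW.birthday, 1)
  ON DUPLICATE KEY UPDATE total_count = total_count + 1;
END //
DELIMITER ;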
For a given city and day, the population table may hold thousands or even tens of thousands of rows, while population_statistics holds at most one row; in other words, population_statistics contains only about one thousandth of the rows in population. Combined with its indexes, this makes queries far faster.
Below is a summary of common optimization solutions for large data tables.
1. Index optimization
Establish reasonable, efficient indexes to improve query speed.
I recommend reading this blog post on indexes:
http://blog.csdn.net/brushli/article/details/39677387
2. SQL Optimization
Restructure and optimize SQL statements to improve query efficiency; in many cases this must be done with the indexes in mind.
I suggest reading a blog post I wrote on indexes:
http://blog.csdn.net/brushli/article/details/39677387
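For example (a minimal sketch against the population table above), applying a function to an indexed column defeats the index, while an equivalent range predicate preserves it:

-- Cannot use the index on `birthday` (function applied to the column):
SELECT COUNT(*) FROM population WHERE YEAR(birthday) = 2017;

-- Can use the index on `birthday` (equivalent range predicate):
SELECT COUNT(*) FROM population WHERE birthday >= '2017-01-01' AND birthday < '2018-01-01';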
3. Horizontal table sharding
If the data in a table has a natural dimension, such as time, you can split the table into multiple tables along that dimension.
For example, split the table by year, quarter, or month; a query is then routed to the relevant time-period tables and the partial results are merged.
Likewise, you can split a table by region, placing each region's data in its own table, then split the query the same way and merge the results. A sketch follows.
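A hedged sketch of a time-based split (the year-suffixed table names are hypothetical):

-- Each shard has the same structure as the original table:
CREATE TABLE `population_2016` LIKE `population`;
CREATE TABLE `population_2017` LIKE `population`;

-- Query each shard separately, then merge the partial results:
SELECT SUM(c) AS total FROM (
  SELECT COUNT(*) AS c FROM population_2016 WHERE city = 'guangzhou'
  UNION ALL
  SELECT COUNT(*) AS c FROM population_2017 WHERE city = 'guangzhou'
) AS merged;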
4. Vertical table splitting
Split a table into multiple tables by field: put frequently used fields in one table and rarely used or very large fields in another. Because the database reads data in fixed-size blocks of limited capacity (typically a few KB to a few dozen KB), a single I/O can fetch many more rows once the large fields are split off, which improves query efficiency.
For example, a Member table has the following structure:
CREATE TABLE `member` (
  `member_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'member table id',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'member name',
  `age` int(32) DEFAULT NULL COMMENT 'member age',
  `introduction` text COLLATE utf8_bin COMMENT 'member introduction',
  PRIMARY KEY (`member_id`)
);
The introduction column is a large field holding each member's introduction, and it seriously hurts query efficiency, so it can be moved into a separate table:
CREATE TABLE `member` (
  `member_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'member table id',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'member name',
  `age` int(32) DEFAULT NULL COMMENT 'member age',
  PRIMARY KEY (`member_id`)
);

CREATE TABLE `member_introduction` (
  `member_introduction_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'member introduction table id',
  `member_id` bigint(64) DEFAULT NULL COMMENT 'member id',
  `introduction` text COLLATE utf8_bin COMMENT 'member introduction',
  PRIMARY KEY (`member_introduction_id`),
  KEY `fk_member_id` (`member_id`),
  CONSTRAINT `fk_member_id` FOREIGN KEY (`member_id`) REFERENCES `member` (`member_id`)
);
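After the split, day-to-day queries touch only the narrow member table, and the text column is joined in only when it is actually needed, for example:

-- Served entirely from the narrow table:
SELECT member_id, name, age FROM member WHERE age > 30;

-- Pull in the large field only for the member being displayed:
SELECT m.name, i.introduction
FROM member m
JOIN member_introduction i ON i.member_id = m.member_id
WHERE m.member_id = 1;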
5. Create an intermediate table (trade space for time)
In some cases, you can create an intermediate table to speed up the query. For more information, see the example at the beginning of this article.
6. Cache data in memory (trade space for time)
Load frequently read, rarely modified data into memory and serve queries directly from memory.
Popular caching technologies such as Memcached, Redis, and Ehcache can be used.
7. Use other auxiliary technologies
Solr: a Lucene-based Java search engine.