Big Data Table query optimization solution and data table optimization solution

Last Update:2014-11-03 Source: Internet

Author: User

Tags database sharding



Suppose a large table contains millions, tens of millions, or even hundreds of millions of rows, and queries against it must return in near real time, within 10 seconds. How should it be optimized?

In my current project, there is a table with more than 10 million rows and more than 3 GB of data, and we need to run statistical queries on it. Because the table had never been optimized, queries were very slow, which frustrated users. So I took part in optimizing this table.

For example, given the table structure below, we need to count the number of people born on a given day, the number of people in a given city, or the number of people born on a given day in a given city.

CREATE TABLE `population` (
  `population_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'Population table',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'name',
  `city` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT 'city',
  `birthday` date DEFAULT NULL COMMENT 'birthday',
  PRIMARY KEY (`population_id`)
);

-- Query the number of people born on a day in a city
SELECT COUNT(*) FROM population WHERE city = 'guangzhou' AND birthday = '2017-11-02';

-- Query the population of a city
SELECT COUNT(*) FROM population WHERE city = 'guangzhou';

-- Query the number of people born on a day
SELECT COUNT(*) FROM population WHERE birthday = '2017-11-02';

Two optimization schemes are proposed:

(1). Optimize the index

After indexes are added, query efficiency improves greatly: the time for a typical query drops from dozens of seconds to a few seconds.

Create the following two single-column indexes:

ALTER TABLE `population`
  ADD INDEX `fk_city` (`city`),
  ADD INDEX `fk_birthday` (`birthday`);

You can also create the following two composite indexes:

ALTER TABLE `population`
  ADD INDEX `fk_index1` (`city`, `birthday`),
  ADD INDEX `fk_index2` (`birthday`, `city`);
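The effect of the composite index can be checked by looking at the query plan before and after the index exists. The following is a minimal sketch, not the project's actual setup: it assumes Python's built-in sqlite3 module with an in-memory SQLite database standing in for MySQL (where EXPLAIN plays the same role), and the sample rows are made up.

```python
import sqlite3

# SQLite in memory stands in for MySQL; table and index names follow the article.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population ("
    " population_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT, city TEXT, birthday TEXT)"
)

# Made-up sample rows: two cities, birthdays spread over 28 days.
rows = [
    (f"person{i}", "guangzhou" if i % 2 else "shenzhen", f"2017-11-{i % 28 + 1:02d}")
    for i in range(1000)
]
conn.executemany(
    "INSERT INTO population (name, city, birthday) VALUES (?, ?, ?)", rows
)

QUERY = "SELECT COUNT(*) FROM population WHERE city = ? AND birthday = ?"
PARAMS = ("guangzhou", "2017-11-02")


def query_plan(conn):
    """Return the planner's description of how QUERY would be executed."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    return " | ".join(row[3] for row in plan_rows)  # column 3 holds the detail text


plan_before = query_plan(conn)  # no usable index yet, so the table is scanned
conn.execute("CREATE INDEX fk_index1 ON population (city, birthday)")
plan_after = query_plan(conn)   # the plan now refers to fk_index1

count = conn.execute(QUERY, PARAMS).fetchone()[0]
print(plan_before)
print(plan_after)
print(count)
```

The exact wording of the plan text differs between SQLite versions, so the sketch only relies on the index name appearing in it.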

(2). Use an intermediate table
Index optimization can greatly reduce query time, but once the data volume reaches a certain level, for example when the index itself covers several million entries, queries are still slow. Index optimization therefore cannot fundamentally solve the problem. The table keeps growing, by 1 to 2 million rows per month on average. Indexing is only a temporary measure that works for smaller data volumes; as the data grows rapidly, the gains from indexes soon reach their limit, and another solution is needed.

Based on our business needs, we create an intermediate table, population_statistics, and store the aggregated statistics of the population table in it. Queries then read directly from population_statistics. Note: whenever rows in population are added, deleted, or modified, the data in population_statistics must be updated at the same time. Otherwise, the two tables become inconsistent!

CREATE TABLE `population_statistics` (
  `population_statistics_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'Population statistics table id',
  `city` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'city',
  `birthday` date DEFAULT NULL COMMENT 'birthday',
  `total_count` int(32) DEFAULT NULL COMMENT 'Population qty',
  PRIMARY KEY (`population_statistics_id`),
  KEY `fk_city` (`city`),
  KEY `fk_birthday` (`birthday`)
);

-- Query the number of people born on a day in a city
SELECT total_count FROM population_statistics WHERE city = 'guangzhou' AND birthday = '2017-11-02';

-- Query the population of a city
SELECT SUM(total_count) FROM population_statistics WHERE city = 'guangzhou';

-- Query the number of people born on a day
SELECT SUM(total_count) FROM population_statistics WHERE birthday = '2017-11-02';

The people born on a given day in a given city may occupy thousands or even tens of thousands of rows in the population table, while the statistics table population_statistics holds at most one row for them. In other words, population_statistics contains only about 1‰ as many rows as population. Combined with the indexes, this greatly improves query speed.
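The requirement to keep both tables in step can be sketched as a single transaction that writes the detail row and bumps the matching counter. This is an illustrative sketch under assumptions, not the project's code: it uses Python's built-in sqlite3 module with an in-memory database in place of MySQL, the helper add_person is hypothetical, and the statistics table is keyed on (city, birthday) for brevity. Totals that span several statistics rows are added up with SUM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE population (
    population_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT, city TEXT, birthday TEXT);
CREATE TABLE population_statistics (
    city TEXT NOT NULL,
    birthday TEXT NOT NULL,
    total_count INTEGER NOT NULL,
    PRIMARY KEY (city, birthday));
""")


def add_person(conn, name, city, birthday):
    """Hypothetical helper: insert a person and update the counter atomically."""
    with conn:  # commits both writes together, or rolls both back on error
        conn.execute(
            "INSERT INTO population (name, city, birthday) VALUES (?, ?, ?)",
            (name, city, birthday),
        )
        cur = conn.execute(
            "UPDATE population_statistics SET total_count = total_count + 1"
            " WHERE city = ? AND birthday = ?",
            (city, birthday),
        )
        if cur.rowcount == 0:  # first person for this (city, birthday) pair
            conn.execute(
                "INSERT INTO population_statistics (city, birthday, total_count)"
                " VALUES (?, ?, 1)",
                (city, birthday),
            )


for person in [
    ("a", "guangzhou", "2017-11-02"),
    ("b", "guangzhou", "2017-11-02"),
    ("c", "guangzhou", "2017-11-03"),
    ("d", "shenzhen", "2017-11-02"),
]:
    add_person(conn, *person)

# One row per (city, birthday): read it directly.
exact = conn.execute(
    "SELECT total_count FROM population_statistics WHERE city = ? AND birthday = ?",
    ("guangzhou", "2017-11-02"),
).fetchone()[0]

# Several rows per city or per day: add the counters up.
city_total = conn.execute(
    "SELECT SUM(total_count) FROM population_statistics WHERE city = ?",
    ("guangzhou",),
).fetchone()[0]
day_total = conn.execute(
    "SELECT SUM(total_count) FROM population_statistics WHERE birthday = ?",
    ("2017-11-02",),
).fetchone()[0]
print(exact, city_total, day_total)
```

Deletes and updates would need the mirror-image bookkeeping (decrementing the old counter, incrementing the new one) inside the same transaction.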

Below is a summary of common big data table optimization solutions.

1. Index optimization

Build reasonable, efficient indexes to improve query speed.

I recommend reading this blog post on indexes:

http://blog.csdn.net/brushli/article/details/39677387

2. SQL Optimization

Restructure and tune SQL statements to improve query efficiency. In many cases, this means writing the statements so that they can make use of the indexes.

I suggest reading the blog post on indexes that I wrote:

http://blog.csdn.net/brushli/article/details/39677387

3. Horizontal table sharding

If the data in a table has a natural dimension, such as time, you can split the table into multiple tables along that dimension.

For example, split the table by year, quarter, or month. A query is then split by time period, run against each relevant table, and the partial results are merged.

Similarly, you can split a table by region, placing the data for each region in its own table. A query is then split by region and the results are merged.
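Splitting by time period and merging the results can be sketched as follows. This is a minimal illustration under assumptions: Python's built-in sqlite3 module with an in-memory database stands in for MySQL, the tables are split by birth year, and the names YEARS, table_for, add_person, count_born_on, and count_in_city are hypothetical.

```python
import sqlite3

YEARS = (2016, 2017)  # hypothetical shards: one table per birth year

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
for year in YEARS:
    conn.execute(
        f"CREATE TABLE population_{year} ("
        " population_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT, city TEXT, birthday TEXT)"
    )


def table_for(birthday):
    """Map a 'YYYY-MM-DD' birthday to the table that stores that year."""
    year = int(birthday[:4])
    if year not in YEARS:
        raise ValueError(f"no table for year {year}")
    return f"population_{year}"  # safe to interpolate: year is whitelisted


def add_person(conn, name, city, birthday):
    with conn:
        conn.execute(
            f"INSERT INTO {table_for(birthday)} (name, city, birthday)"
            " VALUES (?, ?, ?)",
            (name, city, birthday),
        )


def count_born_on(conn, birthday):
    """A query on the split key touches exactly one table."""
    row = conn.execute(
        f"SELECT COUNT(*) FROM {table_for(birthday)} WHERE birthday = ?",
        (birthday,),
    ).fetchone()
    return row[0]


def count_in_city(conn, city):
    """A query on another field runs against every table, then merges."""
    total = 0
    for year in YEARS:
        row = conn.execute(
            f"SELECT COUNT(*) FROM population_{year} WHERE city = ?", (city,)
        ).fetchone()
        total += row[0]
    return total


add_person(conn, "a", "guangzhou", "2016-05-01")
add_person(conn, "b", "guangzhou", "2017-11-02")
add_person(conn, "c", "shenzhen", "2017-11-02")

born_on_day = count_born_on(conn, "2017-11-02")
guangzhou_total = count_in_city(conn, "guangzhou")
print(born_on_day, guangzhou_total)
```

The sketch also shows the trade-off: queries that filter on the split key get cheaper, while queries on other fields must visit every table.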

4. Vertical table splitting

Split a table into multiple tables by field. Frequently used fields go in one table, and rarely used or large fields go in another. The database reads data in blocks, and each block has limited capacity, usually several KB to dozens of KB. After the table is split by field, a single I/O usually retrieves many more rows, so query efficiency improves.

For example, a Member table has the following structure:

CREATE TABLE `member` (
  `member_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'Member table id',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'Member name',
  `age` int(32) DEFAULT NULL COMMENT 'Member age',
  `introduction` text COLLATE utf8_bin COMMENT 'Member introduction',
  PRIMARY KEY (`member_id`)
);

The introduction field is a large field that stores each member's introduction. It seriously slows down queries on the table, so it can be moved into a separate table:

CREATE TABLE `member` (
  `member_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'Member table id',
  `name` varchar(128) COLLATE utf8_bin DEFAULT NULL COMMENT 'Member name',
  `age` int(32) DEFAULT NULL COMMENT 'Member age',
  PRIMARY KEY (`member_id`)
);

CREATE TABLE `member_introduction` (
  `member_introduction_id` bigint(64) NOT NULL AUTO_INCREMENT COMMENT 'Member description table id',
  `member_id` bigint(64) DEFAULT NULL COMMENT 'Member id',
  `introduction` text COLLATE utf8_bin COMMENT 'Member introduction',
  PRIMARY KEY (`member_introduction_id`),
  KEY `fk_member_id` (`member_id`),
  CONSTRAINT `fk_member_id` FOREIGN KEY (`member_id`) REFERENCES `member` (`member_id`)
);
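With the table split this way, list-style queries read only the narrow member table, and the large field is joined in only when a single member's details are needed. The following is a minimal sketch of that access pattern, assuming Python's built-in sqlite3 module with an in-memory database in place of MySQL; the helpers add_member, list_members, and member_detail are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    age INTEGER);
CREATE TABLE member_introduction (
    member_introduction_id INTEGER PRIMARY KEY AUTOINCREMENT,
    member_id INTEGER REFERENCES member (member_id),
    introduction TEXT);
CREATE INDEX fk_member_id ON member_introduction (member_id);
""")


def add_member(conn, name, age, introduction):
    """Write the small fields and the large field to their own tables."""
    with conn:  # both inserts commit together
        cur = conn.execute(
            "INSERT INTO member (name, age) VALUES (?, ?)", (name, age)
        )
        member_id = cur.lastrowid
        conn.execute(
            "INSERT INTO member_introduction (member_id, introduction)"
            " VALUES (?, ?)",
            (member_id, introduction),
        )
    return member_id


def list_members(conn):
    """Common query: never touches the large introduction field."""
    return conn.execute(
        "SELECT member_id, name, age FROM member ORDER BY member_id"
    ).fetchall()


def member_detail(conn, member_id):
    """Rare query: join in the introduction for one member only."""
    return conn.execute(
        "SELECT m.name, m.age, i.introduction"
        " FROM member AS m"
        " LEFT JOIN member_introduction AS i ON i.member_id = m.member_id"
        " WHERE m.member_id = ?",
        (member_id,),
    ).fetchone()


alice_id = add_member(conn, "Alice", 30, "A long introduction text for Alice.")
bob_id = add_member(conn, "Bob", 41, "A long introduction text for Bob.")

members = list_members(conn)
detail = member_detail(conn, alice_id)
print(members)
print(detail)
```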

5. Create an intermediate table, trading space for time

In some cases, you can create an intermediate table to speed up the query. For more information, see the example at the beginning of this article.

6. Cache data in memory, trading space for time

Load frequently read, rarely modified data into memory and query it directly from memory.

Popular cache technologies such as Memcache, Redis, and Ehcache can be used.
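The caching idea can be sketched with a plain in-process dictionary. This is only an illustration under assumptions: the dictionary stands in for an external cache such as Memcache or Redis, Python's built-in sqlite3 module with an in-memory database stands in for MySQL, and the class CityCountCache is hypothetical.

```python
import sqlite3


class CityCountCache:
    """Hypothetical read-through cache for per-city population counts."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}      # city -> cached count (stand-in for Redis/Memcache)
        self.db_queries = 0  # how many times the database was actually hit

    def count(self, city):
        if city not in self.cache:  # cache miss: ask the database once
            self.db_queries += 1
            row = self.conn.execute(
                "SELECT COUNT(*) FROM population WHERE city = ?", (city,)
            ).fetchone()
            self.cache[city] = row[0]
        return self.cache[city]

    def add_person(self, name, city, birthday):
        with self.conn:
            self.conn.execute(
                "INSERT INTO population (name, city, birthday) VALUES (?, ?, ?)",
                (name, city, birthday),
            )
        self.cache.pop(city, None)  # invalidate so the next read is fresh


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population ("
    " population_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT, city TEXT, birthday TEXT)"
)

counts = CityCountCache(conn)
counts.add_person("a", "guangzhou", "2017-11-02")
counts.add_person("b", "guangzhou", "2017-11-03")

first = counts.count("guangzhou")   # miss: reads from the database
second = counts.count("guangzhou")  # hit: served from memory
queries_after_reads = counts.db_queries

counts.add_person("c", "guangzhou", "2017-11-04")  # write invalidates the entry
third = counts.count("guangzhou")   # miss again, sees the new row
print(first, second, third, counts.db_queries)
```

Invalidating on every write is the simplest policy; it fits the "rarely modified" data that this approach is meant for.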

7. Use other auxiliary technologies

Solr: a Lucene-based Java search engine

How should I optimize multi-table big data queries in databases?

The only advantage of a view is that it simplifies the SQL during development; its efficiency is no different from querying the underlying tables directly.
A physical table can be used to hold the aggregated data instead. Queries against it are fast, but the data is static: if customer information changes, it must be aggregated again to stay accurate.

How can I optimize the SQL query logic for a large amount of data in a simple table?

First, you have too much data. Consider high-speed disk arrays, or even minicomputers, together with a professional database system.

Second, if your data does not change much, consider adding indexes, especially on fields with widely scattered values such as product_detail.creat_dt, which can significantly improve efficiency. If the table is updated frequently, the indexes should not be too large; otherwise, inserting a row becomes very slow.

If your application is very complex, data is updated in large quantities, and fast queries are still required, there are ways to trade space for time. One is to distribute the database across different servers according to certain rules: at query time, the program first examines the conditions and then sends the query to the appropriate servers. There are many database sharding methods. For example, you can store each day's data on a different server; the program checks the query conditions and then calls the corresponding server.

Another typical way to trade space for time is to merge the two tables into one so that no join is needed at query time.

This article is an English version of an article originally published in Chinese on aliyun.com.
