A Summary of SQL Processing for Tens of Millions of Rows of Data


1. Too much data to fit in a single table.

For example, consider a monthly billing cycle: 10 million rows a month becomes 120 million a year, so letting one table accumulate indefinitely will not work. Instead, base each table on one cycle of data; within a single cycle, the data may even be split into several sub-tables, depending on the actual volume. When a new table needs an index, do not build it before the load: create the table first, import the data, and only then create the index.
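
As a minimal sketch of that load-then-index order (assuming MySQL-style syntax; the table and column names, such as bill_200701, are hypothetical):

```sql
-- One table per billing cycle; create it without secondary indexes.
CREATE TABLE bill_200701 (
    bill_id    BIGINT        NOT NULL,
    cust_id    BIGINT        NOT NULL,
    segment_id INT           NOT NULL,
    amount     DECIMAL(12,0) NOT NULL,
    created_at DATETIME      NOT NULL
);

-- Bulk-load the cycle's data first, e.g.:
-- LOAD DATA INFILE '/data/bill_200701.txt' INTO TABLE bill_200701;

-- Build the index only after the import; this is much faster than
-- maintaining it row by row during the load.
CREATE INDEX ix_bill_cust ON bill_200701 (cust_id);
```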

If necessary, back up old cycles to tape or other media, then purge them.

From the perspective of the problem domain, the data is most strongly correlated within a single cycle. For example, when you total a customer's bill for a given billing cycle, you compare it against the previous month or the same month last year, and you flag zero-call customers; the reference data is the current cycle, the previous one or two cycles, perhaps a quarter or half a year, and occasionally a full year (say, for a report on customers three months overdue on payment). Anything broader tends to be data mining or a high-level management report; the day-to-day interfaces used by the business departments generally do not include such statistics.

Therefore, the data can be separated by table, or even by database, which makes it much easier to manage.

We also need to dispel an ingrained idea: that such data is like garbage waiting for the sanitation workers, to be hauled away in batches once it is no longer current. Historical data will never be used as frequently as regular data (such as basic customer records), but instead of cleaning it out when it is not needed, change your mindset and organize it so it can be used on demand. That is, even if a dataset is split across 100 or 1,000 tables, you should still be able to run statistics conveniently and obtain the data you need.

A view only spares you from writing the SELECT statement each time; it does not improve speed.

The main technique is table sharding, which reduces how much data each query has to touch and therefore increases speed. For example, if the data a statistical report needs happens to live entirely in one shard, the rest can be skipped: with 10 segments, a report on the segment with id = 1 only involves the first sub-table, the application can make that routing decision in memory, and accessing just that one sub-table speeds up the statistics. If a report requires data from every sub-table, however, processing remains just as slow.
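
A sketch of that routing idea, assuming ten hypothetical sub-tables bill_0 through bill_9 split by segment; the application chooses the shard in memory, so only one sub-table is scanned:

```sql
-- Report on segment 1 only: the application knows id = 1 maps to
-- this sub-table and queries just that shard.
SELECT SUM(amount) AS total
FROM bill_1
WHERE segment_id = 1;

-- A report over all segments still has to visit every sub-table,
-- so it is no faster than an unsharded scan:
-- SELECT SUM(amount) FROM bill_0
-- UNION ALL SELECT SUM(amount) FROM bill_1
-- ... and so on through bill_9.
```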

2. If each table holds on the order of 100,000 rows, statistics hit no bottleneck; any mainstream database handles this without problems.

3. The necessity of preprocessing.

Someone asked: why does it take so long to aggregate 10 million rows of data, and can it be improved? Consider the scale of the problem: how long would it take you to total the deposits of everyone in China? However sophisticated the DBMS is, it cannot escape the fundamental work: find the qualifying rows and add them up one by one (setting the WHERE condition aside for the moment). Preprocessing is necessary because data processing at this scale is inherently time-consuming, so the results should be computed in advance into one or more tables, and queries should simply display those results. For example, if the 10 million rows are divided into 10 segments and we want to see the growth of each segment, we can pre-aggregate into a per-segment expense table, and the client report then renders very quickly. Computing every data summary from raw data on demand is unrealistic. Therefore, set up raw data tables, intermediate result tables, result tables, summary tables, monthly tables, and period tables, and roll the statistics up step by step.
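
A hedged sketch of such pre-aggregation, reusing the hypothetical bill_200701 table from point 1 (the segment_expense result table is likewise an assumption):

```sql
-- Result table, written once per cycle by the batch job.
CREATE TABLE IF NOT EXISTS segment_expense (
    cycle      CHAR(6)       NOT NULL,   -- e.g. '200701'
    segment_id INT           NOT NULL,
    total      DECIMAL(15,0) NOT NULL,
    PRIMARY KEY (cycle, segment_id)
);

-- The expensive scan over 10 million rows happens here, once.
INSERT INTO segment_expense (cycle, segment_id, total)
SELECT '200701', segment_id, SUM(amount)
FROM bill_200701
GROUP BY segment_id;

-- Client reports read only the small result table and return instantly.
SELECT segment_id, total FROM segment_expense WHERE cycle = '200701';
```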

Moreover, such jobs are necessarily time-consuming! If a stored procedure on the server runs them on a regular schedule, the full-scale processing happens only once, and every client report is generated from the result tables alone. Without this approach, every client report would be generated from the raw data: theoretically workable, but the aggregation over tens of millions of records would then run N times, and the time budget does not allow it.

In addition, it is best to run the statistical process in a separate database; public data, such as basic customer records, should be copied into that database for processing, so normal use is not disturbed.

You can run this process at night, on another database or even another server. After it finishes, write a flag to notify the primary database, and clients can then query these reports.
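
One possible shape for that flag handshake (a sketch only; the job_status table and its columns are assumptions):

```sql
-- On the statistics server, as the last step of the nightly batch:
UPDATE job_status
SET state = 'DONE', finished_at = NOW()
WHERE job_name = 'monthly_summary' AND cycle = '200701';

-- The primary side (or the client) checks the flag before it
-- offers the reports for that cycle:
SELECT state
FROM job_status
WHERE job_name = 'monthly_summary' AND cycle = '200701';
```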

4. Computed fields for a single row of data. For example, a record's creation time might be 12:00:00.001, while your statistics only need hour-level granularity. In that case it is best to add a dedicated field, such as an hour column, and run a batch job beforehand to populate it, so the statistics can work directly from that field.
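
A minimal sketch, assuming MySQL's HOUR() function and the hypothetical bill_200701 table from point 1:

```sql
-- Add the computed field once.
ALTER TABLE bill_200701 ADD COLUMN hour_of_day TINYINT;

-- A batch job fills it in ahead of the statistics run.
UPDATE bill_200701 SET hour_of_day = HOUR(created_at);

-- Statistics then group on the plain integer column.
SELECT hour_of_day, COUNT(*) AS calls
FROM bill_200701
GROUP BY hour_of_day;
```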

5. Do not wrap columns in functions inside query conditions. Applying a function to a column prevents the index from being used, so the condition is evaluated by traversing all the data: a query that should fetch a single row ends up scanning every row.
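
The difference, sketched with the hypothetical columns from point 4:

```sql
-- Bad: the function on the column defeats any index on created_at,
-- so every row must be examined.
SELECT COUNT(*) FROM bill_200701 WHERE HOUR(created_at) = 12;

-- Better: a plain, indexable column...
SELECT COUNT(*) FROM bill_200701 WHERE hour_of_day = 12;

-- ...or a range predicate that leaves the column untouched.
SELECT COUNT(*) FROM bill_200701
WHERE created_at >= '2007-01-15 12:00:00'
  AND created_at <  '2007-01-15 13:00:00';
```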

6. All conditions should be numbers, that is, IDs. Segments, towns, business types, access types, customer addresses, and so on should all be encoded via foreign keys, with only the numeric ID stored in the main table. Remember: it must be a numeric ID, because integers are the fastest data type; for large values you can use decimal with a scale of 0. The varchar type is very inefficient in conditions. (There is apparently an SQL MD5 function that could hash strings into comparable values; I have wanted to try this method but have not yet.)
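
A sketch of the encode-everything-as-IDs design (the lookup tables and all names are hypothetical):

```sql
-- Lookup tables hold each text value exactly once.
CREATE TABLE town (
    town_id INT PRIMARY KEY,
    name    VARCHAR(64) NOT NULL
);

CREATE TABLE biz_type (
    biz_type_id INT PRIMARY KEY,
    name        VARCHAR(64) NOT NULL
);

-- The main table stores only compact numeric IDs.
CREATE TABLE call_record (
    record_id   BIGINT        NOT NULL,
    cust_id     BIGINT        NOT NULL,
    town_id     INT           NOT NULL,  -- FK to town
    biz_type_id INT           NOT NULL,  -- FK to biz_type
    amount      DECIMAL(12,0) NOT NULL
);

-- Conditions compare integers, never varchar strings:
SELECT SUM(amount) FROM call_record WHERE town_id = 42;
```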

7. Indexing, the primary means of querying massive data.

No index means a full traversal; and if the query's conditions are not covered by an index, the query traverses everything all the same.
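
A sketch of both cases, again on the hypothetical bill_200701 table:

```sql
-- With the index on cust_id from point 1, an equality lookup
-- uses the index instead of traversing the table:
SELECT amount FROM bill_200701 WHERE cust_id = 100123;

-- A condition on a column no index covers is still a full scan:
SELECT amount FROM bill_200701 WHERE amount > 500;

-- A composite index covering both the condition and the selected
-- column lets the query be answered from the index alone:
CREATE INDEX ix_bill_cust_amount ON bill_200701 (cust_id, amount);
SELECT amount FROM bill_200701 WHERE cust_id = 100123;
```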

8. For complex statistics, it is much easier and clearer to process the data step by step, holding intermediate results in memory or work tables, and then produce the final result.

It also takes far less time. Complex statistics may require conditional judgments and loops, which a single SELECT statement cannot express; multi-layer nested WHERE clauses are likewise inefficient, and writing intermediate results to work tables is the easier approach.
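
A sketch of the step-by-step style, using a temporary work table instead of one deeply nested statement (MySQL-style syntax; prev_month_total is a hypothetical result table from the previous cycle):

```sql
-- Step 1: reduce the raw data to a small intermediate result.
CREATE TEMPORARY TABLE tmp_over_limit AS
SELECT cust_id, SUM(amount) AS month_total
FROM bill_200701
GROUP BY cust_id
HAVING SUM(amount) > 500;

-- Step 2: work from the small intermediate table, rather than
-- nesting the aggregation as a subquery in one giant SELECT.
SELECT t.cust_id, t.month_total, p.month_total AS prev_total
FROM tmp_over_limit t
JOIN prev_month_total p ON p.cust_id = t.cust_id;
```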

In principle, the problems discussed here are not small cases of website content management; they are mainly enterprise workloads. For example, a report on growth among existing customers is not as simple as directly comparing the total calls of two months: you have to examine how each customer actually paid, for instance counting only the amounts that exceed a given threshold toward the statistics. That is why I believe complex problems require stored procedures. Only after a few such projects did I understand that writing SQL statements can be a much bigger job than writing program code: the real program is actually the SQL.

Finally, with enough experience you will find it normal for a well-written statistical procedure to run for minutes or even hours. Beginners should understand that processing time grows with data volume: if a handful of rows is processed instantly in everyday use, do not assume that when the volume suddenly grows by several orders of magnitude the time can still be optimized down to a few seconds.

MRP calculation in an ERP system usually takes several hours; this is normal. (It is mainly caused by the large number of materials, the large BOMs, and the many calculation steps.)

9. One more point. If the data volume goes beyond the tens of millions in the title to hundreds of millions or even billions of rows, the answer is again divide and conquer: run the processing on multiple servers in parallel. It is like donating money to a disaster area: it cannot rely on the strength of one person; it takes a great many people. Tasks such as rating call records are similar: each server needs only the raw data, the basic reference data, and some billing policies, so the data can be distributed across multiple servers and processed simultaneously. How many servers depends on your data volume, the processing speed of a single server, and the total processing time you require. Some people ask: does even a SELECT need to be distributed? Only if necessary. For example, to return the abnormal records for all phone numbers, you can retrieve from each server and merge the results together.
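
A sketch of that final merge step: each server runs the same filter over its own share of the data, and a coordinator combines the partial results (the result_serverN tables stand in for the per-server result sets; all names are hypothetical):

```sql
-- On every server, over its local shard:
-- SELECT phone_no FROM bill_shard WHERE status = 'ABNORMAL';

-- On the coordinator, after collecting each server's rows:
SELECT phone_no FROM result_server1
UNION ALL
SELECT phone_no FROM result_server2
UNION ALL
SELECT phone_no FROM result_server3;
```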

All in all:

I. Design the table structure rationally so that statistical summarization is as efficient as possible (including FK design, numeric IDs instead of varchar, index design, and computed fields);

II. Shard tables appropriately so that the data volume of a single table stays manageable;

III. Process complex statistics step by step, in memory or in work tables;

IV. Preprocess the data;

V. Distribute the processing across multiple servers in parallel.

In short: divide and conquer, plus preprocessing.
