Processing SQL Data at the Tens-of-Millions Scale: An Overview
Source: Internet
Author: User
1. Too much data cannot go in a single table.
Take the per-period tables, for example: ten million rows a month is 120 million a year, so the data certainly cannot be allowed to accumulate in one table. Instead, each period of data gets its own table, and even a single period's data may be split across several tables; the deciding factor is the actual data volume. When you create a new table that needs an index, do not bulk-load into the indexed table: either drop the index first, or create the table, import the data, and only then build the index.
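As a sketch of that load order, using SQLite in place of a production DBMS (the table and index names here are made up):

```python
import sqlite3

# Hypothetical per-period table: calls_202401 holds January 2024's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls_202401 (cust_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_calls_cust ON calls_202401 (cust_id)")

rows = [(i % 1000, i * 0.01) for i in range(100_000)]

# Drop the index so the bulk insert does not maintain it row by row...
conn.execute("DROP INDEX idx_calls_cust")
conn.executemany("INSERT INTO calls_202401 VALUES (?, ?)", rows)
# ...then build it once over the finished table.
conn.execute("CREATE INDEX idx_calls_cust ON calls_202401 (cust_id)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM calls_202401").fetchone()[0]
```

Maintaining the index on every insert is what makes the naive order slow; building it once over the finished table does the same work in a single pass.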
When necessary, back the data up to tape or other media once the statistics are complete, and then purge it.
From the problem domain's point of view, most data relationships fall within a single period. For example: a customer's total charges for one month, month-over-month growth, customers with zero charges, and so on. The reference data for such statistics is at most one or two periods, occasionally a quarter or half a year (for example, "three consecutive months with zero calls" or "three consecutive months in arrears"; such reports may touch a year of data). Cases like these are common in data mining and high-level management reports, but the interfaces used by ordinary business departments rarely contain such statistics.
So dividing the data by table, or even by database, makes it much easier to manage.
We need to drop the assumption that all of this data must be treated like long-lived, frequently used reference data (such as customer profiles). Think of it instead like sanitation workers handling garbage: mostly batch processing, with a little manual handling. In other words, change the mindset: do the processing when the data is needed, and clean it up when it is not. Under this model you can split the data into 100 or even 1,000 tables, as long as it remains easy to compute statistics and retrieve the data you need.
A view merely lets you write a simpler SELECT statement; it brings no improvement in speed.
The main point is that a good table layout reduces how much data a statistic must touch, and that is what improves speed. For example, suppose you have 10 divisions and each division's data goes into its own table. To compute statistics for the division with id = 1, you decide in memory which table to read and access only the first table, which speeds up the statistic. If a statistic must read all the data in every table, processing will still be just as slow.
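A minimal sketch of that routing idea, again with SQLite and hypothetical table names: the division id picks the table in memory, so only one table is ever scanned.

```python
import sqlite3

# Invented layout: data for division N lives in table calls_div_N.
conn = sqlite3.connect(":memory:")
for div in range(1, 4):
    conn.execute(f"CREATE TABLE calls_div_{div} (cust_id INTEGER, amount REAL)")
    conn.executemany(f"INSERT INTO calls_div_{div} VALUES (?, ?)",
                     [(c, div * 10.0) for c in range(100)])

def division_total(division_id: int) -> float:
    # The table name is decided in memory; only that table is read.
    table = f"calls_div_{division_id}"
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]

total = division_total(1)
```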
2. If each table holds hundreds of thousands of rows, statistics hit no bottleneck; a regular database handles this without problems.
3. Preprocessing is necessary.
Some people ask: "Summarizing 10 million rows takes how long? Can it be improved?" Just imagine how long it would take to add up every savings account in China. Look at the scale of the problem: however sophisticated the DBMS, it cannot escape the basic process of finding the rows that match the conditions and summing them one by one (leave the WHERE condition aside for the moment). Preprocessing is necessary because data processing at this scale is inherently time-consuming, so we should compute the results in advance into one table, or several, and merely display them when the user queries. For example, with 10 million rows split across 10 divisions, to see each division's receivables growth we can pre-aggregate into a divisional expense table; the client report then renders very quickly. Expecting every rollup to be computed from the raw data on demand is unrealistic. So we set up raw-data tables, intermediate-result tables, result tables, summary tables, month tables, period tables, and so on, and attribute the statistics step by step.
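A toy version of the raw-table-to-result-table pipeline (SQLite, with illustrative table and column names): the aggregation runs once, as a scheduled job would, and the report reads only the small summary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_fees (division_id INTEGER, month TEXT, amount REAL)")
# 10 divisions x 1000 raw rows each, standing in for the 10-million-row case.
conn.executemany("INSERT INTO raw_fees VALUES (?, ?, ?)",
                 [(d, "2009-01", 1.0) for d in range(1, 11) for _ in range(1000)])

# Run once, e.g. from a periodic server-side job / stored procedure.
conn.execute("""
    CREATE TABLE fee_summary AS
    SELECT division_id, month, SUM(amount) AS receivable
    FROM raw_fees GROUP BY division_id, month
""")

# The client report is now a cheap lookup, not a full scan of the raw data.
row = conn.execute(
    "SELECT receivable FROM fee_summary WHERE division_id = 1 AND month = '2009-01'"
).fetchone()
```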
The other thing to mention is that this step is necessarily time-consuming! If the processing is run periodically by the server's stored procedures, it happens only once, and every client generates its reports from the results table. Without this approach, every client report would be generated from the raw data; that is theoretically possible, but the same ten-million-row rollup would then be performed n times, and the time cost is unacceptable.
Also, such statistical processing is best given a separate database for storage, and common data such as customer profiles is best copied into that new DB for the processing to use. This way it does not interfere with normal operation.
You can run this process at night, in another DB or on another server. After it finishes, write a flag back to the main DB so clients know these reports can be generated.
4. Add calculated fields to each row of data. For example, if a record is produced at 2009-01-01 12:00:00.001 and your statistics only need to bucket by a certain time unit, it is best to add a field for it, such as an hour field: during the next batch load, extract the hour and store it, then compute the statistics on that field.
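For instance (SQLite, hypothetical schema), the hour can be derived once during the batch load so that the statistic groups on a plain integer column instead of parsing timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'hour' is the calculated field, filled in at load time.
conn.execute("CREATE TABLE events (ts TEXT, hour INTEGER, amount REAL)")

batch = [("2009-01-01 12:00:00.001", 5.0),
         ("2009-01-01 12:30:00.000", 7.0),
         ("2009-01-01 13:00:00.000", 2.0)]
# Derive the hour once, during the batch load (chars 11-12 of the timestamp).
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(ts, int(ts[11:13]), amt) for ts, amt in batch])

# The statistic now groups on an integer column.
per_hour = conn.execute(
    "SELECT hour, SUM(amount) FROM events GROUP BY hour ORDER BY hour"
).fetchall()
```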
5. Applying a function to a column in a query is taboo, because the function prevents the condition from using the index and forces a walk through all the data. You end up traversing everything just to look up a single row, which is pathetic.
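SQLite's EXPLAIN QUERY PLAN makes the effect visible. In this sketch (illustrative names; the exact plan wording varies by SQLite version) the same date filter is written both ways:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fees (bill_date TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_fees_date ON fees (bill_date)")

# Function on the column: the optimizer must scan every row.
bad = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM fees "
    "WHERE substr(bill_date, 1, 7) = '2009-01'"
).fetchone()[-1]

# Same filter as a range on the bare column: the index is used.
good = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM fees "
    "WHERE bill_date >= '2009-01-01' AND bill_date < '2009-02-01'"
).fetchone()[-1]
```

The `bad` plan reports a table scan, while `good` reports a search using the index.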
6. Make the conditions numeric wherever possible; that is, use IDs for everything. Division, town, business type, access type, customer address, and so on should all be coded via FKs, with the main table storing only the numeric ID. Remember: a numeric ID. Integers are the fastest data type to compute with. For large amounts you can use decimal (with scale 0). The varchar type is inefficient; SQL does seem to offer an MD5 hashing function, and I think that method could be worth trying for strings (I have not tried it yet).
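A sketch of that layout in SQLite, with made-up dimension and fact tables: the big table stores only integer FKs, and names are decoded to text only in the final report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE division (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE biz_type (id INTEGER PRIMARY KEY, name TEXT);
    -- The fact table compares, filters, and groups on integers only.
    CREATE TABLE fee (
        cust_id     INTEGER,
        division_id INTEGER REFERENCES division(id),
        biz_type_id INTEGER REFERENCES biz_type(id),
        amount      REAL
    );
""")
conn.execute("INSERT INTO division VALUES (1, 'North'), (2, 'South')")
conn.execute("INSERT INTO biz_type VALUES (1, 'voice'), (2, 'data')")
conn.executemany("INSERT INTO fee VALUES (?, ?, ?, ?)",
                 [(c, 1 + c % 2, 1, 2.5) for c in range(10)])

# Join to the small lookup table only when rendering the report.
report = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fee f JOIN division d ON d.id = f.division_id
    GROUP BY d.name ORDER BY d.name
""").fetchall()
```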
7. Indexes: this is how massive-data queries are solved.
Without an index, you get a traversal. If the index does not cover the query, you still walk the table.
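For example, in SQLite an index that contains every column a query touches is reported as a covering index, meaning the table itself is never read (the index and table names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fee (cust_id INTEGER, month TEXT, amount REAL)")
# Covering index for "sum of amount per customer": the search key plus
# the summed column, so the query never touches the table rows.
conn.execute("CREATE INDEX idx_cover ON fee (cust_id, amount)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM fee WHERE cust_id = 42"
).fetchone()[-1]
```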
8. For complex statistics, process step by step through intermediate results and then assemble the final figures; this is much easier to write and to understand than one giant SELECT statement.
It also occupies each table for far less time. Of course, very complex statistics may need conditional logic, loops, and so on that a single SELECT statement cannot express; a WHERE clause nested many layers deep is also inefficient and easily ties up a table.
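A small illustration of the stepwise style (invented names): the intermediate result is materialized into a temp table, and the final figure is computed from that small table rather than from one nested SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fee (cust_id INTEGER, month TEXT, amount REAL)")
conn.executemany("INSERT INTO fee VALUES (?, ?, ?)",
                 [(c, m, 10.0) for c in range(5) for m in ("2009-01", "2009-02")])

# Step 1: per-customer totals into a temp table; the big table is
# touched once and released quickly.
conn.execute("""
    CREATE TEMP TABLE cust_total AS
    SELECT cust_id, SUM(amount) AS total FROM fee GROUP BY cust_id
""")

# Step 2: the final statistic reads only the small intermediate table.
big_spenders = conn.execute(
    "SELECT COUNT(*) FROM cust_total WHERE total >= 20.0"
).fetchone()[0]
```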
In principle, the problems discussed here are not the small cases of website content management, but enterprise applications. For example, a "stock customer growth chart" is not as simple as directly comparing two months' totals; it may also require finding each statistical object's previous phone numbers, how much more they spent than the sum before, and so on. So, in my view: complex problems must go into stored procedures. Do a few real projects and you will see that you write more SQL than application code. The real program, in fact, is the SQL.
Finally, even with plenty of experience, it is normal for such a statistical procedure to run for minutes or even hours. Beginners should understand that the amount of data is proportional to the processing time. If processing a small amount of data feels instant, do not assume that when the volume suddenly grows by several orders of magnitude the time can still be optimized down to a few seconds.
An MRP run in an ERP system commonly takes hours; that is all normal (mainly because there are many materials, many BOMs, and too many calculation steps).
9. One more point. If the data volume exceeds the tens of millions in our title, even reaching billions, there is still no problem: the same divide-and-conquer idea applies, putting the data on multiple servers to run in parallel. Like donations to a disaster area, one person's strength is not enough; many people together have great power. Data such as raw records, basic information, billing strategies, and so on can, when necessary, be distributed across multiple servers and processed at the same time. It depends on your data volume, the processing speed of a single machine, and the total processing time you require. Some people ask: do SELECT statements also need to be distributed? Only if it is necessary. For example, to return all exception records, you can retrieve them from each table and join the results together; I think that works.
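As a toy model of that fan-out, each "server" below is just a separate in-memory SQLite database holding one shard; the coordinator sends the same query to every shard and merges the partial sums (the sharding scheme is invented for illustration):

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_shard(shard_id: int) -> sqlite3.Connection:
    # One in-memory DB per "server"; shard sizes differ for illustration.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE fee (cust_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO fee VALUES (?, ?)",
                     [(c, 1.0) for c in range(shard_id * 100)])
    return conn

shards = [make_shard(s) for s in (1, 2, 3)]

def partial_sum(conn: sqlite3.Connection) -> float:
    # Each shard computes its own partial result.
    return conn.execute("SELECT COALESCE(SUM(amount), 0) FROM fee").fetchone()[0]

# The coordinator runs the shards in parallel and merges the partials.
with ThreadPoolExecutor() as pool:
    grand_total = sum(pool.map(partial_sum, shards))
```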
In summary:
One. Design the table structure rationally so that statistical summaries are as efficient as possible (including FK design with numeric IDs, no varchar, index design, and calculated fields);
Two. Split tables sensibly so that each table holds an appropriate volume of data;
Three. Process complex statistics step by step through intermediate results;
Four. Preprocess the data in advance;
Five. Distribute the work across multiple servers and run it in parallel.