How to handle massive data in a database: experience and techniques for processing large volumes of data


Massive data is the development trend, and data analysis and mining are becoming more and more important. Extracting useful information from massive data is both important and urgent: the processing must be accurate, the precision high, and the processing time short, so that valuable information can be obtained quickly. Research on massive data is therefore very promising and worth broad and deep study. Data mining on massive data is gradually on the rise; faced with enormous volumes, general mining software or algorithms often fall back on sampling, which keeps the error low while greatly improving processing efficiency and the success rate.

In real work environments, many people run into the complex and arduous problem of massive data. Its main difficulties are as follows:

1. The amount of data is too large, and anything may be lurking in it

If there are only ten or so records, you can check each one by hand; with a few hundred you might still consider manual handling; but once the data reaches tens of millions, or even hundreds of millions of records, it cannot be handled by hand and must be processed with tools or programs. Moreover, in such vast amounts of data, any situation can occur: for example, a record whose format is wrong. The run may proceed normally for a long stretch and then suddenly hit a bad record in one spot, and the program terminates.

2. High software and hardware requirements; system resource usage is high

To process massive data, besides a good method, the most important thing is to use tools sensibly and to allocate system resources reasonably. In general, if the data to be processed exceeds the TB level, a minicomputer should be considered; an ordinary machine can work if good methods are available, but the CPU and memory must still be upgraded. It is like facing an army: the bravery of a single soldier, without strategy, rarely wins the battle.

3. High demands on processing methods and skills

This is also the purpose of this article. Good processing methods are the accumulation of an engineer's long-term work experience and a summary of personal experience. There is no universal method, but there are general principles and rules.

Below is a detailed description of the experience and techniques for handling massive data:

1. Use excellent database tools

There are many database tool vendors, and the choice of database tools matters a great deal when processing massive data. Oracle or DB2 are commonly used, and Microsoft's recently released SQL Server 2005 also performs well. In addition, tools in the BI field must be chosen: databases, data warehouses, multidimensional databases, data mining, knowledge bases and other related tools. Good ETL tools and good OLAP tools, such as Informatica and Essbase, are very necessary. In an actual data analysis project, processing 60 million log records per day took 6 hours with SQL Server 2000 and only 3 hours with SQL Server 2005.

2. Write excellent program code

Processing data is inseparable from good program code; especially for complex data processing, a program must be used. Good program code matters not only for the accuracy of the processing but also for its efficiency. Good program code should contain a good algorithm, a good processing flow, good efficiency and a good exception-handling mechanism.
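As a minimal sketch (not from the original article), the hypothetical Python loop below illustrates the kind of exception handling meant here: a malformed record is logged and skipped instead of terminating the whole run, as described in difficulty 1 above.

```python
import csv
import logging

logging.basicConfig(filename="process_errors.log", level=logging.WARNING)

def process_file(path, handle_record):
    """Process a large delimited file record by record.

    A single malformed record is logged and skipped instead of
    aborting the whole run (hypothetical example).
    """
    ok, failed = 0, 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            try:
                handle_record(row)
                ok += 1
            except Exception as exc:  # the exception-handling mechanism
                failed += 1
                logging.warning("line %d skipped: %s", line_no, exc)
    return ok, failed
```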

3. Partition the massive data

Partitioning large amounts of data is necessary. For example, data accessed by year can be partitioned by year. Different databases have different partitioning methods, but the processing mechanism is basically the same. SQL Server, for instance, partitions a table by storing different data in different filegroups, and the filegroups on different disk partitions, so that the data is spread out, disk I/O is reduced and the system load is lowered; logs, indexes and so on can also be placed in different partitions.
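For illustration only, here is roughly what yearly partitioning might look like in SQL Server 2005 or later, kept as a DDL string inside a small Python script; the table and filegroup names (web_log, fg2007 through fg2010) are hypothetical.

```python
# Hypothetical yearly partitioning DDL for SQL Server 2005+, held as a string
# so it can be reviewed or run from a deployment script.
PARTITION_DDL = """
CREATE PARTITION FUNCTION pf_log_year (datetime)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');

CREATE PARTITION SCHEME ps_log_year
AS PARTITION pf_log_year TO (fg2007, fg2008, fg2009, fg2010);

CREATE TABLE web_log (
    log_time  datetime NOT NULL,
    url       varchar(2000),
    user_ip   varchar(15)
) ON ps_log_year (log_time);
"""

if __name__ == "__main__":
    print(PARTITION_DDL)  # or execute it against the server with a DB driver
```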

4. Establish extensive indexes

For massive data processing, indexing large tables is a must. Building indexes must take the specific situation into account: for fields used in grouping, sorting and so on in a large table, build the corresponding index, and usually a composite index can be created; be careful with indexes on tables that are frequently inserted into. In one ETL flow the author handled, the index was first dropped before inserting into the table; after the insert completed, the index was rebuilt and the aggregation was carried out; the next load again dropped and rebuilt the index in the same way. So the timing of index creation has to be chosen well, and the index fill factor and the choice between clustered and nonclustered indexes also have to be considered.
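A rough sketch of the drop-index / bulk-insert / rebuild-index pattern described above, assuming a SQL Server connection through pyodbc; the table and index names (fact_visit, stage_visit, ix_fact_visit_day) are invented for illustration.

```python
import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Illustrative ETL steps only; object names are hypothetical.
STEPS = [
    "DROP INDEX ix_fact_visit_day ON fact_visit;",                  # 1. drop index before the bulk insert
    "INSERT INTO fact_visit SELECT * FROM stage_visit;",            # 2. load without index maintenance
    "CREATE INDEX ix_fact_visit_day ON fact_visit (visit_day);",    # 3. rebuild the index for aggregation
]

def run_etl(conn_str):
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        cur = conn.cursor()
        for sql in STEPS:
            cur.execute(sql)
```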

5. Establish a caching mechanism

When the amount of data increases, general processing tools must take caching into account. The cache size setting is also related to the success or failure of the processing. For example, when the author ran an aggregation over 200 million records, the cache was set to 100,000 records per buffer, which proved feasible for data at this scale.
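The sketch below shows one way such a buffer might look in Python: lines are read 100,000 at a time so only one buffer is in memory, and each buffer is folded into the running totals. The file name and the "date<TAB>count" layout are assumptions made for the example.

```python
from itertools import islice

BUFFER_ROWS = 100_000  # matches the 100,000-records-per-buffer setting above

def read_in_buffers(file_obj, size=BUFFER_ROWS):
    """Yield lists of at most `size` lines so only one buffer is in memory."""
    while True:
        chunk = list(islice(file_obj, size))
        if not chunk:
            return
        yield chunk

# usage sketch: aggregate a hypothetical "date<TAB>count" log one buffer at a time
totals = {}
with open("visits.log", encoding="utf-8") as f:
    for buffer in read_in_buffers(f):
        for line in buffer:
            day, count = line.rstrip("\n").split("\t")
            totals[day] = totals.get(day, 0) + int(count)
```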

6. Increase virtual memory

If system resources are limited and out-of-memory messages appear, the problem can be solved by adding virtual memory. In an actual project the author had to process 1.8 billion records on a machine with 1 GB of memory and one P4 2.4 GHz CPU. Running the aggregation over such a volume raised out-of-memory errors, which were solved by increasing virtual memory: six 4096 MB page files were created on six disk partitions, raising virtual memory to 4096 * 6 + 1024 = 25600 MB and resolving the memory shortage during processing.

7. Process in batches

Massive data processing is difficult because the volume is large, so one technique for solving it is to reduce the volume. Massive data can be processed in batches and the processed results merged afterwards; divided piece by piece like this, it favours processing small amounts of data and avoids facing the full volume all at once. However, this method depends on the data being splittable: if it cannot be split, another way has to be found. Since data is generally stored by day, by month, by year and so on, it can usually be handled with the divide-first-then-merge approach, processing the pieces separately.
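A simple illustration of divide-first-then-merge in Python, assuming hypothetical daily log files under logs/: each day is aggregated on its own, and the small per-day results are merged at the end.

```python
import glob
from collections import Counter

def process_one_day(path):
    """Aggregate a single day's log file; returns per-URL hit counts."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            url = line.split()[0]  # assumed format: URL is the first field
            counts[url] += 1
    return counts

# divide first, merge afterwards: one small job per daily file, then combine
grand_total = Counter()
for day_file in sorted(glob.glob("logs/2010-04-*.log")):  # illustrative layout
    grand_total.update(process_one_day(day_file))
```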

8. Use temporary tables and intermediate tables

When the data volume increases, summarizing ahead of time should be built into the processing flow. The purpose is to turn the big into the small: break the large table into small tables, process each block, and then merge them by some rule. The use of temporary tables and the preservation of intermediate results are very important here; if the volume is so huge that the large table cannot be processed at all, it can only be split into several small tables. If a multi-step rollup is needed, summarize step by step following the stages rather than trying to finish everything in one statement; you cannot get fat on a single mouthful.
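As a hedged example of such a step-by-step rollup, the two hypothetical T-SQL statements below first reduce a large web_log table into a temporary daily summary and then summarize that intermediate table up to the monthly level.

```python
# Hypothetical rollup using SQL Server temporary tables; names are invented.
ROLLUP_STEPS = [
    # step 1: reduce the raw table to one row per user per day
    """SELECT user_id, visit_day, COUNT(*) AS hits
       INTO #daily_hits
       FROM web_log
       GROUP BY user_id, visit_day;""",
    # step 2: summarize the intermediate table up to the monthly level
    """SELECT user_id, MONTH(visit_day) AS visit_month, SUM(hits) AS hits
       INTO #monthly_hits
       FROM #daily_hits
       GROUP BY user_id, MONTH(visit_day);""",
]
```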

9. Optimize query SQL statements

When querying massive data, the performance of the query's SQL statement has a significant impact on query efficiency. Writing highly efficient SQL scripts and stored procedures is the database worker's responsibility and a standard for measuring the database worker's level. When writing SQL statements, for example, reduce joins and use few or no cursors; it is also necessary to design an efficient database table structure. In the author's work, an attempt to process 100 million rows of data with a cursor ran for 3 hours without producing a result, and a program had to be used instead.
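To make the cursor-versus-set-based point concrete, here is an illustrative comparison (table and column names invented): the set-based UPDATE lets the engine handle all rows in one pass instead of looping a cursor over them one at a time.

```python
# Two ways to populate a derived flag on a hypothetical very large table.

CURSOR_STYLE = """
DECLARE c CURSOR FOR SELECT visit_id FROM web_log;
-- ... FETCH NEXT and UPDATE one row at a time inside a WHILE loop ...
"""

SET_BASED = """
UPDATE web_log
SET    is_night = CASE WHEN DATEPART(hour, log_time) >= 22 THEN 1 ELSE 0 END;
"""
```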

10. Use text format for processing

General data processing can be done with the database, but complex data processing must use a program. When choosing between having the program operate on the database and having it operate on text, operating on text should be chosen, because operating on text is fast, text processing is less error-prone, and text storage is unrestricted, among other reasons. For example, large numbers of web logs arrive as text or in CSV format (a text format); when data cleaning is involved, handle them with a program, and importing them into the database first for cleaning is not recommended.
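A minimal sketch of program-based text cleaning, assuming the raw web log is a CSV file: rows with the wrong number of columns or an empty key field are dropped before the data ever reaches the database. File paths and the column count are placeholders.

```python
import csv

def clean_log(src_path, dst_path, expected_cols=5):
    """Filter a raw CSV log, keeping only well-formed rows."""
    kept = dropped = 0
    with open(src_path, newline="", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) == expected_cols and row[0].strip():
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```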

11. Establish strong cleaning rules and error-handling mechanisms

Massive data contains inconsistencies, and there is a high likelihood of flaws somewhere. For example, the same time field may hold non-standard time values in some records, caused by application errors, system errors and so on. Therefore, during data processing, strong cleaning rules and an error-handling mechanism must be established.
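As an illustration of such a rule, the hypothetical helper below tries several known time layouts in order and raises a clear error for anything unrecognized, so the caller can route the bad record to an error file instead of letting it corrupt the result.

```python
from datetime import datetime

# Illustrative rule set for a non-standard time field; the formats listed
# are assumptions about what the source systems might produce.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S"]

def normalize_time(raw):
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time value: {raw!r}")
```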

12. Create a view or materialized view

The data in a view comes from base tables. When processing massive data, the data can be scattered across base tables by some rule, and queries or processing can then run against the view; this spreads out the disk I/O, like the difference between hanging a pillar from ten ropes and hanging it from a single one.
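For example, a partitioned view in SQL Server might look roughly like the following (the yearly base table names are invented): queries hit one view while the rows stay spread across several underlying tables.

```python
# Hypothetical SQL Server partitioned view over yearly base tables.
CREATE_VIEW = """
CREATE VIEW v_web_log AS
SELECT * FROM web_log_2008
UNION ALL
SELECT * FROM web_log_2009
UNION ALL
SELECT * FROM web_log_2010;
"""
```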

13. Avoid using 32-bit machines (in extreme cases)

At present many computers are 32-bit, so programs written for them face limits on the memory they can use, while much of massive data processing necessarily consumes a lot of memory. This calls for machines with better performance, and the word-size limit is an important factor as well.

14. Consider operating system issues

In massive data processing, besides the high demands placed on the database and the processing programs, the requirements on the operating system also occupy an important position. A server-class system is generally required, and its security and stability must also meet a high standard. In particular, the operating system's own caching mechanism, its handling of temporary space and other such issues need to be considered comprehensively.

15. Use data warehouses and multidimensional databases for storage

As the amount of data increases, OLAP must be considered. A traditional report might take 5 or 6 hours to produce a result, while a query based on a cube may take only a few minutes, so the way to process massive data is OLAP multidimensional analysis: build the data warehouse, build the multidimensional data sets (cubes), and do report presentation and data mining on top of the cubes.

16. Use sampled data for data mining

Data mining on massive data is gradually on the rise. Faced with enormous volumes, general mining software or algorithms often use sampling, which keeps the error low while greatly improving processing efficiency and the success rate. When sampling, pay attention to the completeness of the data and guard against excessive bias. The author once sampled 4 million rows from a table of 120 million rows; the processing error measured by the test software was 5 per thousand, which the customer could accept.
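A simple sketch of such sampling in Python, using roughly the ratio mentioned above (about 4 million out of 120 million rows); the file paths and the fixed seed are illustrative.

```python
import random

SAMPLE_RATE = 4_000_000 / 120_000_000  # roughly the ratio described above

def sample_rows(src_path, dst_path, rate=SAMPLE_RATE, seed=42):
    """Bernoulli-sample a large text file without loading it into memory."""
    rng = random.Random(seed)
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if rng.random() < rate:
                dst.write(line)
```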

There are also methods that need to be used in different situations, such as surrogate keys, whose advantage is speeding up aggregation, because aggregation over numeric types is much faster than aggregation over character types. Similar choices have to be made for different needs.
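As a small illustration of the surrogate-key idea, the snippet below maps character keys to integer ids once, so later grouping and aggregation work on the integers; the sample URLs are placeholders.

```python
def build_surrogate_keys(values):
    """Assign a small integer surrogate id to each distinct character key."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1  # 1-based surrogate id
    return mapping

urls = ["/index.htm", "/cart.aspx", "/index.htm"]
url_key = build_surrogate_keys(urls)   # {'/index.htm': 1, '/cart.aspx': 2}
coded = [url_key[u] for u in urls]     # [1, 2, 1] -- aggregate on these integers
```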


This article is from a CSDN blog. Please indicate the source when reproducing it: http://blog.csdn.net/cpp_lzth/archive/2010/04/10/5470148.aspx
