In my practical work I have had the opportunity to deal with massive data processing problems. Processing them is an arduous and complex task, for the following reasons:
1. The data volume is huge, and anything may be hiding in it. If there are 10 records, checking each one by hand is no big deal; with a few hundred records manual handling is still conceivable; but once the data reaches tens of millions or even hundreds of millions of rows it cannot be handled manually and must be processed by tools or programs. And in massive data any situation can occur: for example, a format error somewhere in the data means a program can run normally for a long stretch and then suddenly hit the bad spot and terminate.
2. High demands on hardware, software, and system resources. Besides a good processing method, the most important thing is to use tools rationally and allocate system resources sensibly. In general, if the data to be processed exceeds the TB level, a minicomputer should be considered; ordinary machines can work if the method is good, but the CPU and memory must still be upgraded. Otherwise it is like facing thousands of troops and horses without a single soldier of your own: it is hard to win.
3. High demands on processing methods and skills. This is also the purpose of writing this article. A good processing method is the accumulation of an engineer's long-term work experience and a summary of personal practice. There is no universal processing method, but there are general principles and rules.
So what experience and skills are there for dealing with massive data? I will list what I know for your reference:
1. Select excellent database tools
There are many database tool vendors today, and processing massive data places high demands on the database tools used. Oracle or DB2 is generally chosen, and Microsoft's recently released SQL Server 2005 also performs well. In the BI field, tools for databases, data warehouses, multi-dimensional databases, and data mining must also be selected; good ETL and OLAP tools such as Informatica and Essbase are essential. In one of my data analysis projects, processing 60 million log records per day took six hours with SQL Server 2000 and three hours with SQL Server 2005.
2. Write excellent program code
Data processing is inseparable from excellent program code, especially when complex processing is required. Good program code is crucial: it determines not only the accuracy of the processing but also its efficiency. Good code means good algorithms, a good processing flow, good efficiency, and a good exception handling mechanism.
3. Partition massive data
Massive data must be partitioned. For example, data accessed by year can be partitioned by year. Different databases have different partitioning methods, but the underlying mechanism is basically the same. In SQL Server, for instance, partitioning stores different data in different filegroups, and the filegroups are placed on different disk partitions. This spreads the data out, reduces disk I/O and system load, and also allows logs and indexes to be stored on separate partitions.
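As a minimal sketch of the SQL Server 2005 approach (table, column, and filegroup names here are illustrative, and the filegroups are assumed to have been created on separate disk partitions beforehand):

    -- Partition a log table by year; each range maps to its own filegroup/disk.
    CREATE PARTITION FUNCTION pfLogByYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2005-01-01', '2006-01-01');

    CREATE PARTITION SCHEME psLogByYear
    AS PARTITION pfLogByYear TO (FG2004, FG2005, FG2006);

    CREATE TABLE dbo.WebLog (
        LogID   bigint       NOT NULL,
        LogTime datetime     NOT NULL,
        Url     varchar(500) NULL
    ) ON psLogByYear (LogTime);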
4. Extensive Indexing
For massive data processing, indexes must be created on the large tables. Creating an index requires considering the specific situation: for example, build indexes on the fields used for grouping and sorting in large tables, and consider composite indexes as well, but be careful with indexes on tables that receive frequent inserts. In one ETL process I worked on, when loading a table I first dropped the indexes, then inserted the data, rebuilt the indexes, and ran the aggregation; before the next load the indexes were dropped again. So indexes must be created at the right time, and the fill factor and the choice between clustered and non-clustered indexes also have to be considered.
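A minimal sketch of that load pattern (index, table, and column names are illustrative, and the fill factor is a tuning assumption, not a fixed rule):

    -- Drop the index before the bulk load so inserts are not slowed down.
    DROP INDEX dbo.FactLog.IX_FactLog_LogTime;

    -- ... the bulk insert / ETL load into dbo.FactLog runs here ...

    -- Rebuild the index after the load, then run the aggregation against it.
    CREATE NONCLUSTERED INDEX IX_FactLog_LogTime
        ON dbo.FactLog (LogTime)
        WITH FILLFACTOR = 90;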
5. Establish a cache mechanism
When the data volume increases, the processing tool must take caching into account. The cache size setting also has a bearing on the success or failure of the processing. For example, when I ran an aggregation over 200 million rows of data, I set the cache to 100,000 records per buffer, which proved feasible for data at that scale.
6. Increase the virtual memory
If system resources are limited and memory is insufficient, virtual memory can be increased. In one project I had to process 1.8 billion rows of data on a machine with 1 GB of memory and a single P4 2.4 GHz CPU. Aggregating that much data kept running out of memory, so I enlarged the virtual memory: a 4096 MB page file was created on each of six disk partitions, raising the virtual memory to 4096 * 6 + 1024 = 25600 MB, which solved the out-of-memory problem.
7. Batch Processing
Massive data processing is difficult because of the sheer volume, so one of the key techniques is to reduce the amount of data handled at once. Processing the data in batches and then merging the results turns the problem into a series of small-data problems and avoids the issues a large volume would cause. This approach does depend on the actual situation, though: if the data cannot be split, another method is required. But data stored by day, month, or year can naturally be processed piece by piece and the results combined afterwards.
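A minimal sketch of month-by-month batching in T-SQL (table and column names are illustrative):

    -- Aggregate one month at a time and merge the results into a summary table.
    DECLARE @m datetime;
    SET @m = '2006-01-01';
    WHILE @m < '2007-01-01'
    BEGIN
        INSERT INTO dbo.MonthlyUrlHits (MonthStart, Url, Hits)
        SELECT @m, Url, COUNT(*)
        FROM dbo.WebLog
        WHERE LogTime >= @m AND LogTime < DATEADD(month, 1, @m)
        GROUP BY Url;

        SET @m = DATEADD(month, 1, @m);
    END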
8. Use temporary tables and intermediate tables
When the data volume increases, consider summarizing the data step by step. The point is to shrink a large table into smaller ones: process it in pieces, then merge the pieces according to some rule. Using temporary tables and preserving intermediate results during processing is very important. If a large table cannot be processed in one pass because of its size, it can only be split into several smaller tables; and if several rounds of summarization are needed, proceed summary by summary, step by step, rather than trying to get fat in one bite with a single giant statement.
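For example, a two-step summary through a temporary table might look like this (all names are illustrative):

    -- Step 1: shrink the big table into a small daily intermediate result.
    SELECT CONVERT(char(10), LogTime, 120) AS LogDay, Url, COUNT(*) AS Hits
    INTO #DailyHits
    FROM dbo.WebLog
    GROUP BY CONVERT(char(10), LogTime, 120), Url;

    -- Step 2: the final roll-up runs against the much smaller temporary table.
    INSERT INTO dbo.UrlSummary (Url, TotalHits)
    SELECT Url, SUM(Hits)
    FROM #DailyHits
    GROUP BY Url;

    DROP TABLE #DailyHits;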
9. Optimize Query SQL statements
When querying massive data, the performance of the SQL statements has a very large impact on query efficiency. Writing efficient SQL scripts and stored procedures is the responsibility of database staff and a yardstick of their skill. When writing SQL, reduce joins where possible, design an efficient table structure, and use few or no cursors. In my own work I once tried a cursor over 100 million rows of data; after three hours there was still no result, and the task had to be handed over to a program instead.
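As a sketch of the idea, a row-by-row cursor can usually be replaced by a single set-based statement (names are illustrative):

    -- One set-based aggregate instead of fetching 100 million rows through a cursor.
    SELECT UserID, COUNT(*) AS Visits, MAX(LogTime) AS LastVisit
    FROM dbo.WebLog
    GROUP BY UserID;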
10. Process in text format
Ordinary data processing can be done in the database, but when complex processing has to be done by a program, and the choice is between the program operating on the database and the program operating on text files, choose text. The reasons: text operations are fast, text processing is less error-prone, and text storage has no size restrictions. For example, massive network logs usually arrive in text or CSV format; if the processing involves data cleansing, it is not advisable to import them into the database first and clean them there.
11. Customize powerful cleansing rules and error handling mechanisms
Massive data is never fully consistent, and flaws are very likely to appear. For example, a time field in some records may hold a non-standard time, caused by an application error or a system error. Strong data cleansing rules and error handling mechanisms must therefore be established.
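A minimal sketch of such a rule, assuming the raw rows sit in a staging table with the time still stored as text (all names are illustrative):

    -- Divert rows with an unparseable time into an error table instead of failing the load.
    INSERT INTO dbo.WebLog_Errors (RawTime, Url, ErrorReason)
    SELECT LogTimeText, Url, 'invalid time format'
    FROM dbo.WebLog_Staging
    WHERE ISDATE(LogTimeText) = 0;

    -- Load only the rows that pass the cleansing rule.
    INSERT INTO dbo.WebLog (LogTime, Url)
    SELECT CONVERT(datetime, LogTimeText), Url
    FROM dbo.WebLog_Staging
    WHERE ISDATE(LogTimeText) = 1;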
12. Create a view or Materialized View
The data in a view comes from its base tables. For massive data, the data can be distributed across several base tables according to certain rules, and queries or processing can then run against a view over them. This spreads the disk I/O, just like the difference between hanging a column from ten ropes and hanging it from one.
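A minimal sketch of such a view over yearly base tables (table names are illustrative; each base table is assumed to sit on its own filegroup or disk):

    -- Queries go through the view; the data is physically spread over the base tables.
    CREATE VIEW dbo.WebLogAll
    AS
    SELECT LogID, LogTime, Url FROM dbo.WebLog2004
    UNION ALL
    SELECT LogID, LogTime, Url FROM dbo.WebLog2005
    UNION ALL
    SELECT LogID, LogTime, Url FROM dbo.WebLog2006;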
13. Avoid using 32-bit hosts (in extreme cases)
Many computers today are 32-bit, so the memory a program can use is limited, while massive data processing often consumes a great deal of memory. This calls for better-performing hardware, and the word size of the machine matters a great deal.
14. Operating System Problems
In massive data processing, besides the high demands on the database and the processing programs, the operating system also occupies an important position. A server operating system is generally required, and the demands on system security and stability are relatively high. In particular, the operating system's own caching mechanism and its handling of temporary space need to be considered as a whole.
15. Data Warehouse and multi-dimensional database storage
As data volumes increase, OLAP must be considered. A traditional report may take five or six hours to produce results, while a cube-based query may take only a few minutes. The powerful tool for processing massive data is therefore OLAP multi-dimensional analysis: build a data warehouse, build multi-dimensional datasets (cubes), and do report presentation and data mining on top of the cubes.
16. Use sampled data for Data Mining
Data mining on massive data is gradually emerging. Facing such huge volumes, general mining software and algorithms often work on a sample of the data. The resulting error is not large, and sampling greatly improves processing efficiency and the success rate. During sampling, pay attention to the completeness and representativeness of the data to avoid excessive bias. I once sampled a table of 100 million rows, extracting 4 million of them; the error measured by the software was 5‰, which was acceptable to the customer.
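In SQL Server 2005, for instance, such a sample can be pulled directly with TABLESAMPLE (the table name and percentage here are only an illustration, not the tool the original sampling was done with):

    -- Extract roughly 4% of the rows for mining instead of scanning the whole table.
    SELECT *
    FROM dbo.WebLog TABLESAMPLE (4 PERCENT);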
There are also methods that apply only in particular situations and scenarios, such as using surrogate keys. Their advantage is faster aggregation, because aggregating on a numeric key is much faster than aggregating on a character key. Similar choices need to be made according to the actual requirements.
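As a sketch of the surrogate-key idea (all names are illustrative): the long character key is mapped to an integer once, and the heavy grouping then runs on the integer column.

    -- Map each distinct character key to a small integer surrogate key.
    CREATE TABLE dbo.UrlDim (
        UrlID int IDENTITY(1,1) PRIMARY KEY,
        Url   varchar(500) NOT NULL UNIQUE
    );
    INSERT INTO dbo.UrlDim (Url)
    SELECT DISTINCT Url FROM dbo.WebLog;

    -- The aggregation now groups on an int instead of a long varchar.
    SELECT d.UrlID, COUNT(*) AS Hits
    FROM dbo.WebLog w
    JOIN dbo.UrlDim d ON d.Url = w.Url
    GROUP BY d.UrlID;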
Massive data is a development trend and is increasingly important for data analysis and mining. Extracting useful information from massive data is important and urgent: it requires accurate handling and high precision, and the processing time should be short so that valuable information can be obtained quickly. Research on massive data is therefore promising and worth pursuing broadly and deeply.
From: http://hi.baidu.com/justinzcs/blog/item/9cd44223bece7f4d9358077a.html