Massive Data Processing and Analysis (from the Development Perspective)
In my practical work I have had occasion to deal with massive data processing problems. Processing them is an arduous and complex task, for the following reasons:

1. The data volume is too large, and anything may turn up in the data. With 10 records, checking each one by hand is no great effort; with a few hundred you can still manage. But when the data reaches tens of millions, or even hundreds of millions, of records, it cannot be handled manually and must be processed with tools or programs. And with massive data anything can happen: for example, a record with a broken format that the program has otherwise been handling normally suddenly causes a failure somewhere and the program stops.

2. The hardware and software requirements are high, and system resource usage is heavy. Besides a good processing method, the most important thing is to use tools rationally and allocate system resources rationally. In general, once the data to be processed exceeds the TB level, a minicomputer should be considered. Ordinary machines can be used if there is a good method, but CPU and memory must still be increased; otherwise it is like facing thousands of troops with no soldiers of your own, and it is difficult to win.

3. The demands on processing methods and skills are high. This is also the purpose of writing this article. A good solution is the accumulation of an engineer's long-term work experience and a summary of personal experience. There is no universal processing method, but there are general principles and rules.

So what experience and skills are there for dealing with massive data? I list what I know for your reference.

I. Use excellent database tools. Massive data processing places high demands on the database tools used. Oracle or DB2 is generally chosen, and Microsoft's recently released SQL Server 2005 also performs well. In addition, in the BI field, databases, data warehouses, multi-dimensional databases, and data mining tools must be selected; ETL tools and OLAP tools such as Informatica and Essbase are essential. In one actual data analysis project, processing 60 million log entries per day took six hours with SQL Server 2000 and three hours with SQL Server 2005.

II. Write excellent program code. Data processing is inseparable from excellent program code, especially when complicated processing is required. Good code matters not only for the accuracy of the processing but also for its efficiency. Good program code should contain a good algorithm, a good processing flow, good efficiency, and a good exception-handling mechanism.

III. Partition the massive data. Partitioning is necessary for massive data. For example, data accessed by year can be partitioned by year. Different databases have different partitioning methods, but the processing mechanism is basically the same. For example, SQL Server partitions store different data in different filegroups, and different filegroups can be placed on different disk partitions, so that the data is dispersed, disk I/O is reduced, and system load is reduced; logs and indexes can also be stored in different partitions. A sketch of this follows below.
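To make point III concrete, here is a minimal T-SQL sketch of year-based partitioning in SQL Server 2005. The filegroup, table, and column names are illustrative assumptions, not taken from the article.

```sql
-- Assumption: filegroups FG2004, FG2005, FG2006 already exist in the database,
-- each placed on a different physical disk to spread I/O.

-- Partition function: split rows at year boundaries of the log date.
CREATE PARTITION FUNCTION pfLogYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('20050101', '20060101');

-- Partition scheme: map each year range to its own filegroup.
CREATE PARTITION SCHEME psLogYear
AS PARTITION pfLogYear TO (FG2004, FG2005, FG2006);

-- Create the large log table on the partition scheme;
-- rows are routed to a filegroup by LogDate automatically.
CREATE TABLE dbo.WebLog
(
    LogID   BIGINT       NOT NULL,
    LogDate DATETIME     NOT NULL,
    UserIP  VARCHAR(15)  NOT NULL,
    Url     VARCHAR(400) NOT NULL
) ON psLogYear (LogDate);
```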
IV. Create indexes on large tables. Indexes must be created according to the specific situation: for the grouping and sorting fields of a large table, create the corresponding indexes, and composite indexes can also be created. Be careful with indexes on tables that receive frequent inserts. When I handled data in an ETL flow, before inserting into a table the index was dropped first, then the insert was performed, then the index was recreated and the aggregation was run; after the aggregation was complete the index was dropped again before the next round of inserts. So the timing of index creation must be chosen well, and the index fill factor and the choice between clustered and non-clustered indexes must also be considered (a T-SQL sketch of this pattern is given after point VIII).

V. Set up the cache well. As the data volume increases, the cache of the processing tool must be considered. The cache size setting is also related to the success or failure of the processing. For example, when the author ran an aggregation over 200 million rows of data, the cache was set to 100,000 records per buffer, which was feasible for a data volume at that level.

VI. Increase the virtual memory. If system resources are limited and memory is insufficient, virtual memory can be increased. In one of my projects I had to process 1.8 billion rows of data with 1 GB of memory and one P4 2.4 GHz CPU. Aggregating that much data was problematic, and the system kept reporting insufficient memory, so the virtual memory was increased: six 4096 MB paging files were created on six disk partitions, raising the available memory to 4096 * 6 + 1024 = 25600 MB, which solved the out-of-memory problem.

VII. Process in batches. Massive data is hard to process because of the sheer volume, so one of the skills for solving the problem is to reduce the amount of data handled at a time. Processing the data batch by batch and then merging the processed results lets each step work on a small volume and avoids the problems caused by large volumes. However, this method must fit the actual situation: if the data cannot be split, another method is required. Data stored by day, by month, or by year can naturally be processed piece by piece and then combined (see the batching sketch after point VIII).

VIII. Use temporary tables and intermediate tables. As the data volume increases, consider summarizing in advance. The purpose is to shrink a large table into smaller tables, and after the piecewise processing is complete, merge the results according to certain rules. Using temporary tables and preserving intermediate results during processing is very important: if a large table cannot be processed in one pass, it can only be split into several smaller tables. If multiple summary operations are needed during processing, follow the summary steps one at a time rather than trying to finish everything in a single statement; you cannot become fat on one mouthful (a sketch follows below).
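Returning to point IV, here is a minimal sketch of the drop-index / insert / recreate-index / aggregate pattern described there; the table, column, and index names are assumptions for illustration.

```sql
-- 1. Drop the index before the bulk insert so every inserted row does not
--    pay the cost of maintaining the index.
DROP INDEX IX_FactSales_SaleDate ON dbo.FactSales;

-- 2. Load the new batch (the staging table is a placeholder).
INSERT INTO dbo.FactSales (SaleDate, StoreID, Amount)
SELECT SaleDate, StoreID, Amount
FROM   dbo.StagingSales;

-- 3. Recreate the index so the aggregation can use it; a lower fill factor
--    leaves room for later inserts.
CREATE NONCLUSTERED INDEX IX_FactSales_SaleDate
ON dbo.FactSales (SaleDate, StoreID)
WITH (FILLFACTOR = 80);

-- 4. Run the aggregation that relies on the grouping columns.
SELECT SaleDate, StoreID, SUM(Amount) AS TotalAmount
FROM   dbo.FactSales
GROUP BY SaleDate, StoreID;
```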
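For point VII, a sketch of processing month by month and merging the results into a summary table; the tables and the date window are assumed for illustration.

```sql
-- Process one month at a time and append each month's aggregate to a
-- result table, instead of aggregating the whole table in one pass.
DECLARE @MonthStart DATETIME, @MonthEnd DATETIME;
SET @MonthStart = '20060101';            -- assumed first month of the data
WHILE @MonthStart < '20070101'           -- assumed end of the data
BEGIN
    SET @MonthEnd = DATEADD(MONTH, 1, @MonthStart);

    INSERT INTO dbo.MonthlySummary (MonthStart, StoreID, TotalAmount)
    SELECT @MonthStart, StoreID, SUM(Amount)
    FROM   dbo.FactSales
    WHERE  SaleDate >= @MonthStart AND SaleDate < @MonthEnd
    GROUP BY StoreID;

    SET @MonthStart = @MonthEnd;         -- move on to the next batch
END;
```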
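For point VIII, a sketch of summarizing step by step through a temporary table instead of one giant statement; all names are assumptions.

```sql
-- Step 1: first-level summary into a temporary table (daily totals).
SELECT CONVERT(CHAR(10), SaleDate, 120) AS SaleDay,
       StoreID,
       SUM(Amount) AS DayAmount
INTO   #DailyTotals
FROM   dbo.FactSales
GROUP BY CONVERT(CHAR(10), SaleDate, 120), StoreID;

-- Step 2: second-level summary from the much smaller intermediate result.
SELECT StoreID, SUM(DayAmount) AS TotalAmount
FROM   #DailyTotals
GROUP BY StoreID;

-- Step 3: drop the intermediate result when it is no longer needed.
DROP TABLE #DailyTotals;
```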
IX. Optimize the SQL query statements. When querying massive data, the performance of the SQL statements has a very large impact on query efficiency. Writing efficient SQL scripts and stored procedures is the responsibility of database staff and a standard for testing their level. When writing SQL statements, for example, reduce joins and use few or no cursors, and an efficient database table structure must be designed. In my own work I tried to use a cursor over 100 million rows of data, and after three hours of running there was still no result; that job had to be handled by a program instead (a set-based sketch is given after point XVI).

X. Operate on text with programs where appropriate. General data processing can be done with a database, but when complex data must be processed by a program, then between having the program operate on the database and having it operate on text files, operating on text should be chosen: programs process text quickly, text processing is less prone to errors, and text storage is unrestricted. For example, massive network logs are usually in text or CSV (text) format; if data cleansing is involved in processing such logs, it is not recommended to import them into the database first and clean them there.

XI. Establish strong cleansing rules and error-handling mechanisms. Massive data contains inconsistencies and is very likely to contain flaws of some kind. For example, a time field in the data may hold a non-standard time value, caused by an application error or a system error, so strong data cleansing rules and error-handling mechanisms must be established (see the cleansing sketch after point XVI).

XII. Build views or materialized views. The data in a view comes from base tables. For massive data, the data can be distributed across several base tables according to certain rules, and queries or processing can then be performed against the view. This disperses the disk I/O, like the difference between hoisting a pillar with ten ropes and hoisting it with one (see the view sketch after point XVI).

XIII. Avoid using 32-bit machines (in extreme cases). Many current computers are 32-bit, so the memory a compiled program can address is limited, while massive data processing often consumes a large amount of memory. This demands better hardware, and in extreme cases the word-size limitation is very important.

XIV. Consider the operating system. Besides the high requirements on the database and the processing programs, the requirements on the operating system also matter. A server-class operating system is generally needed, and the requirements for system security and stability are relatively high. In particular, the operating system's own cache mechanism and its handling of temporary space need to be considered comprehensively.

XV. Use data warehouses and multi-dimensional databases. When storing and analyzing larger volumes of data, OLAP must be considered. Traditional reports may take 5 or 6 hours to produce results, while cube-based queries may take only a few minutes, so the powerful tool for processing massive data is OLAP multi-dimensional analysis: build a data warehouse, build cubes, and do report presentation and data mining on the cubes.

XVI. Sample the data. Data mining on massive data is gradually emerging, and mining software or algorithms generally process the data by sampling. The resulting error is not high, while the processing efficiency and the success rate are greatly improved. Pay attention to the completeness and representativeness of the sample to prevent excessive deviation. The author once sampled a table of over 100 million rows and extracted 4 million rows; the error measured by the software was 5‰, which was acceptable to the customer (a sampling sketch follows below).

There are also other methods that need to be applied in different situations and scenarios, such as using surrogate keys: the benefit is that aggregation is faster, because aggregating numeric values is much faster than aggregating character values. Similar choices need to be made according to the actual requirements.
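For point IX, a sketch contrasting a row-by-row cursor with an equivalent set-based statement; the tables are assumptions, and the set-based form presumes one summary row per customer.

```sql
-- Cursor version (slow on very large tables): accumulate each customer's
-- order amounts one row at a time.
DECLARE @CustomerID INT, @Amount MONEY;
DECLARE cur CURSOR FOR SELECT CustomerID, Amount FROM dbo.Orders;
OPEN cur;
FETCH NEXT FROM cur INTO @CustomerID, @Amount;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.CustomerTotals
    SET    Total = Total + @Amount
    WHERE  CustomerID = @CustomerID;
    FETCH NEXT FROM cur INTO @CustomerID, @Amount;
END;
CLOSE cur;
DEALLOCATE cur;

-- Set-based version: aggregate once, then apply in a single UPDATE.
UPDATE t
SET    t.Total = t.Total + o.OrderSum
FROM   dbo.CustomerTotals AS t
JOIN  (SELECT CustomerID, SUM(Amount) AS OrderSum
       FROM   dbo.Orders
       GROUP BY CustomerID) AS o
       ON o.CustomerID = t.CustomerID;
```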
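For point XI, a sketch of one cleansing rule for non-standard time values, routing bad rows to an error table instead of letting them abort the load; the tables are assumptions, and ISDATE is available in SQL Server 2000/2005.

```sql
-- Move rows whose time field cannot be parsed into an error table
-- so they can be inspected later instead of stopping the whole load.
INSERT INTO dbo.WebLogErrors (LogID, RawLogTime, ErrorReason)
SELECT LogID, RawLogTime, 'Non-standard time value'
FROM   dbo.WebLogStaging
WHERE  ISDATE(RawLogTime) = 0;

-- Load only the clean rows, converting the text time to DATETIME.
INSERT INTO dbo.WebLog (LogID, LogDate, UserIP, Url)
SELECT LogID, CONVERT(DATETIME, RawLogTime), UserIP, Url
FROM   dbo.WebLogStaging
WHERE  ISDATE(RawLogTime) = 1;
```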
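For point XII, a sketch of a view that unions base tables split by year, each of which can live on its own filegroup or disk; the names are assumptions.

```sql
-- Base tables dbo.WebLog2004, dbo.WebLog2005, dbo.WebLog2006 hold the data
-- split by year (each can be placed on a different filegroup/disk).
CREATE VIEW dbo.vWebLogAll
AS
SELECT LogID, LogDate, UserIP, Url FROM dbo.WebLog2004
UNION ALL
SELECT LogID, LogDate, UserIP, Url FROM dbo.WebLog2005
UNION ALL
SELECT LogID, LogDate, UserIP, Url FROM dbo.WebLog2006;
GO

-- Queries go against the view; each branch reads only its own base table.
SELECT COUNT(*) FROM dbo.vWebLogAll WHERE LogDate >= '20060101';
```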
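For point XVI, a sketch of pulling a sample from a very large table; TABLESAMPLE exists in SQL Server 2005, and the percentage, seed, and names are assumptions rather than the figures from the article.

```sql
-- Draw an approximate 4% page-level sample of the large table into a
-- working table, then mine/aggregate the sample instead of the full table.
SELECT LogID, LogDate, UserIP, Url
INTO   dbo.WebLogSample
FROM   dbo.WebLog TABLESAMPLE (4 PERCENT) REPEATABLE (20060101);

-- Quick check of how many rows the sample actually contains.
SELECT COUNT(*) AS SampleRows FROM dbo.WebLogSample;
```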
Massive data is a development trend, and data analysis and mining based on it are increasingly important. Extracting useful information from massive data is important and urgent: it requires accurate processing and high precision, as well as short processing times, so that valuable information can be obtained quickly. Research on massive data is therefore promising and worthy of extensive and in-depth study. Original article: http://blog.csdn.net/DaiZiLiang/article/details/1432193