Massive Data Processing and Analysis

In my practical work at Beijing myisch Technology Co., Ltd., I (Dai Ziliang) have been fortunate to work on massive data processing problems. Dealing with them is an arduous and complex task, for the following reasons:

1. The data volume is so large that anything may appear in the data. If there are ten records, it is no big deal to check each one by hand; with a few hundred you can still manage; but when the data reaches tens of millions or even hundreds of millions of rows, it cannot be handled manually and must be processed by tools or programs. Massive data also magnifies rare problems: a format error that a program would normally tolerate can suddenly appear somewhere in the data and terminate the whole run.

2. The hardware, software, and system-resource requirements are high. Besides a good processing method, the most important thing is to use tools rationally and to allocate system resources sensibly. In general, if the data to be processed exceeds the TB level, a minicomputer should be considered; an ordinary machine can cope if the method is good, but the CPU and memory must still be increased. Otherwise it is like facing thousands of troops and horses without a single soldier: it is hard to win.

3. The demands on methods and skills are high. This is also the purpose of this article. A good solution is the accumulation of an engineer's long-term work experience and a summary of personal practice. There is no universal processing method, but there are general principles and rules.

So what experience and skills are available for dealing with massive data? I list what I know for your reference:

1. Use excellent database tools. There are many database tool vendors today, and massive data processing places high demands on the tools used. Oracle or DB2 is generally used, and SQL Server 2005, recently released by Microsoft, also performs well. In addition, in the BI field, tools for databases, data warehouses, multidimensional databases, and data mining must be selected; ETL tools and OLAP tools such as Informatica and Essbase are essential. In one actual data analysis project, processing 60 million log records per day took six hours with SQL Server 2000 and three hours with SQL Server 2005.

2. Write excellent program code. Processing data is inseparable from excellent program code, especially when the processing is complicated. Good code matters not only for the correctness of the result but also for the efficiency of the processing. Good program code should contain good algorithms, a good processing flow, good efficiency, and a good exception-handling mechanism.

3. Partition massive data. Partitioning is necessary; for example, data can be partitioned by the year in which it was generated. Different databases have different partitioning syntax, but the mechanism is basically the same. SQL Server, for example, stores different partitions in different filegroups, and different filegroups on different disk partitions, so that the data is spread out, disk I/O is reduced, and the system load is lowered; logs and indexes can also be stored on separate partitions. A minimal partitioning sketch is shown below.
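As an illustration of point 3, here is a minimal sketch of range partitioning by year in SQL Server 2005. The object names (LogByYearFn, LogByYearScheme, WebLog, FG2004 and so on) are hypothetical and not taken from the original project; the general pattern is a partition function, a partition scheme mapping ranges to filegroups, and a table created on that scheme.

```sql
-- Hypothetical example: range-partition a log table by year across filegroups.
-- Assumes filegroups FG2004, FG2005, FG2006 already exist in the database.
CREATE PARTITION FUNCTION LogByYearFn (DATETIME)
AS RANGE RIGHT FOR VALUES ('2005-01-01', '2006-01-01');

CREATE PARTITION SCHEME LogByYearScheme
AS PARTITION LogByYearFn TO (FG2004, FG2005, FG2006);

CREATE TABLE WebLog (
    LogId   BIGINT   NOT NULL,
    LogDate DATETIME NOT NULL,
    Url     VARCHAR(2000),
    Bytes   INT
) ON LogByYearScheme (LogDate);
```

With this layout, a query that filters on LogDate touches only the filegroups holding the relevant years, which is what disperses the disk I/O.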
4. Build indexes for massive data, but build them carefully. Large tables need indexes, created with the specific situation in mind: the fields used for grouping and sorting on a large table need corresponding indexes, and composite indexes can also be created, while tables that receive frequent inserts should be indexed with caution. When I handled data in an ETL process, the pattern was: before loading a table, drop its indexes, perform the insert, recreate the indexes, and then run the aggregation; after the aggregation is finished, drop the indexes again before the next load (see the sketch after point 8). So index creation must be timed well, and the fill factor and the choice between clustered and non-clustered indexes must also be considered.

5. Set up a cache mechanism. When the data volume increases, a general-purpose processing tool must consider caching, and the cache size setting is also related to the success or failure of the processing. For example, when I aggregated 0.2 billion rows, the buffer size was set to 100,000 records per buffer, which proved feasible for data at this level.

6. Increase the virtual memory. If system resources are limited and physical memory is insufficient, virtual memory can be increased. In one project I had to process 1.8 billion rows with 1 GB of memory and one P4 2.4 GHz CPU; aggregating that much data kept producing out-of-memory errors, so I created a 4096 MB paging file on each of six disk partitions, raising the virtual memory to 4096 * 6 + 1024 = 25,600 MB, which solved the problem.

7. Process in batches. Massive data is difficult precisely because of its volume, so one of the skills is to reduce the amount handled at a time: process the data batch by batch and then merge the results. Each step then deals with a manageable amount of data and avoids the problems caused by the full volume, although whether this is applicable depends on the actual situation; if the data cannot be split, another method is required. Data stored by day, month, or year, however, lends itself naturally to being processed piecewise and combined afterwards, as in the sketch after point 8.

8. Use temporary tables and intermediate tables. When the data volume increases, it should be summarized in stages during processing: a large table is reduced step by step into smaller tables, partial results are kept in temporary or intermediate tables, and after the piecewise processing is complete the partial results are merged by some rule. If a large table cannot be processed at its full size, it can only be split into several smaller tables; if several levels of summarization are needed, follow the summary steps one by one rather than trying to finish everything in a single statement and become a fat man in one bite. A combined sketch of points 4, 7, and 8 follows.
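As a rough illustration of points 4, 7, and 8 together, here is a hypothetical T-SQL sketch: the index on a staging table is dropped before the load, the load itself runs month by month, the index is rebuilt, and the result is summarized into an intermediate table. The table and index names (SourceLog, StageLog, MonthlySummary, IX_StageLog_Date) are invented for the example; the article does not describe the original schema.

```sql
-- Hypothetical batch-load pattern: drop the index, load month by month,
-- rebuild the index, then aggregate into an intermediate summary table.
DROP INDEX IX_StageLog_Date ON StageLog;       -- point 4: no index during the load

DECLARE @m INT;
SET @m = 1;
WHILE @m <= 12                                  -- point 7: one month per batch
BEGIN
    INSERT INTO StageLog (LogDate, Url, Bytes)
    SELECT LogDate, Url, Bytes
    FROM   SourceLog
    WHERE  MONTH(LogDate) = @m;
    SET @m = @m + 1;
END;

CREATE INDEX IX_StageLog_Date ON StageLog (LogDate);  -- rebuild before aggregating

-- point 8: keep the partial result in an intermediate table
-- (assumes LogDate holds dates with no time part)
SELECT LogDate,
       COUNT(*)                  AS Hits,
       SUM(CAST(Bytes AS BIGINT)) AS TotalBytes
INTO   MonthlySummary
FROM   StageLog
GROUP  BY LogDate;
```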
9. Optimize SQL queries. When querying massive data, the performance of the SQL statement has a huge impact on query efficiency; writing efficient SQL scripts and stored procedures is the responsibility of the database staff and a standard for testing their skill. When writing SQL statements, for example, reduce joins, use few or no cursors, and design an efficient table structure to begin with. In my own work I tried to use a cursor over 0.1 billion rows of data, and after three hours of running there was still no result; the job had to be redone programmatically. A set-based alternative to a cursor is sketched after point 16.

10. Use a program to operate on text files when appropriate. Data in ordinary formats can be handled in the database, but if complex data must be handled by a program, then when choosing between having the program operate on the database and having it operate on text, choose text: programs operate on text quickly, text processing is less error-prone, and text storage is not limited in size. For example, a large volume of network logs usually arrives in text or CSV format; if data cleansing is involved, it is not recommended to import the logs into the database first and clean them there.

11. Define strong cleaning rules and error-handling mechanisms. Inconsistencies exist in massive data, and some defects are highly likely. For example, some time fields in the same data set may contain non-standard times, caused by application errors or system errors, so strong data-cleansing rules and error-handling mechanisms must be built.

12. Use views or materialized views. The data in a view or materialized view comes from base tables. For massive data, the data can be distributed to several base tables according to some rule, and the query or processing can then be performed through the view. This disperses the disk I/O, just like the difference between hanging a pillar from ten ropes and hanging it from one.

13. Avoid 32-bit machines (in extreme cases). Many machines today are 32-bit, so the memory a compiled program can address is limited, while much massive data processing consumes a great deal of memory; the word-size limitation is therefore also very important.

14. Consider the operating system. Besides the high requirements on the database and the processing programs, the requirements on the operating system also matter: a server-class system is generally needed, and the requirements on its security and stability are high. In particular, the operating system's own cache mechanism and its handling of temporary space need to be considered comprehensively.

15. Use data warehouses and multidimensional databases. As the volume of stored data grows, OLAP must be taken into account. A traditional report may take 5 or 6 hours to produce a result, while a cube-based query may take only a few minutes, so OLAP multidimensional analysis is a powerful tool for massive data: build a data warehouse, build cubes on it, and perform report presentation and data mining based on the cubes.

16. Use sampled data for data mining. Data mining on massive data is gradually emerging, and when facing such volumes, mining software or algorithms generally work on a data sample. The resulting error is not very high, and the processing efficiency and success rate are greatly improved. Pay attention to the representativeness and integrity of the sample to avoid excessive deviation. I once sampled 4 million rows from a table of 0.1 billion rows, and the error measured by the software was 5‰, which the customer found acceptable. A simple sampling sketch is also shown below.
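For point 9, a set-based rewrite is usually the cure for a slow cursor. The sketch below is hypothetical (the article does not show the original cursor or schema; WebLog, UrlRule, and their columns are invented): it replaces a row-by-row cursor update with a single joined UPDATE that the engine can execute as one set operation.

```sql
-- Hypothetical: instead of opening a cursor over WebLog and updating
-- one row at a time, classify all rows in a single set-based statement.
-- (If several rules match the same URL, which one wins is undefined.)
UPDATE w
SET    w.Category = r.Category
FROM   WebLog AS w
JOIN   UrlRule AS r
  ON   w.Url LIKE r.UrlPattern;
```

And for point 16, SQL Server 2005 offers TABLESAMPLE for drawing an approximate sample. The 4 percent figure below simply mirrors the 4-million-out-of-0.1-billion ratio mentioned above, and the table names are again hypothetical.

```sql
-- Hypothetical: pull roughly a 4 % sample of the fact table for mining.
-- TABLESAMPLE samples data pages, not individual rows, so it is fast
-- but only approximately random.
SELECT *
INTO   WebLogSample
FROM   WebLog TABLESAMPLE (4 PERCENT);
```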
There are also other methods that have to be used in particular situations and scenarios, such as using surrogate keys: the advantage is that aggregation becomes faster, because aggregating over numeric keys is much faster than aggregating over character keys. Similar choices need to be made according to the actual requirements.

Massive data is a development trend, and it is increasingly important for data analysis and mining. Extracting useful information from massive data is important and urgent; it requires accurate processing with high precision, and the processing time should be short so that valuable information can be obtained quickly. The study of massive data is therefore promising and worth broad and in-depth research.

Source: http://blog.csdn.net/DaiZiLiang/archive/2006/12/06/1432193.aspx
