This article is reposted from http://www.cnblogs.com/lovexinsky/archive/2012/03/09/2387583.html. Thanks to the original author.
In real working environments, many people run into the complex and arduous problem of massive data. Its main difficulties are as follows:
1. The data volume is too large, and any anomaly may be present in the data.
If there are only ten records, it is enough to inspect each one by hand. With a few hundred you might still manage. But once the data runs to tens of millions or even billions, manual handling is out of the question, and it must be done with tools or programs. Moreover, in massive data any situation can occur: there may be problems with the format of the data, for instance, and a job that has been running normally can suddenly hit a malformed record and terminate.
2. High software and hardware requirements, and high system resource usage.
To handle massive data, besides good methods, the most important things are the sensible use of tools and the sensible allocation of system resources. In general, if the data to be processed exceeds the TB level, a minicomputer should be considered; ordinary machines can work if there are good methods, but the CPU and memory must be upgraded. As when facing an army, the courage of a single soldier without strategy makes victory difficult.
3. High demands on processing methods and skills.
This is the purpose of this article. Good processing methods are the accumulation of an engineer's long-term working experience, and also a summary of personal experience. There is no universal method, but there are general principles and rules.
Below is a detailed description of the experience and techniques of handling massive data:
1. Use excellent database tools
There are many database tool vendors, and the choice of database tools matters a great deal when processing massive data. Oracle or DB2 is generally used, and the recently released Microsoft SQL Server 2005 also performs well. In addition, in the BI field there are databases, data warehouses, multidimensional databases, data mining and other related tools to choose from; good ETL tools and good OLAP tools, such as Informatica and Essbase, are very necessary. In an actual data analysis project processing 60 million log records per day, SQL Server 2000 took 6 hours, while SQL Server 2005 took only 3 hours.
2. Write excellent program code
Processing data is inseparable from good program code, especially in complex data processing, where a program is a must. Good program code is very important for data processing: it affects not only the accuracy of the processing but also its efficiency. Good program code should contain a good algorithm, a good processing flow, good efficiency, and a good exception-handling mechanism.
3. Partition the massive data
It is necessary to partition large amounts of data. For example, data accessed by year can be partitioned by year. Different databases have different partitioning methods, but the processing mechanisms are basically the same. For instance, SQL Server partitions a database by storing different data in different filegroups, and different filegroups on different disk partitions, so that the data is dispersed, disk I/O is reduced, and the system load is lowered; logs, indexes, and so on can also be placed on different partitions.
4. Build extensive indexes
For massive data processing, indexing large tables is a must. Building indexes must take the specific situation into account: for fields used in grouping, sorting and so on in a large table, build the corresponding indexes, and composite indexes can generally be created. Be careful with indexes on tables that receive frequent inserts. In one ETL flow the author, when processing data, first dropped the index before inserting into the table, performed the insert, rebuilt the index, and then ran the aggregation; once the aggregation completed, the next batch of inserts was handled the same way. So the timing of index creation matters, and the index fill factor and the choice between clustered and nonclustered indexes must also be considered.
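The drop-index / bulk-insert / rebuild-index pattern described above can be sketched as follows. This is a hypothetical illustration using Python's built-in sqlite3 as a stand-in for SQL Server; the table and index names are invented for the example.

```python
import sqlite3

# Sketch of the ETL pattern: drop the index before a bulk insert, then
# rebuild it once, instead of paying per-row index maintenance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (day TEXT, hits INTEGER)")
conn.execute("CREATE INDEX idx_logs_day ON logs (day)")

rows = [("2012-03-%02d" % (i % 28 + 1), i) for i in range(100_000)]

conn.execute("DROP INDEX idx_logs_day")                  # remove before load
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows) # bulk insert
conn.execute("CREATE INDEX idx_logs_day ON logs (day)")  # one bulk index build
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

The same idea applies to any database whose indexes are expensive to maintain during mass inserts; only the DDL syntax differs.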
5. Establish a caching mechanism
When the data volume increases, general processing tools must take caching into account. The cache size setting is also related to the success or failure of data processing. For example, while aggregating 200 million rows, the author set the cache to 100,000 records per buffer, which is feasible for this level of data volume.
6. Increase virtual memory
If system resources are limited and memory is reported as insufficient, the problem can be solved by adding virtual memory. In an actual project the author had to process 1.8 billion rows of data on a machine with 1 GB of memory and a single 2.4 GHz Pentium 4 CPU. Aggregating such a large amount of data was problematic and memory kept running out, so virtual memory was increased: a 4096 MB paging file was created on each of 6 disk partitions, raising virtual memory to 4096*6 + 1024 = 25600 MB and solving the out-of-memory problem during data processing.
7. Process in batches
Massive data processing is difficult because of the sheer volume, so one technique for solving the problem is to reduce the amount of data handled at once. Massive data can be processed in batches and the processed results then merged; tackling it piece by piece favors small data volumes rather than facing the large volume head on. This method has a caveat, however: if the data cannot be split, another way must be found. Still, data is generally stored by day, by month, by year and so on, so the divide-first-then-merge approach can usually be applied to process the data separately.
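The divide-first-then-merge idea can be sketched as follows: aggregate each fixed-size batch separately, then merge the partial results, so that only one small batch is in memory at a time. The function name and batch size here are illustrative, not from the original article.

```python
from collections import defaultdict
from itertools import islice

# Sketch of batched processing: split a record stream into batches,
# aggregate each batch, then merge the per-batch partial results.
def batch_aggregate(records, batch_size=1000):
    it = iter(records)
    total = defaultdict(int)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        partial = defaultdict(int)            # small per-batch aggregation
        for key, value in batch:
            partial[key] += value
        for key, value in partial.items():    # merge into the running total
            total[key] += value
    return dict(total)

result = batch_aggregate((("day%d" % (i % 3), 1) for i in range(9)), batch_size=4)
```

With data already stored by day or by month, each batch can simply be one day's or one month's file.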
8. Use temporary tables and intermediate tables
When the data volume increases, rollups during processing should be considered in advance. The purpose is to break the whole into pieces, turn large tables into small tables, process block by block, and then merge the results by some rule. The use of temporary tables during processing and the preservation of intermediate results are very important. If the data is so huge that a large table cannot be handled at all, the only way is to split it into a number of small tables. If a multi-step rollup is needed, summarize step by step following those steps, rather than trying to finish everything in one statement and swallow it all in one gulp.
9. Optimize query SQL statements
In querying massive data, the performance of the SQL statement has a significant impact on query efficiency. Writing highly efficient SQL scripts and stored procedures is the responsibility of database workers and a standard for verifying their level. When writing SQL statements, for example, reduce correlation and use cursors rarely or not at all; it is also necessary to design an efficient database table structure. In his own work the author once tried to process 100 million rows of data with a cursor: after 3 hours of running there was no result, and the job had to be rewritten as a program.
10. Use text format for processing
A database can be used for general data processing, but complex data processing requires a program, and when choosing between a program operating on the database and a program operating on text, choose the program operating on text. The reasons: programs process text quickly, text processing is not error-prone, and text storage is unrestricted, among others. For example, large volumes of web logs come as text or in CSV format (a text format); for the data cleaning involved, use a program to process them, and it is not recommended to import them into a database for cleaning.
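Cleaning a CSV-format log with a program rather than a database import can be sketched like this. The three-field layout (ip, method, path) is an invented example, not the article's actual log schema.

```python
import csv
import io

# Sketch: stream a CSV-format web log line by line, keep well-formed
# rows, and count the malformed ones instead of aborting on them.
raw_log = io.StringIO(
    "192.168.0.1,GET,/index.html\n"
    "bad line without commas\n"
    "192.168.0.2,POST,/login\n"
)

clean_rows, bad = [], 0
for row in csv.reader(raw_log):
    if len(row) == 3:              # expected fields: ip, method, path
        clean_rows.append(row)
    else:
        bad += 1                   # malformed record: count it and move on
```

In a real job the StringIO would be an open file, and the cleaned rows would be written to an output file for later loading.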
11. Establish strict cleaning rules and error-handling mechanisms
Massive data contains inconsistencies, and there is a high likelihood of flaws somewhere. For example, for the same time field, some records may hold non-standard times, caused by application errors, system errors, and so on. During data processing, strict cleaning rules and a robust error-handling mechanism must therefore be established.
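One such cleaning rule for an inconsistent time field might look like the sketch below: try a list of known formats in turn, and route anything unparseable to an error bucket rather than letting one bad record kill the whole run. The format list is an assumption for illustration.

```python
from datetime import datetime

# Sketch of a cleaning rule: attempt several known time formats, and
# return None for anything that matches none of them.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M", "%d-%m-%Y"]

def parse_time(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # caller decides: log it, skip it, or apply a default

good = parse_time("2012/03/09 10:30")
bad = parse_time("not a date")
```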
12. Create views or materialized views
The data in a view comes from the base tables. When processing massive data, the data can be scattered across base tables by certain rules, and queries or processing can then be based on views. This disperses the disk I/O, like the difference between hanging a pillar from ten ropes and hanging it from one.
13. Avoid 32-bit machines (in extreme cases)
At present many computers are 32-bit, which limits the memory a program can address, yet much massive data processing must consume a great deal of memory. This calls for machines with better performance, and the limit imposed by the word size is very important.
14. Consider operating system issues
In massive data processing, besides the relatively high demands on the database and the processing programs, the requirements on the operating system also occupy an important position. A server operating system is generally required, and the demands on system security and stability are relatively high. In particular, the operating system's own caching mechanism, temporary space handling and other issues need comprehensive consideration.
15. Use data warehouses and multidimensional database storage
As the data volume increases, OLAP must be considered. A traditional report may take 5 or 6 hours to produce a result, while a cube-based query may take only a few minutes, so the tool for processing massive data is OLAP multidimensional analysis: establish a data warehouse, build multidimensional datasets (cubes), and base report presentation and data mining on those cubes.
16. Use sampled data for data mining
Data mining on massive data is gradually on the rise. Faced with huge amounts of data, general mining software or algorithms often use sampling to handle it; the resulting error will not be very high, and it greatly improves processing efficiency and the processing success rate. When sampling, pay attention to the completeness of the data and guard against excessive bias. The author once sampled a table of 120 million rows, extracting 4 million rows; software testing put the processing error at 5 per thousand (0.5%), which the customers could accept.
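The sampling approach above can be sketched as follows: estimate a statistic on a random sample instead of scanning the full table. The population, sample size, and fixed seed are all illustrative choices, not figures from the article's project.

```python
import random

# Sketch of sampling-based processing: draw a random sample from a
# large population and estimate a statistic on it.
random.seed(42)                               # fixed seed for reproducibility
population = list(range(1_000_000))           # stand-in for a huge table
sample = random.sample(population, 10_000)    # 1% sample without replacement

true_mean = sum(population) / len(population)
est_mean = sum(sample) / len(sample)
rel_error = abs(est_mean - true_mean) / true_mean
```

For data too large to hold in memory, reservoir sampling achieves the same effect in one streaming pass.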
There are also methods that need to be used in different situations, such as surrogate keys, whose advantage is faster aggregation, because aggregation over numeric types is much faster than aggregation over character types. Similar situations need to be handled according to the different needs at hand.
Massive data is the development trend, and data analysis and mining are becoming more and more important. Extracting useful information from massive data is important and urgent; it requires accurate handling and high precision, with short processing times to obtain valuable information quickly, so research on massive data is promising and worthy of extensive, in-depth study.
Special Topic on Massive Data Processing (I): Opening
Problems involving large amounts of data frequently appear in interview written tests; companies such as Baidu, Google, and Tencent that deal with massive data often ask them.
The following methods are a general summary of how massive data can be handled. They may not completely cover every problem, but they can basically deal with the vast majority of problems encountered. Some of the questions below come directly from companies' written interview tests. The methods are not necessarily optimal; if you have a better way to handle them, you are welcome to discuss it with me.
Starting from the solutions to this type of problem, this post opens a series of topics on solving massive data problems, intended to cover the following aspects: Bloom filter, hashing, bit-map, heap, double-layer bucket partitioning, database index, inverted index, external sorting, trie tree, and MapReduce.
On top of these solutions, we will use some examples to analyze how massive data processing problems are solved.
Special Topic on Massive Data Processing (II): Bloom Filter
"What is a Bloom Filter"
A Bloom filter is a highly space-efficient randomized data structure that uses a bit array to represent a set succinctly and to determine whether an element belongs to the set. This efficiency has a price: when judging whether an element belongs to the set, elements that do not belong to it may be mistaken for members (false positives). Bloom filters are therefore not suitable for "zero error" applications. In applications that can tolerate a low error rate, however, a Bloom filter trades very few errors for great savings in storage space. A detailed introduction follows for readers who are not yet familiar with it.
"Scope of Application"
Can be used to implement a data dictionary, to de-duplicate data, or to compute set intersections.
"Fundamentals and Essentials"
The principle is simple: a bit array plus k independent hash functions. To insert an element, set the bits corresponding to its hash values to 1; to look it up, if the bits for all k hash functions are 1, report that the element exists. Obviously this process does not guarantee that the lookup result is 100% correct. It also does not support deleting an inserted keyword, because that keyword's bits may be shared with other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with a counter array and thereby supports deletion.
A more important question is how to determine the size m of the bit array and the number of hash functions from the number n of input elements. The error rate is minimized when the number of hash functions is k = (ln 2) * (m/n). When the error rate must not exceed e, m must be at least n*log2(1/e) to represent a set of any n elements. But m should be bigger still, because at least half of the bit array must remain 0; in that case m should be >= n*log2(1/e)*log2(e), roughly 1.44 times n*log2(1/e) (log2 denoting the base-2 logarithm).
For example, with an assumed error rate of 0.01, m should be roughly 13 times n, and k roughly 8. (Applying the 1.44*n*log2(1/e) formula directly gives about 9.6n and k of about 7; 13n is a figure with some safety margin.)
Note that m and n have different units: m is in bits, while n is a number of elements (more precisely, the number of distinct elements). The length of a single element is usually many bits, so using a Bloom filter usually saves memory.
A Bloom filter maps the elements of a set into a bit array, and uses whether all k bits (k being the number of hash functions) are 1 to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the array into a counter, thereby supporting element deletion. A spectral Bloom filter (SBF) associates the counters with the occurrence counts of set elements; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.
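A minimal Bloom filter following the formulas above can be sketched like this: size the bit array as m = 1.44 * n * log2(1/e) and use k = (ln 2) * (m/n) hash functions. Deriving the k hashes from salted MD5 digests is an implementation choice made here for the sketch, not part of the definition.

```python
import hashlib
import math

# Sketch of a Bloom filter sized from n expected elements and error rate e.
class BloomFilter:
    def __init__(self, n, error_rate):
        self.m = int(math.ceil(1.44 * n * math.log2(1 / error_rate)))
        self.k = max(1, int(round(math.log(2) * self.m / n)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):          # k "independent" hashes via salting
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set (may be a false positive)
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(n=1000, error_rate=0.01)
bf.add("http://example.com/a")
present = "http://example.com/a" in bf
```

Note there is no delete method: clearing a bit could remove other elements, which is exactly why the counting Bloom filter exists.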
Problem example: you are given two files A and B, each storing 5 billion URLs, with each URL occupying 64 bytes, and a memory limit of 4 GB. Find the URLs common to A and B. What if there are three, or even n, files?
Let us calculate the memory footprint for this problem: 4 GB = 2^32 bytes, which is about 34 billion bits. With n = 5 billion and a required error rate of 0.01, roughly 65 billion bits are needed. The 34 billion bits available fall somewhat short, which may cause the error rate to rise a little. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which makes the problem much simpler.
Special Topic on Massive Data Processing (III): Hash
"What is a Hash"
A hash takes input of arbitrary length (also called the pre-image) and, through a hash algorithm, transforms it into output of fixed length; the output is the hash value. This conversion is a compressing map: the space of hash values is usually much smaller than the input space, different inputs may hash to the same output, and the input value cannot be uniquely determined from the hash value. Simply put, it is a function that compresses a message of arbitrary length into a message digest of fixed length.
Hashing is mainly used in encryption algorithms in the information security field, which turn pieces of information of different lengths into scrambled 128-bit codes; these code values are called hash values. It can also be said that hashing finds a mapping between data content and the address where the data is stored.
Arrays are characterized by easy addressing but difficult insertion and deletion, while linked lists are characterized by difficult addressing but easy insertion and deletion. So can we combine the characteristics of both into a data structure that is easy to address and also easy to insert into and delete from? The answer is yes: it is the hash table we want to discuss. There are many different implementations of hash tables; here I will explain the most common one, the zipper (chaining) method, which we can understand as an "array of linked lists", as shown in the figure:
On the left is clearly an array; each member of the array includes a pointer to the head of a linked list, which of course may be empty or may hold many elements. We assign elements to different lists according to some of their characteristics, and by those same characteristics we find the right list and then the element within it.
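The "array of linked lists" just described can be sketched as below. Plain Python lists stand in for the linked lists, and the class and method names are invented for the example.

```python
# Sketch of the zipper (chaining) method: an array of buckets, where
# each bucket is a list standing in for a linked list of entries.
class ChainedHashTable:
    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # characteristic -> subscript

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                       # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))            # collision: chain onto bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.put("ip", "192.168.0.1")
table.put("ip", "10.0.0.1")    # same key: overwrites the old value
value = table.get("ip")
```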
The method by which an element's characteristics are converted into an array subscript is the hashing method. Of course there is more than one hashing method; three of the more commonly used are listed below.
1. Division hashing
The most intuitive of the hashing methods, using the formula:
index = value % 16
Anyone who has studied assembly knows that taking a modulus is actually done through a division operation, hence the name "division hashing".
2. Square hashing
Computing the index is a very frequent operation, and multiplication takes less time than division (on current CPUs the difference is probably imperceptible), so we consider replacing the division with a multiplication and a shift operation. The formula:
index = (value * value) >> 28
If the values are fairly uniformly distributed, this method gives good results, but for the element values in my diagram above it computes an index of 0 for every one of them: a complete failure. You may also wonder: if value is large, won't value * value overflow? The answer is yes, but our multiplication does not care about the overflow, because we are not after the multiplication result itself, only an index.
3. Fibonacci hashing
The disadvantage of square hashing is obvious, so can we find an ideal multiplier instead of using value itself as the multiplier? The answer is yes.
1. For a 16-bit integer, this multiplier is 40503.
2. For a 32-bit integer, this multiplier is 2654435769.
3. For a 64-bit integer, this multiplier is 11400714819323198485.
Where do these "ideal multipliers" come from? They relate to a law called the golden ratio, and the most classical expression of the golden ratio is undoubtedly the famous Fibonacci sequence. If you are interested, search online for "Fibonacci sequence" and similar keywords; my level of mathematics is limited, and I do not know how to explain why. The values of the Fibonacci sequence are even said to agree surprisingly well with the orbital radii of the eight planets of the solar system; amazing, right?
For our common 32-bit integers, the formula:
index = (value * 2654435769) >> 28
If the elements are scattered with this Fibonacci hashing, my diagram turns out much more even.
Obviously, the Fibonacci hash is much better than the original modulus-based hashing method.
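The three hashing methods above can be placed side by side for 32-bit values; the sample values below (multiples of 16, a worst case for division hashing with 16 buckets) are chosen for illustration. The shift by 28 keeps the top 4 bits of the product, giving an index in 0..15.

```python
# The three hashing methods, for 32-bit values and 16 buckets.
MASK32 = 0xFFFFFFFF  # emulate 32-bit overflow, which C gets for free

def division_hash(value):
    return value % 16

def square_hash(value):
    return ((value * value) & MASK32) >> 28    # overflow deliberately ignored

def fibonacci_hash(value):
    return ((value * 2654435769) & MASK32) >> 28

values = [0, 16, 32, 48, 64]                   # multiples of 16: a worst case
div_idx = [division_hash(v) for v in values]   # all collide in bucket 0
fib_idx = [fibonacci_hash(v) for v in values]  # spread across the buckets
```

Division hashing maps every sample value to bucket 0, while the Fibonacci multiplier spreads the same values over distinct buckets, which is exactly the behavior described above.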
"Scope of Application"
Quick lookup and deletion; a basic data structure that usually requires that the total amount of data can be put into memory.
"Fundamentals and Essentials"
Choice of hash function: for strings, integers, permutations, there are specific corresponding hash methods.
Collision handling: one approach is open hashing, also known as the zipper (chaining) method; the other is closed hashing, also known as open addressing.
The d in d-left hashing means "multiple", so let us simplify the problem and look at 2-left hashing first. 2-left hashing means dividing a hash table into two halves of equal length, called T1 and T2, and equipping T1 and T2 each with its own hash function, h1 and h2. When a new key is stored, it is computed with both hash functions, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which location already stores more (colliding) keys, and store the new key in the less loaded location. If the two sides are equally loaded, for example both locations are empty or both already store one key, the new key is stored in the left sub-table T1, which is where the "2-left" comes from. When looking up a key, two hashes must be performed and both locations checked.
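The 2-left scheme just described can be sketched as follows. The hash functions, table size, and bucket-as-list representation are all simplifying assumptions for the sketch.

```python
# Sketch of 2-left hashing: two half-tables T1 and T2 with independent
# hash functions; a new key goes to the less loaded candidate bucket,
# with ties going to the left half T1.
SIZE = 8

def h1(key):
    return hash(("left", key)) % SIZE

def h2(key):
    return hash(("right", key)) % SIZE

T1 = [[] for _ in range(SIZE)]
T2 = [[] for _ in range(SIZE)]

def insert(key):
    b1, b2 = T1[h1(key)], T2[h2(key)]
    (b1 if len(b1) <= len(b2) else b2).append(key)  # tie -> left (T1)

def lookup(key):
    return key in T1[h1(key)] or key in T2[h2(key)]  # always two probes

for k in range(20):
    insert(k)
found = all(lookup(k) for k in range(20))
```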
1. From massive log data, extract the IP that visited Baidu the most times on a given day.
The number of IPs is still limited, at most 2^32, so we can consider using a hash to put the IPs directly into memory and then doing the statistics.
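The hash-then-count idea can be sketched as below; the log lines are invented sample data, with the IP assumed to be the first whitespace-separated field.

```python
from collections import Counter

# Sketch: hash each IP into an in-memory counter (at most 2^32 distinct
# keys fit the problem's bound), then take the most frequent one.
log_lines = [
    "1.2.3.4 GET /", "5.6.7.8 GET /", "1.2.3.4 GET /a",
    "1.2.3.4 GET /b", "5.6.7.8 GET /c",
]

counts = Counter(line.split()[0] for line in log_lines)
top_ip, top_hits = counts.most_common(1)[0]
```

If even the distinct IPs do not fit in memory, the log can first be split into smaller files by hash(ip) % 1000 and each file counted separately.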
Special Topic on Massive Data Processing (IV): Bit-map
"What is Bit-map?"
The so-called bit-map uses a single bit to mark the value corresponding to an element, with the element as the key. Since data is stored at the level of bits, storage space can be greatly saved.
If that is all there is to know about bit-maps, let us look at a concrete example. Suppose we want to sort the 5 elements (4, 7, 2, 5, 3), all in the range 0-7 (assuming the elements do not repeat). We can use the bit-map method to achieve the sort. To represent 8 digits we need only 8 bits (1 byte), so first we allocate 1 byte of space and set all its bits to 0 (as in the figure below):
Then we traverse the 5 elements. The first element is 4, so we set the bit corresponding to 4 to 1 (this can be done with p[i/8] |= 0x01 << (i%8); the operation here involves big-endian versus little-endian issues, and big-endian is assumed by default). Since indexing starts at zero, it is the fifth bit that is set to 1 (as shown below):
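The whole bit-map sort can be sketched as follows, using the same bit operation as above; the function name is invented for the example.

```python
# Sketch of bit-map sorting: mark each element i by setting bit (i % 8)
# of byte (i // 8), then read the set bits back in increasing order.
def bitmap_sort(elements, max_value):
    bitmap = bytearray((max_value + 8) // 8)   # one bit per possible value
    for i in elements:
        bitmap[i // 8] |= 0x01 << (i % 8)      # p[i/8] |= 0x01 << (i%8)
    return [i for i in range(max_value + 1)
            if bitmap[i // 8] & (0x01 << (i % 8))]

ordered = bitmap_sort([4, 7, 2, 5, 3], 7)
```

For the 5 elements above, a single byte suffices; sorting all 32-bit values the same way would need 2^32 bits, or 512 MB.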