More and more applications involve big data. Attributes such as volume, velocity, and variety make big data increasingly complex, so analysis is especially important: it can be the decisive factor in determining the value of the final information. With that in mind, what are the methods and theories of big data analysis?
Five basic aspects of big data analysis
Predictive analytic capabilities
Data mining helps analysts understand the data, and predictive analysis lets them make forward-looking judgments based on the results of visual analysis and data mining.
Data quality and master data management
Data quality and master data management are long-established management best practices. Processing data through standardized processes and tools guarantees a predefined, high-quality input for analysis.
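As a concrete illustration of a standardized quality check, the sketch below validates records against a simple schema before they reach analysis. The schema and field names are invented for this example; a real pipeline would use the organization's own data contracts.

```python
# Minimal sketch of a standardized data-quality check.
# The schema (field name -> expected type) is an illustrative assumption.
def validate_record(record, schema):
    """Return a list of quality problems found in one record."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

def clean_batch(records, schema):
    """Split a batch into records that pass the checks and those that fail."""
    good, bad = [], []
    for r in records:
        (good if not validate_record(r, schema) else bad).append(r)
    return good, bad
```

Running every batch through the same checks is what makes the downstream analysis "pre-defined, high-quality": bad records are quarantined instead of silently skewing results.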
Analytic visualizations
Data visualization is a basic requirement of data analysis tools, whether for data analysis experts or ordinary users. Visualization presents data intuitively, letting the data speak for itself and the audience see the results directly.
Semantic engines
The diversity of unstructured data brings new challenges to data analysis, so a range of tools is needed to parse, extract, and analyze it. A semantic engine must be designed to extract information from documents intelligently.
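A toy sketch of the extraction idea: pulling structured entities (here, email addresses and ISO-style dates) out of free text with regular expressions. A real semantic engine would use natural-language processing rather than fixed patterns, but the goal of turning unstructured text into structured fields is the same.

```python
import re

# Illustrative patterns only; real entity extraction is far more robust.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_entities(text):
    """Pull recognizable entities out of unstructured text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }
```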
Data mining algorithms
Visualization is for people; data mining is for machines. Algorithms such as clustering, segmentation, and outlier analysis let us dig into the data and mine its value. These algorithms must cope not only with the volume of big data but also with its velocity.
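Of the algorithm families just listed, outlier analysis is the simplest to sketch. The example below flags values that sit far from the mean, measured in standard deviations (the z-score); the threshold is a tunable convention, not part of the text above.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]
```

At big-data scale the same idea is applied per partition or with streaming estimates of the mean and deviation, since the full dataset never fits in one machine's memory.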
If big data really is the next major technological innovation, we would do well to focus on the benefits it can bring, not just the challenges.
Big data processing
Data processing in the big-data era involves three major shifts: use all the data rather than samples; favor efficiency over absolute accuracy; and look for correlation rather than causation. There are many concrete processing methods, but based on long practice the author summarizes a basic big data processing flow that should help make sense of how big data is handled. The whole process can be summarized in four steps: collection; import and preprocessing; statistics and analysis; and mining.
Big data collection refers to using multiple databases to receive data from clients; users can run simple queries and processing against these databases. For example, the ICC uses traditional relational databases such as MySQL and Oracle to store every transaction record, and NoSQL databases such as Redis and MongoDB are also often used for data collection.
The main characteristic and challenge of the collection stage is high concurrency: tens of thousands of users may access and operate on the system at the same time. Train-ticket booking sites and Taobao, for example, see peak concurrent visits in the millions, so a large number of databases must be deployed on the collection side to cope. How to load-balance and shard across these databases then requires careful thought and design.
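The simplest sharding scheme is hash-based: hash each record's key and map it to one of the collection databases. The shard names below are invented for illustration.

```python
import hashlib

# Hypothetical pool of collection databases.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for(key, shards=SHARDS):
    """Deterministically map a record key to one shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

One design note: with plain modulo hashing, adding or removing a shard remaps most keys, which is why production systems often use consistent hashing instead; that is exactly the kind of design question the paragraph above alludes to.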
Statistics and analysis mainly uses distributed databases or distributed computing clusters to perform ordinary analysis and classification summaries over the massive data stored in them, satisfying the most common analysis demands. For real-time needs, tools such as EMC Greenplum, Oracle Exadata, and the MySQL-based columnar store Infobright are used, while batch jobs, or jobs over semi-structured data, can use Hadoop. The main characteristic and challenge of this stage is the sheer volume of data involved, which heavily occupies system resources, especially I/O.
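The "classification summary" workload has the classic map/reduce shape that Hadoop parallelizes across a cluster. The miniature below runs both phases in one process just to show the shape; the record layout (category, amount) is an illustrative assumption.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; on a cluster this runs on each data node."""
    for category, amount in records:
        yield category, amount

def reduce_phase(pairs):
    """Aggregate values per key; on a cluster this runs after a shuffle."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

The I/O pressure mentioned above comes from the map phase reading the entire dataset and the shuffle moving intermediate pairs between nodes, which is why columnar stores like Infobright, which read only the needed columns, help the real-time case.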
Although the collection side itself has many databases, analyzing this massive data effectively requires importing it from the front end into a centralized large distributed database or distributed storage cluster, and some simple cleaning and preprocessing can be done during the import. Some users also employ Storm, from Twitter, to stream-process data during import and meet the real-time computing needs of certain businesses. The characteristic and challenge of the import and preprocessing stage is mainly the sheer quantity of imported data, which often reaches hundreds of megabytes or even gigabytes per second.
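Cleaning during import is naturally expressed as a streaming pipeline, so records flow through without ever being held in memory all at once; this is the same shape a Storm topology has, here sketched with plain Python generators over an assumed CSV-like format:

```python
def parse_lines(lines):
    """Parse 'user,amount' lines, dropping malformed rows."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            yield parts[0], parts[1]

def normalize(records):
    """Lowercase names and convert amounts, dropping non-numeric values."""
    for user, amount in records:
        try:
            yield user.lower(), float(amount)
        except ValueError:
            continue
```

Because each stage is a generator, chaining `normalize(parse_lines(source))` processes one record at a time, which is what makes the approach viable at the import rates described above.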
Unlike the statistics-and-analysis stage, data mining generally has no preset theme; it mainly runs various algorithms over the existing data to produce predictions and thereby meet higher-level analysis requirements. Typical algorithms include k-means for clustering, SVM for statistical learning, and naive Bayes for classification, and a commonly used tool is Apache Mahout on Hadoop. The characteristics and challenges of this stage are the complexity of the mining algorithms, the large volumes of data and computation involved, and the fact that commonly used data mining algorithms are single-threaded.
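To make the k-means idea concrete, here is a toy one-dimensional version showing the assign-then-update loop at its core; production tools such as Mahout run the same loop over many dimensions and distribute it across a cluster:

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and center updates."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # assignment step: each point joins its nearest center
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

The per-iteration pass over every point is also why the single-threaded implementations mentioned above struggle at big-data scale: each iteration rereads the whole dataset.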