With the development of the Internet, the mobile Internet, and the Internet of Things, there is no denying that we have entered an era of massive data. The research firm IDC estimated that the total volume of data created in 2011 would reach 1.8 trillion GB, and analyzing this massive data has become a very important and urgent need.
As an Internet data analysis company, we have been fighting in the trenches of massive data analysis for years. Under demanding business requirements and data pressure, we have tried nearly every feasible approach to big data analysis and ultimately settled on the Hadoop platform.
Hadoop has irreplaceable advantages in scalability, robustness, computational performance, and cost; in fact, it has become the mainstream data analysis platform for today's Internet companies. This article introduces the architecture of a multidimensional analysis and data mining platform built on Hadoop.
Classifying big data analysis requirements
The Hadoop platform is fairly specific about the workloads it targets. To help you judge whether it fits your business, this section roughly classifies big data analysis requirements from a few angles; different requirements call for different data analysis architectures.
By timeliness, data analysis can be divided into real-time analysis and offline analysis.
Real-time analysis is generally used in products such as finance, mobile, and Internet B2C, where queries over tens or even hundreds of billions of rows must often return within a few seconds so that the user experience is not affected. To meet this demand, one can build a parallel processing cluster from well-designed traditional relational databases, or adopt an in-memory computing platform or an HDD-based architecture, all of which require substantial investment in hardware and software. Newer tools for large-scale real-time data analysis include EMC Greenplum, SAP HANA, and others.
For the majority of applications whose response-time requirements are not so strict, such as offline statistical analysis, machine learning, building a search engine's inverted index, or computing recommendations, offline analysis is the right choice: log data is imported into a dedicated analysis platform by data acquisition tools. Faced with massive data, however, traditional ETL tools often fail completely, mainly because the cost of data format conversion is too high and their performance cannot keep up with massive data collection. Massive-data acquisition tools used by Internet companies include Scribe (open-sourced by Facebook), Kafka (open-sourced by LinkedIn), TimeTunnel (open-sourced by Taobao), and Hadoop's Chukwa; all of them can collect and transfer log data at hundreds of megabytes per second and upload it to a central Hadoop system.
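As a minimal sketch of this kind of log shipping, the snippet below sends access-log lines into a Kafka topic using the current Kafka Java client (which postdates the 2011-era tools listed above); the broker address, topic name, and log format are illustrative assumptions, not details from the original article.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogShipper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each raw access-log line becomes one message on the "access-logs" topic,
            // from where a downstream consumer can land it in HDFS for analysis.
            producer.send(new ProducerRecord<>("access-logs",
                    "2011-07-01T12:00:00\tGET /index.html\t200"));
        }
    }
}
```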
By data volume, analysis workloads can be divided into three levels: memory level, BI level, and massive level.
The memory level refers to data volumes that do not exceed the total memory of the cluster. Do not underestimate today's memory capacity: Facebook caches as much as 320 TB of data in memcached, and a single PC server can now hold well over a hundred gigabytes of RAM. At this level you can use an in-memory database and keep the hot data resident in memory, obtaining very fast analysis; this is well suited to real-time analysis. Figure 1 shows a practical MongoDB-based analysis architecture.
Fig. 1 MongoDB architecture for real-time analysis
Large MongoDB clusters currently have some stability problems, such as periodic write blocking and master-slave synchronization failures, but MongoDB remains a promising NoSQL option for high-speed data analysis.
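As a minimal sketch of memory-level analysis over hot data, the example below groups and sums documents with the MongoDB Java driver's aggregation pipeline. The database, collection, and field names (analytics, events, region, amount) are illustrative assumptions, not part of the architecture in Figure 1.

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

public class HotDataAggregation {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("analytics").getCollection("events");

            // Group the memory-resident hot data by region and sum the amount field.
            for (Document doc : events.aggregate(Arrays.asList(
                    Aggregates.group("$region", Accumulators.sum("total", "$amount"))))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```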
In addition, most server vendors now offer solutions that include SSDs; combining memory with SSDs can also deliver close to in-memory analysis performance. As SSDs mature, in-memory-style data analysis will become much more widely used.
The BI level refers to data volumes that are too large for memory but can usually still be analyzed by traditional BI products and specially designed BI databases. Mainstream BI products now offer solutions supporting terabyte-scale analysis; there are many varieties, which will not be enumerated here.
The massive level refers to data volumes at which databases and BI products fail entirely or become cost-prohibitive. Excellent enterprise-grade products exist at this scale as well, but given the hardware and software costs, most Internet companies currently store their data in Hadoop's HDFS distributed file system and analyze it with MapReduce. A multidimensional data analysis platform built on MapReduce on Hadoop is introduced later in this article.
Algorithmic Complexity of data analysis
The algorithms used for data analysis vary enormously with business requirements, and their complexity is closely tied to the choice of architecture. For example, Redis is a very high-performance in-memory key-value NoSQL store that supports simple collections such as lists, sets, and sorted sets. If your analysis need is simply sorting and ranking, a sorted set solves the problem, and as long as the total data volume does not exceed memory (more precisely, memory plus virtual memory, divided by two), Redis will deliver astonishing analysis performance.
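A minimal sketch of that ranking-style analysis with the Jedis client is shown below; the key and member names ("pv:rank", the page paths) are illustrative assumptions.

```java
import redis.clients.jedis.Jedis;

public class RedisRankingSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Count events per page in a sorted set; the score is the counter.
            jedis.zincrby("pv:rank", 1, "/index.html");
            jedis.zincrby("pv:rank", 1, "/product/42");
            jedis.zincrby("pv:rank", 3, "/index.html");

            // Read back the top 10 pages by score, highest first.
            for (String page : jedis.zrevrange("pv:rank", 0, 9)) {
                System.out.println(page + " -> " + jedis.zscore("pv:rank", page));
            }
        }
    }
}
```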
Many problems are embarrassingly parallel: the computation can be decomposed into completely independent parts, or can be converted into a distributed algorithm very easily. Large-scale face recognition and graphics rendering are examples for which a parallel processing cluster is the natural fit.
Most statistical analysis and machine learning problems can be rewritten as MapReduce algorithms. MapReduce is currently best suited to computing traffic statistics, recommendation engines, trend analysis, user behavior analysis, classifiers for data mining, distributed indexing, and similar workloads.
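As a minimal sketch of the kind of traffic statistics MapReduce handles well, the job below counts page views per URL from raw access logs. The tab-separated log format and the assumption that the URL is the second field are illustrative, not from the original article.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCount {

    public static class PvMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assume a tab-separated log line whose second field is the URL.
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {
                url.set(fields[1]);
                context.write(url, ONE);
            }
        }
    }

    public static class PvReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page view count");
        job.setJarByClass(PageViewCount.class);
        job.setMapperClass(PvMapper.class);
        job.setCombinerClass(PvReducer.class);
        job.setReducerClass(PvReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```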
Fig. 2 Row-column hybrid storage in RCFile
Some problems with OLAP analysis on big data
OLAP analysis requires large numbers of grouping operations and inter-table joins, which are clearly not the strong points of NoSQL stores or traditional databases; it usually calls for databases specifically optimized for BI. Most BI-optimized databases rely on techniques such as columnar or hybrid storage, compression, lazy loading, pre-computed statistics over storage blocks, and fragment indexes.
OLAP analysis on the Hadoop platform faces the same problem. Facebook developed the RCFile data format for Hive, which applies some of the optimization techniques above to achieve better analysis performance, as shown in Figure 2.
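As a hedged sketch of how RCFile is used in practice, the snippet below creates a Hive table stored in the RCFile format through the HiveServer2 JDBC driver. The connection URL, credentials, table name, and columns are illustrative assumptions rather than details from the original article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RcfileTableSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Columnar RCFile storage lets Hive read only the columns a query touches.
            stmt.execute("CREATE TABLE IF NOT EXISTS fact_pv ("
                    + " dt STRING, url STRING, user_id STRING, bytes BIGINT)"
                    + " STORED AS RCFILE");
        }
    }
}
```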
However, for the Hadoop platform, simply using Hive to imitate SQL is not enough for data analysis. First, although Hive's translation of HiveQL into MapReduce is optimized, it is still inefficient, and multidimensional analysis still has to join the fact table with dimension tables, so performance drops sharply as the number of dimensions grows. Second, RCFile's hybrid row-column storage in effect fixes the data format: the format is designed for a specific kind of analysis, and once the analysis model changes, converting massive data into a new format is extremely expensive. Finally, HiveQL is still very unfriendly to OLAP business analysts; dimensions and measures are the analytical language that business people work with directly.
The biggest problem with OLAP today is this: a flexible business inevitably means the business model changes frequently, and once the business dimensions and measures change, technical staff must redefine the entire cube (the multidimensional data cube), while business users can only run multidimensional analysis against that cube. This prevents business people from quickly changing the angle from which they analyze a problem and turns the so-called BI system into a rigid daily reporting system.
Using Hadoop for multidimensional analysis first solves the problem of dimensions being hard to change. Because Hadoop data is unstructured, the collected data can itself carry a large amount of redundant information, and a great deal of redundant dimension information can also be consolidated into the fact table; with these redundant dimensions, the angle of analysis can be changed flexibly. Second, thanks to MapReduce's powerful parallel processing, overhead does not grow significantly no matter how many dimensions an OLAP analysis adds. In other words, Hadoop can support a huge cube containing every dimension you can think of or expect, and each multidimensional analysis can involve hundreds of dimensions without significantly hurting analysis performance.
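A minimal sketch of that idea is shown below: the dimensions chosen by the analyst arrive through the job configuration, and the mapper simply concatenates the corresponding fields of each denormalized log record into one grouping key, so an extra dimension only lengthens the key. The configuration keys ("olap.dimensions", "olap.measure") and the tab-separated field layout are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DimensionKeyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final List<Integer> dimensionColumns = new ArrayList<>();  // columns picked by the analyst
    private int measureColumn;

    @Override
    protected void setup(Context context) {
        // Hypothetical configuration, e.g. olap.dimensions = "1,3,7" and olap.measure = "9".
        for (String col : context.getConfiguration().get("olap.dimensions", "").split(",")) {
            if (!col.trim().isEmpty()) {
                dimensionColumns.add(Integer.valueOf(col.trim()));
            }
        }
        measureColumn = context.getConfiguration().getInt("olap.measure", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length <= measureColumn) {
            return;  // skip malformed lines
        }
        // Adding a dimension just appends one more field to the grouping key.
        StringBuilder compositeKey = new StringBuilder();
        for (int col : dimensionColumns) {
            compositeKey.append(fields[col]).append('|');
        }
        context.write(new Text(compositeKey.toString()),
                new DoubleWritable(Double.parseDouble(fields[measureColumn])));
    }
}
```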
Fig. 3 Sketch of the MDX-to-MapReduce conversion
Therefore, supported by this huge cube, our big data analysis architecture exposes dimensions and measures directly to business people, who define them themselves; their business dimensions and measures are translated directly into MapReduce jobs, which ultimately generate the report. It can be understood as a tool that converts user-defined "MDX" (Multidimensional Expressions, a multidimensional cube query language) into MapReduce. At the same time, OLAP analysis and report presentation remain compatible with traditional BI and reporting products, as shown in Figure 3.
Figure 3 shows that, on the rows, the user can define a dimension of their own over yearly income. Users can also define custom dimensions on the columns, for example combining gender and education level into a single dimension. Because Hadoop data is unstructured, dimensions can be split and recombined arbitrarily according to business needs.
Figure 4 Hadoop Multidimensional Analysis Platform architecture diagram
The architecture of a Hadoop Multidimensional Analysis platform
The platform consists of four parts: a data acquisition module, a data redundancy module, a dimension definition module, and a parallel analysis module, as shown in Figure 4.
The data acquisition module uses Cloudera's Flume, which transfers and merges the massive number of small log files and guarantees safe data transmission. If a single collector goes down, data is not lost, and agents can automatically fail over to other collectors without affecting the operation of the whole collection system, as shown in Figure 5.
The data redundancy module is optional, but if the log data does not contain enough dimension information, or if dimensions need to be added frequently, a data redundancy module must be defined. A redundant-dimension definition specifies the dimension information to be made redundant and its sources (databases, files, memory, and so on), and specifies how that information is appended to the data log. With massive data, the data redundancy module is often the bottleneck of the whole system; it is advisable to use a fast in-memory NoSQL store for the redundancy step and to run it in parallel on as many nodes as possible, or to run a batch map-only job in Hadoop to transform the data format, as sketched below.
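The following is a hedged sketch of that batch-map option: a map-only mapper that appends a redundant dimension attribute (here, a hypothetical user_id-to-city lookup loaded in setup) to each raw log line before analysis. The lookup file "user_city.txt", its layout, and the assumption that user_id is the first log field are illustrative.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DimensionRedundancyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> userCity = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // A small dimension file shipped alongside the job (for example via the
        // distributed cache); "user_id<TAB>city" is an assumed layout.
        try (BufferedReader reader = new BufferedReader(new FileReader("user_city.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    userCity.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the first field of each log line is the user_id; append the
        // redundant city dimension so later analysis never needs a join.
        String[] fields = value.toString().split("\t", 2);
        String city = userCity.getOrDefault(fields[0], "unknown");
        context.write(new Text(value.toString() + "\t" + city), NullWritable.get());
    }
}
```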
Fig. 5 Acquisition Module
Figure 6 The logic of the core module
Figure 7 An example MapReduce workflow
The dimension definition module is the front-end module facing business users. Through a visual interface, users define dimensions and measures over the data log, and the module automatically generates the corresponding multidimensional analysis language; users can also use a visual analyzer to execute, through the GUI, the multidimensional analysis commands they have just defined.
The parallel analysis module accepts the multidimensional analysis commands submitted by users, has the core module translate each command into MapReduce jobs, submits them to the Hadoop cluster, and generates the report for presentation in the report center.
The core module is the parser that converts the multidimensional analysis language into MapReduce: it reads the user-defined dimensions and measures and translates the user's multidimensional analysis command into a MapReduce program. The concrete logic of the core module is shown in Figure 6.
Assembling the Map and Reduce classes from the JobConf parameters in Figure 6 is not complicated. The difficulty is that many practical problems cannot be solved by a single MapReduce job and must be composed into a workflow of multiple MapReduce jobs; this is the part that has to be tailored to your business. Figure 7 shows an example of a simple MapReduce workflow, and a sketch of such a two-step workflow follows.
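The sketch below chains two MapReduce jobs, feeding the first job's output directory into the second, which is the essence of the workflow idea in Figure 7. The identity Mapper/Reducer classes stand in for whatever classes the core module would assemble from the parsed command, and the paths and job names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnalysisWorkflow {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawLogs = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path report = new Path(args[2]);

        // Step 1: for example, group by the user-selected dimensions.
        // The core module would plug in its generated classes here; the identity
        // Mapper/Reducer stand in so the sketch runs as written.
        Job groupBy = Job.getInstance(conf, "group by dimensions");
        groupBy.setJarByClass(AnalysisWorkflow.class);
        groupBy.setMapperClass(Mapper.class);
        groupBy.setReducerClass(Reducer.class);
        groupBy.setOutputKeyClass(LongWritable.class);
        groupBy.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(groupBy, rawLogs);
        FileOutputFormat.setOutputPath(groupBy, intermediate);
        if (!groupBy.waitForCompletion(true)) {
            System.exit(1);
        }

        // Step 2: for example, post-aggregate or sort the grouped results for the report.
        Job postProcess = Job.getInstance(conf, "post process");
        postProcess.setJarByClass(AnalysisWorkflow.class);
        postProcess.setMapperClass(Mapper.class);
        postProcess.setReducerClass(Reducer.class);
        postProcess.setOutputKeyClass(LongWritable.class);
        postProcess.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(postProcess, intermediate);
        FileOutputFormat.setOutputPath(postProcess, report);
        System.exit(postProcess.waitForCompletion(true) ? 0 : 1);
    }
}
```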
The MapReduce output is generally the result of statistical analysis, and its volume is much smaller than the massive input data, so it can be imported into traditional data reporting products for presentation.
Conclusion
Of course, such a multidimensional analysis architecture is not without drawbacks. Since MapReduce itself computes by brute-force scanning of most of the data, it cannot optimize conditional queries the way traditional BI products do, nor does it have any notion of caching, so even many small queries end up mobilizing the whole cluster. Still, open-source Hadoop has solved big data analysis problems for a great many people, which is a truly immeasurable contribution.
The hardware and software requirements of a Hadoop cluster are very low, its per-gigabyte storage and computation costs are one percent or even one thousandth of those of other enterprise products, and its performance is excellent. With it, we can easily run multidimensional statistical analysis and machine learning over tens of billions or even trillions of records.
At the Hadoop Summit 2011 on June 29, Yahoo! spun off Hortonworks, a company dedicated to Hadoop development and operations; Cloudera has contributed a wealth of auxiliary tools; and MapR has released a parallel computing platform claimed to run MapReduce three times faster than Hadoop. Hadoop will soon see its next generation of products, which will surely offer more powerful analysis capabilities and easier use, so that we can truly face the challenges of future massive data with ease.