The author of this article, Shesu, is the data mining director at Admaster and a cloud computing practitioner with ten years of consulting experience in data warehousing and data mining, currently focused on massive-scale data mining and machine learning on distributed platforms.
The following is the full text of the article:
With the development of the Internet, the mobile Internet, and the Internet of Things, no one can deny that we have truly entered an era of massive data. The research firm IDC projected that the total volume of data would reach 1.8 trillion GB in 2011, and analyzing these massive data sets has become a very important and urgent need.
As an Internet data analysis company, we are "revolt" in the field of analysis of massive data. Over the years, we've tried almost every possible big data-analysis approach, with tough business needs and data pressures, and eventually landed on the Hadoop platform.
Hadoop has irreplaceable advantages in scalability, robustness, computational performance, and cost; in fact, it has become the mainstream big data analysis platform for today's Internet enterprises. This article introduces the architecture of a multidimensional analysis and data mining platform built on Hadoop.
Classifying big data analysis needs
The Hadoop platform is strongly shaped by the business it serves. To help you judge whether it matches your own business, this section classifies big data analysis needs from several angles; different kinds of needs call for different data analysis architectures.
1. By the real-time requirements of the analysis, it can be divided into two kinds: real-time data analysis and offline data analysis.
Real-time data analysis is generally used in products in areas such as finance, mobile, and Internet consumer services, which often require analysis over hundreds of billions of rows of data to return within a few seconds, so as not to affect the user experience. To meet this demand, one can use a well-designed traditional relational database as a parallel processing cluster, or adopt an in-memory computing platform or an HDD-based architecture, all of which undoubtedly require high hardware and software costs. Newer tools for large-scale real-time data analysis include EMC Greenplum, SAP HANA, and others.
For the majority of applications whose feedback-time requirements are not so strict, such as offline statistical analysis, machine learning, inverted index computation for search engines, and recommendation engine computation, offline analysis is the right choice: a data acquisition tool imports log data into a dedicated analysis platform. Faced with massive data, however, traditional ETL tools often fail outright, mainly because the cost of data format conversion is too high and their performance cannot keep up with massive data collection. Massive data acquisition tools used by Internet companies include Facebook's open source Scribe, LinkedIn's open source Kafka, Taobao's open source TimeTunnel, Hadoop's Chukwa, and others; all of them can collect and transmit hundreds of MB of log data per second and upload the data to a central Hadoop system.
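As a rough illustration of the acquisition step, the sketch below ships raw log lines into Kafka using the third-party kafka-python client. The broker address, topic name, and log path are assumptions made up for the example, not details from the original platform.

```python
# Minimal sketch: stream web-server log lines into a Kafka topic.
# "kafka-broker:9092", "weblogs", and the log path are illustrative.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")

with open("/var/log/nginx/access.log", "rb") as log_file:
    for line in log_file:
        # Each raw log line becomes one message on the "weblogs" topic;
        # a downstream consumer would batch these into HDFS for analysis.
        producer.send("weblogs", line.rstrip(b"\n"))

producer.flush()  # block until all buffered messages are delivered
producer.close()
```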
2. By data volume, big data analysis can be divided into three levels: memory level, BI level, and massive level.
The memory level here refers to data volumes that do not exceed the maximum memory capacity of the cluster. Do not underestimate today's memory capacities: Facebook caches up to 320 TB of data in memcached, and a current PC server can carry more than a hundred GB of memory. Therefore, an in-memory database can be used to keep hotspot data resident in memory, obtaining very fast analysis capabilities; this is well suited to real-time analysis businesses. Figure 1 shows a practical and feasible MongoDB analysis architecture.
▲ Figure 1 MongoDB architecture for real-time analysis
The MongoDB cluster currently has some stability problems, such as periodic write blocking and master-slave synchronization failures, but it is still a promising NoSQL option for high-speed data analysis.
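To make the memory-level pattern concrete, here is a minimal sketch of an aggregation query against such a MongoDB cluster using pymongo; the host, database, collection, and field names are illustrative assumptions.

```python
# Minimal sketch: a real-time style aggregation against MongoDB.
# Host, database, collection, and field names are made up for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-router:27017")
events = client["analytics"]["events"]

# Count page views per campaign; with hot data resident in memory,
# this kind of query can return fast enough for real-time dashboards.
pipeline = [
    {"$match": {"type": "pageview"}},
    {"$group": {"_id": "$campaign", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
    {"$limit": 10},
]
for row in events.aggregate(pipeline):
    print(row["_id"], row["views"])
```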
In addition, most server vendors have now launched solutions with SSDs of more than 4 GB; combining memory with SSD can also easily approach the performance of in-memory analysis. With the development of SSDs, memory-level data analysis is bound to be widely used.
The BI level refers to data volumes that are too large for memory but can generally be analyzed in a traditional BI product or a specially designed BI database. Mainstream BI products today offer solutions supporting TB-scale data analysis; they are many and varied, and will not be enumerated here.
The massive level refers to data volumes for which databases and BI products have become completely ineffective or prohibitively expensive. There are also many excellent enterprise-class products at this level, but given hardware and software costs, most Internet companies currently store their data on Hadoop's HDFS distributed file system and analyze it with MapReduce. A multidimensional data analysis platform based on MapReduce on Hadoop is introduced later in this article.
3. By the complexity of the data analysis algorithm.
Different business requirements call for very different data analysis algorithms, and the algorithmic complexity of the analysis is closely tied to the architecture. For example, Redis is a very high performance in-memory key-value NoSQL store that supports simple collections such as list, set, and sorted set. If your data analysis need is simply sorting or ranking that such collections can solve, and the total data volume is not larger than memory (more precisely, memory plus virtual memory, divided by two), then there is no doubt that using Redis will yield astonishing analysis performance.
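As a small illustration, the sketch below builds a top-N ranking with a Redis sorted set through the redis-py client; the key name and the page/score values are made up for the example.

```python
# Minimal sketch: ranking with a Redis sorted set via redis-py.
# The "pageviews" key and its members/scores are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

# zadd sets each member's score (use zincrby to increment instead);
# with view counts as scores, ZREVRANGE yields a top-N ranking.
r.zadd("pageviews", {"/home": 1520, "/pricing": 340, "/blog": 870})

# Fetch the ten highest-scoring pages, scores included.
for page, views in r.zrevrange("pageviews", 0, 9, withscores=True):
    print(page.decode(), int(views))
```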
There are also many embarrassingly parallel problems, where the computation can be decomposed into completely independent parts, or transformed very simply into a distributed algorithm; problems such as large-scale face recognition and graphics rendering are naturally better handled with a parallel processing cluster.
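The defining property is that no task needs any other task's result, which the following single-machine sketch shows with Python's multiprocessing pool; render_frame is a hypothetical stand-in for one independent unit of work, and the same decomposition is what makes the cluster version straightforward.

```python
# Minimal sketch of an embarrassingly parallel workload: every input
# is processed independently, so a process pool splits the work with
# no coordination between workers.
from multiprocessing import Pool

def render_frame(frame_id: int) -> str:
    # Placeholder for an expensive, fully independent computation,
    # e.g. rendering one frame or scoring one face image.
    return f"frame-{frame_id} rendered"

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # map() distributes the inputs across worker processes;
        # no worker ever needs another worker's result.
        for result in pool.map(render_frame, range(100)):
            print(result)
```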
Most statistical analysis and machine learning problems can be rewritten as MapReduce algorithms. The areas in which MapReduce currently excels include traffic statistics, recommendation engines, trend analysis, user behavior analysis, data mining classifiers, distributed indexing, and so on.
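As an example of how a traffic statistic maps onto this model, here is a minimal Hadoop Streaming sketch in Python that counts requests per URL. The log layout (URL in the seventh whitespace-separated field) is an assumption for illustration; in a real job the two scripts would be passed to the streaming jar via -mapper and -reducer.

```python
# mapper.py -- reads raw log lines on stdin, emits "url<TAB>1" pairs.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        # Assumed log layout: the requested URL is the 7th field.
        print(f"{fields[6]}\t1")
```

```python
# reducer.py -- Streaming sorts by key, so identical URLs arrive adjacent.
import sys

current_url, count = None, 0
for line in sys.stdin:
    url, value = line.rstrip("\n").split("\t", 1)
    if url != current_url:
        if current_url is not None:
            print(f"{current_url}\t{count}")
        current_url, count = url, 0
    count += int(value)
if current_url is not None:
    print(f"{current_url}\t{count}")
```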