Research on a Parallel Data Mining Tool Platform Based on Cloud Computing

Source: Internet
Author: User

As the telecom industry develops, competition among operators has become fiercer, and choosing the right business strategy is key to an operator's success. Telecom operators hold large volumes of user data. With data mining technology, they can extract business knowledge from billing data, service order data, network management data, and other massive user data sets, laying the foundation for precision marketing. As China Mobile's user base expands and application goals diversify, data mining applications face new challenges.

First, the user base keeps growing, and these users generate large volumes of data, including business data, billing data, and network management data. For example, a medium-sized provincial company has about 10 million users and produces approximately 12 to 16 TB of call detail record (CDR) data per year. Even for a very simple data mining business goal, the algorithm still needs to process about 10 GB of data after preprocessing (extract, transform, load; ETL). A provincial company's network management data is also massive and can reach the 1 TB level per day.
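To put the quoted volumes in perspective, the short calculation below derives the average CDR volume per user and per day from the figures above. It is only a back-of-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and a 365-day year, neither of which is stated in the text.

```python
# Back-of-envelope check of the CDR volumes quoted above.
# Assumption: decimal units (1 TB = 10**12 bytes) and a 365-day year.
TB = 10**12
GB = 10**9
MB = 10**6

users = 10_000_000                           # medium-sized provincial company
yearly_low, yearly_high = 12 * TB, 16 * TB   # CDR data produced per year


def per_user_per_year_mb(total_bytes, n_users):
    """Average yearly CDR volume per user, in MB."""
    return total_bytes / n_users / MB


def per_day_gb(total_bytes, days=365):
    """Average daily CDR volume for the whole company, in GB."""
    return total_bytes / days / GB


low_user = per_user_per_year_mb(yearly_low, users)
high_user = per_user_per_year_mb(yearly_high, users)
low_day = per_day_gb(yearly_low)
high_day = per_day_gb(yearly_high)

print(f"per user per year: {low_user:.1f} to {high_user:.1f} MB")
print(f"per day: {low_day:.1f} to {high_day:.1f} GB")
```

Under these assumptions the yearly total works out to roughly 1.2 to 1.6 MB per user, or about 33 to 44 GB of new CDR data per day.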

Second, as application requirements grow more complex and diverse, data mining applications demand more computing power and storage capacity from the IT support platform. They also increasingly require real-time results, because a timely business strategy can capture the market quickly.

These problems challenge traditional data mining systems, which are limited by running on centralized platforms built on Unix minicomputers. Taking a clustering application as an example, existing commercial data mining systems can only support knowledge discovery on one month of data for about 1 million users, which falls far short of actual requirements. Moreover, traditional IT support platforms are very expensive, and the high cost greatly reduces competitiveness.

The parallel data mining tool BC-PDM (Big Cloud-based Parallel Data Mining), developed by the China Mobile Research Institute, uses cloud computing technology to store, analyze, process, and mine massive data. It provides a highly reliable, high-performance data mining and analysis tool for operators' business systems and network management systems.

In terms of system architecture, the cloud-based parallel data mining tool platform consists of three layers: the distributed computing platform layer, the data mining platform layer, and the business application layer. Specifically:

(1) Distributed computing platform layer, which provides three functions:

- Distributed file system (DFS): provides distributed data file storage on a highly reliable, highly stable storage platform;

- Parallel programming environment: provides the MapReduce model, including task scheduling, task execution, and result feedback, and lets jobs be submitted to the platform;

- Distributed system management: manages the platform's distributed systems.
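The MapReduce model offered by the parallel programming environment can be illustrated with a small single-process sketch. This is not the platform's actual API: the record format (`user_id,seconds`), the function names, and the in-memory "shuffle" are all assumptions made for illustration. On a real cluster the map and reduce calls would run as distributed tasks over data stored in the DFS.

```python
from collections import defaultdict


def map_cdr(record):
    """Map: emit (user_id, call_seconds) for one CDR line 'user_id,seconds'."""
    user_id, seconds = record.split(",")
    yield user_id, int(seconds)


def reduce_usage(user_id, values):
    """Reduce: total call seconds for one user."""
    return user_id, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per key
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Hypothetical CDR lines, invented for this example.
cdr_lines = ["u1,60", "u2,30", "u1,45", "u3,10", "u2,5"]
totals = run_mapreduce(cdr_lines, map_cdr, reduce_usage)
print(totals)  # {'u1': 105, 'u2': 35, 'u3': 10}
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread across many machines, which is what lets the model scale to the data volumes described earlier.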

(2) Data mining platform layer, which provides five functions:

- Workflow module: provides overall control and scheduling of the data mining steps and modules;

- Data loading module: loads source data from external systems into the DFS of the cloud computing platform;

- Parallel ETL module: preprocesses raw data to produce the mining data. The parallel data mining tool submits the ETL task to the cloud computing platform, which executes it, returns the result, and stores the output in the DFS;

- Parallel data mining algorithm module: implements the required data mining algorithms. The parallel data mining tool platform submits the algorithm task (for example, a clustering algorithm) to the cloud computing platform, which executes it, returns the result, and stores the output in the DFS;

- Parallel result display module: presents the results of the parallel data mining algorithms to the user.
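As an illustration of how a clustering task might be expressed for a MapReduce platform, the sketch below runs k-means as repeated map and reduce steps. The source does not say which clustering algorithm BC-PDM uses, so k-means is an assumption here, as are the function names and the tiny two-dimensional data set. Everything runs in one process; on a cluster, the map step would be applied to data partitions in parallel.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map: emit (cluster index, (point, 1)) for one data point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reduce: average all points assigned to one cluster."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    return tuple(sum(p[d] for p, _ in values) / count for d in range(dims))


def kmeans_iteration(points, centroids):
    """One k-means pass: map every point, group by cluster, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A cluster that received no points keeps its previous centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical customer features, invented for this example.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centroids = kmeans_iteration(points, centroids)
print(centroids)  # [(1.0, 1.5), (9.5, 8.5)]
```

Only the current centroids need to be shared with every map task, so the per-iteration communication stays small even when the point set is very large.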

(3) Business application layer: implements telecom business applications that help the marketing department develop marketing strategies. Specific business applications include customer clustering and user career prediction.

Users can use the parallel data mining tool in two ways:

- Through the graphical user interface (GUI): users perform data loading, ETL operations, data mining algorithms, and result display through the tool to build the required application.

- Through the algorithm library API: users write their own application system and call the APIs in the algorithm library to implement the application's functions.
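The API-based path can be pictured as an application that chains the platform's modules in order: load, ETL, mine, display. The sketch below is purely illustrative. The source does not document the algorithm library's API, so the `Workflow` class, every function name, the sample data, and the mean-based segmentation step are hypothetical stand-ins, not BC-PDM calls.

```python
class Workflow:
    """Hypothetical workflow module: runs registered steps in order."""

    def __init__(self):
        self.steps = []

    def add_step(self, name, func):
        self.steps.append((name, func))
        return self  # allow chaining

    def run(self, data=None):
        # Each step receives the previous step's output.
        for _name, func in self.steps:
            data = func(data)
        return data


def load_data(_):
    # Stand-in for the data loading module: raw 'user_id,monthly_spend' lines.
    return ["u1,20", "u2,bad", "u3,180", "u4,35", "u5,220"]


def etl(lines):
    # Stand-in for the parallel ETL module: parse rows, drop malformed ones.
    rows = []
    for line in lines:
        user_id, spend = line.split(",")
        if spend.isdigit():
            rows.append((user_id, int(spend)))
    return rows


def segment(rows):
    # Placeholder mining step: split customers around the mean spend.
    # A real application would call a parallel algorithm from the library.
    mean = sum(spend for _, spend in rows) / len(rows)
    return {uid: ("high" if spend > mean else "low") for uid, spend in rows}


def display(segments):
    # Stand-in for the result display module.
    for user_id, label in sorted(segments.items()):
        print(f"{user_id}: {label}")
    return segments


result = (
    Workflow()
    .add_step("load", load_data)
    .add_step("etl", etl)
    .add_step("mine", segment)
    .add_step("display", display)
    .run()
)
```

Keeping each stage behind a plain function boundary mirrors the module split described above, so a stage could be swapped for a call into the real library without touching the others.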
