Understanding the Nine Laws of Data Mining


Data mining is the process of discovering and interpreting knowledge (or patterns) in data by applying business knowledge. The knowledge it produces is new, whether it takes a natural or an artificial form.

Data mining in its current form was born out of practice in the 1990s, as a form of business analysis made possible by the emergence of integrated data mining algorithm platforms. Perhaps because data mining grew out of practice rather than theory, it has received little attention as a process to be understood. The development of CRISP-DM in the late 1990s gradually established a standardized process for data mining, which has been successfully applied and followed by a growing number of practitioners.

While CRISP-DM describes how to carry out data mining, it does not explain what data mining is or why the process takes the form it does. In this article I set out my nine guidelines, or "laws," of data mining (most of them well known to practitioners), together with such explanations as are available, in order to begin explaining the data mining process theoretically rather than merely describing it.

My goal is not to critique CRISP-DM, but many of CRISP-DM's concepts are critical to understanding data mining, and this article relies on its common terminology. CRISP-DM is only the starting point for explaining the process.

First, the Goals Law: business goals are the origin of every data mining solution.

This law defines the subject of data mining: data mining is concerned with solving business problems and achieving business goals. Data mining is not primarily a technology but a process, with business objectives at its core. Without a business goal there is no data mining (whether or not that goal has been stated explicitly). This law can therefore be restated: data mining is a business process.

Second, the Knowledge Law: business knowledge is central to every step of the data mining process.

This defines a key feature of the data mining process. A naive reading of CRISP-DM is that business knowledge matters only when objectives are defined at the start and when results are deployed at the end; this misses a key property of the process, namely that business knowledge is central to every step.

For ease of understanding, I use the CRISP-DM phases to illustrate:

Business understanding must be grounded in business knowledge, and data mining objectives must be a mapping of business objectives (a mapping that also draws on knowledge of the data and of data mining);

Data understanding uses business knowledge to identify which data are relevant to the business problem, and how;

Data preprocessing uses business knowledge to shape the data so that the business question can be posed and answered (the third law, the Prep Law, covers this in more detail);

Modeling applies data mining algorithms to create predictive models, while interpreting the model's characteristics and the business objectives in terms of each other, that is, understanding their business relevance;

Evaluation is understanding the impact of the model on the business;

Deployment puts the data mining results to work in a business process.

In short, without business knowledge, no step of the data mining process is effective; there are no "purely technical" steps. Business knowledge guides the process toward useful results and makes it possible to recognize those results as useful. Data mining is an iterative process with business knowledge at its core, driving continual improvement of the results.

The reason behind this can be explained in terms of the "Chasm of Representation" (Alan Montgomery's term from data mining discussions in the 1990s). Montgomery pointed out that data mining objectives concern business reality, but data can represent only a part of that reality; there is a gap (the "chasm") between the data and the real world. In data mining, business knowledge bridges this gap: whatever is found in the data, only interpretation through business knowledge reveals its significance, and whatever is missing from the data must be supplied through business knowledge. Only business knowledge can bridge the gap, which is why business knowledge is central to every step of the data mining process.

Third, the Prep Law: data preprocessing matters more than any other part of the data mining process.

This is a famous maxim of data mining: the most laborious part of any data mining project is data acquisition and preprocessing. Informally, it takes up 50%-80% of project time. The simplest explanation is summed up as "data is hard," and automation is often proposed to ease this "problem" by reducing the workload of data acquisition, data cleansing, data conversion and other preprocessing tasks. While such automation is beneficial, its proponents believe it can remove most of the data preprocessing effort; this belief rests on a misunderstanding of why data preprocessing is necessary in data mining.

The purpose of data preprocessing is to transform the data into a form in which analytical techniques, such as data mining algorithms, can be applied. Any change to the data (including cleansing, min-max conversion, augmentation, and so on) implies a change in the problem space, which is why the analysis must be exploratory. This is the deeper reason data preprocessing exists and takes up so much of the data mining process: it is how the data miner manipulates the problem space, making it possible to find an analytical approach that works.
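As a concrete illustration of how even a simple transformation reshapes the problem space, here is a minimal sketch of min-max conversion in Python (the income figures are hypothetical): after rescaling, absolute magnitudes disappear and only each value's relative position within its range survives, so any pattern an algorithm finds is a pattern in the transformed space.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (min-max conversion).

    Absolute magnitudes are discarded; only each value's relative
    position between the column's minimum and maximum survives.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate case: a constant column carries no signal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical income column before modeling
incomes = [20_000, 50_000, 110_000]
scaled = min_max_scale(incomes)
print(scaled)  # smallest maps to 0.0, largest to 1.0
```

Note that the choice of transformation is itself a modeling decision: a different scaling (or none) presents the algorithm with a different problem.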

There are two ways of "shaping" this problem space. The first is to transform the data into a format that can be analyzed at all; for example, most data mining algorithms require data in a single table, with one record per example. The data miner knows what form the algorithm needs, and transforms the data accordingly. The second is to make the data carry more information relevant to the business problem; for example, in some domains the data miner knows, through business knowledge and data knowledge, which fields or derived features matter for a given data mining problem. With this domain knowledge, the data miner can manipulate the problem space so that a suitable technical solution becomes easy to find.
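The first kind of shaping can be sketched as follows: a minimal Python example (the field names and figures are hypothetical) that collapses raw transaction rows, several per customer, into the single table with one record per example that most mining algorithms expect.

```python
from collections import defaultdict

def to_customer_table(transactions):
    """Collapse (customer, date, amount) rows into one record per customer:
    the single-table, one-record-per-example form most algorithms require."""
    agg = defaultdict(lambda: {"n_purchases": 0, "total_spend": 0.0})
    for customer, _date, amount in transactions:
        agg[customer]["n_purchases"] += 1
        agg[customer]["total_spend"] += amount
    return dict(agg)

# Hypothetical raw transaction data: several rows per customer
rows = [
    ("alice", "2023-01-05", 120.0),
    ("alice", "2023-02-11", 80.0),
    ("bob",   "2023-01-20", 200.0),
]
table = to_customer_table(rows)
print(table)
```

Which aggregates to compute (counts, totals, recency, and so on) is exactly where the second kind of shaping enters: business knowledge decides which derived fields carry information about the problem.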

It is therefore business knowledge, data knowledge, and data mining knowledge that fundamentally make data preprocessing tractable. These aspects of data preprocessing cannot be achieved by simple automation.

This law also explains an otherwise puzzling observation: even where data acquisition, cleansing and fusion have already produced a data warehouse, data preprocessing remains essential and still accounts for more than half of the workload of the data mining process. Furthermore, as CRISP-DM shows, even after the main data preprocessing phase, further preprocessing is necessary during the iterative creation of a useful model.
