An hour to understand data mining ⑤: Data mining steps and common concepts (clustering, decision trees, and CRISP-DM)

Source: Internet
Author: User

Continuing from the first four parts of this series:

An hour to understand data mining ①: Resolving common Big Data application cases

One hour to understand data mining ②: Application of classification algorithm and mature case analysis

An hour to understand data mining ③: A detailed description of Big Data mining classification technology

One hour to understand data mining ④: The principle of business intelligence interpretation of the nine laws of data mining

There are many different ways to implement data mining, but if you just pull the data into an Excel table, that is only data analysis, not data mining. This article mainly explains the basic standard process of data mining. CRISP-DM and SEMMA are two common data mining processes.

General steps for data mining

From the point of view of the data itself, data mining usually involves eight steps: step (1), data integration, data reduction (data specification), data cleaning, data transformation, the data mining process itself, pattern evaluation, and knowledge representation.

Step (2) Data integration: Data from different sources, in different formats and with different characteristics, is brought together logically or physically, giving the enterprise comprehensive, shared data.
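As a minimal sketch of the idea, assume two hypothetical sources (a CRM export and a billing export) that share a `customer_id` key. Integration can then be pictured as merging records by that key. The field names and sample records are illustrative assumptions, not part of the original text.

```python
# Hypothetical records from two sources that describe the same customers.
crm_rows = [
    {"customer_id": 1, "name": "Alice", "city": "Hangzhou"},
    {"customer_id": 2, "name": "Bob", "city": "Beijing"},
]
billing_rows = [
    {"customer_id": 1, "total_spend": 320.0},
    {"customer_id": 3, "total_spend": 75.5},
]


def integrate(*sources):
    """Merge rows from several sources into one record per customer_id."""
    merged = {}
    for rows in sources:
        for row in rows:
            record = merged.setdefault(row["customer_id"], {})
            record.update(row)  # later sources add or overwrite fields
    return [merged[key] for key in sorted(merged)]


integrated = integrate(crm_rows, billing_rows)
for record in integrated:
    print(record)
```

Customers that appear in only one source end up with partial records, which is one reason the cleaning step exists.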

Step (3) Data reduction (sometimes rendered as data specification): Most data mining algorithms take a long time to run even on a small amount of data, and data mining on business operations data usually involves very large volumes. Data reduction techniques produce a reduced representation of the data set that is much smaller yet still closely preserves the integrity of the original data, so that mining the reduced data gives the same, or almost the same, results as mining the original data.
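One simple way to picture this step is random sampling, which is only one of several reduction techniques (aggregation and dimensionality reduction are others). The sketch below uses made-up purchase amounts and an arbitrary 5% sample size, both assumptions, and compares a summary statistic on the full and reduced data.

```python
import random
import statistics

# Hypothetical "large" data set: 10,000 purchase amounts.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
full_data = [rng.uniform(0, 100) for _ in range(10_000)]

# Reduced representation: a simple random sample of 5% of the rows.
reduced_data = rng.sample(full_data, k=500)

full_mean = statistics.mean(full_data)
reduced_mean = statistics.mean(reduced_data)
print(f"full:    n={len(full_data)}, mean={full_mean:.2f}")
print(f"reduced: n={len(reduced_data)}, mean={reduced_mean:.2f}")
```

The two means should be close, which is the sense in which the reduced set preserves the original.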

Step (4) Data cleaning: Some of the data in a database is incomplete (attributes of interest are missing values), noisy (attributes contain incorrect values), or inconsistent (the same information is represented in different ways). Data cleaning is therefore required so that complete, correct, and consistent data is stored in the data warehouse. Otherwise, the mining results will be unsatisfactory.
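The three problems named above can be sketched on a handful of made-up records. The plausible age range, the gender mapping, and the choice to fill bad values with the median are all assumptions for illustration; real projects choose these rules from domain knowledge.

```python
import statistics

# Hypothetical raw records showing the three problems named above.
raw = [
    {"id": 1, "gender": "M", "age": 34},
    {"id": 2, "gender": "male", "age": None},   # inconsistent and incomplete
    {"id": 3, "gender": "F", "age": 290},       # noisy: impossible age
    {"id": 4, "gender": "female", "age": 41},
]

GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def clean(rows, min_age=0, max_age=120):
    """Return cleaned copies of rows; the input list is left unchanged."""
    valid_ages = [r["age"] for r in rows
                  if r["age"] is not None and min_age <= r["age"] <= max_age]
    fill_age = statistics.median(valid_ages)
    cleaned = []
    for r in rows:
        age = r["age"]
        if age is None or not (min_age <= age <= max_age):
            age = fill_age  # replace missing or implausible values
        cleaned.append({
            "id": r["id"],
            "gender": GENDER_MAP[r["gender"].lower()],  # one representation
            "age": age,
        })
    return cleaned


cleaned_rows = clean(raw)
for row in cleaned_rows:
    print(row)
```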

Step (5) Data transformation: Data is transformed into a form suitable for data mining by means of smoothing, aggregation, data generalization, normalization, and so on. For some real-valued data, transforming the data through concept hierarchies and discretization is also an important step.
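Two of the transformations named above, normalization and discretization, can be sketched as follows. Min-max scaling is only one common form of normalization, and the income cut points are hypothetical values chosen for the example.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Map a real value to the label of the interval it falls in."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


incomes = [20_000, 35_000, 50_000, 120_000]
normalized = min_max_normalize(incomes)
print(normalized)

# Hypothetical cut points: three labels need two cut points.
bands = [discretize(v, [30_000, 80_000], ["low", "middle", "high"]) for v in incomes]
print(bands)
```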

Step (6) Data mining process: Based on the data in the data warehouse, choose appropriate analysis tools and apply statistical methods, case-based reasoning, decision trees, rule inference, fuzzy sets, or even neural networks and genetic algorithms to process the information and obtain useful analysis results.

Step (7) Pattern evaluation: From a business perspective, industry experts verify the correctness of the data mining results.

Step (8) Knowledge representation: The analysis results obtained by data mining are presented to the user visually, or stored as new knowledge in the knowledge base for use by other applications.

Data mining is an iterative process: if any step does not achieve its goal, you need to go back to earlier steps, adjust them, and run them again. Not every data mining job needs every step listed here. For example, step (2) can be omitted when a job has only one data source. Step (3) data reduction, step (4) data cleaning, and step (5) data transformation are collectively referred to as data preprocessing. In data mining, at least 60% of the cost may be spent in the step (1) phase, and at least 60% of the effort and time is spent on data preprocessing.

A few common concepts in data mining

In addition to the classification we covered earlier, data mining commonly uses concepts such as clustering algorithms, time series algorithms, estimation and prediction, and association algorithms. This section covers a few of these common concepts to deepen the reader's understanding of data mining.


Clustering

Clustering means grouping data into classes, or clusters (Cluster); a cluster is a collection of data objects.

As with classification, the purpose of clustering is to divide all the objects into different groups. The biggest difference from a classification algorithm is that a clustering algorithm does not know in advance how the data should be divided into groups, nor which variables the division should rely on.

Clustering, sometimes called segmentation, takes a group of people with the same characteristics and averages their features to form a "feature vector" or "centroid". A clustering system can usually divide similar objects into groups or subsets (subset) by static classification, so that the member objects in the same subset have similar properties. A clustering analysis can directly produce a report on the characteristics of different visitor groups or customer groups. Clustering is one of the core techniques of data mining, and besides being applied on its own, cluster analysis can also serve as a preprocessing step for other analysis algorithms.

[Figure: an example of clustering output.] Cluster1 and Cluster2 in the figure are the two groups of samples computed by the clustering algorithm. Points marked "+" belong to Cluster1, and points marked "0" belong to Cluster2.

In business, clustering can help market analysts separate different consumer groups from the consumer database, and summarize the consumption patterns or consumption habits of each category of consumer. As a module in data mining, it can be used as a separate tool to discover some deep-seated information distributed in a database, or to focus on a particular class for further analysis and to summarize the characteristics of each type of data.

Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
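As an illustration of the first family, k-means is a well-known partitioning method: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below is a plain-Python version under simplifying assumptions: the sample points are made up, and the centroids start at the first k points, whereas practical implementations usually pick starting centroids more carefully.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2-D samples forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Note that the number of clusters k is supplied by the analyst; the algorithm only decides which points belong together.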

For example, the following scenarios are well suited to clustering algorithms, and each has corresponding commercial applications:

Which clusters of specific symptoms may indicate a particular disease?

What kind of customers are renting the same type of car?

What features can be added to the online game to attract people?

Which customers are the customers we want to keep for a long time?

Beyond its own applications, clustering can complement other data mining methods. For example, it can be used as the first step of data mining, because individuals in different clusters may differ considerably. Take the question of which promotion works best for customers. For this kind of problem, first cluster the entire customer base so that customers fall into their own groups, then analyze each group separately with other data mining algorithms; the results will be better.

In later articles we will also describe in detail how clustering algorithms are implemented. The RFM model mentioned in this text is also a data mining model based on clustering. The RFM clustering model is also the most frequently used model in customer relationship management in the marketing field.

Estimation and prediction

Estimation (estimation) and prediction (prediction) are widely used in data mining. Estimation guesses a value that is currently unknown, while prediction forecasts an unknown value in the future. In many cases estimation and prediction can use the same algorithm. Estimation usually fills in a value that already exists but is unknown, whereas the value being predicted will occur in the future and often does not exist yet. For example, if we do not know someone's income, we can use quantities closely related to income to find other people with similar characteristics, and then use their incomes to estimate the unknown person's income and credit. To predict a person's future income, we can analyze the relationship between income and other variables, together with its changes over time in historical data, to predict what the income will be at some point in the future.

Estimation and prediction can also be combined in many cases. For example, we can estimate the number of children in a family and the family structure from its purchase patterns. Or we can estimate a household's income from its purchase patterns, and then predict which products the family will need in the future, in what quantities, and when.

Data analysis for estimation and prediction can be called predictive analytics. Because such applications are so widespread, many business customers and practitioners in the data mining industry now use predictive analytics as a synonym for data mining.

Regression analysis (regression), which we often hear about in data analysis, is a method frequently used for estimation and prediction. Regression analysis, or simply regression, refers to techniques that model the relationships among multiple variables in order to predict one of them, and it is applied very widely in data mining.
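The simplest case is a straight line fitted by ordinary least squares to one input variable. The sketch below fits such a line to a made-up income history (the numbers are assumptions for illustration) and extrapolates it to a later year.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical history: income (in thousands) by year of employment.
years = [1, 2, 3, 4, 5]
income = [30, 34, 37, 42, 45]

intercept, slope = fit_line(years, income)
predicted_year_7 = intercept + slope * 7
print(f"income = {intercept:.2f} + {slope:.2f} * year")
print(f"predicted income in year 7: {predicted_year_7:.2f}")
```

Filling in a missing value inside the observed range would be estimation; extending the line beyond year 5, as here, is prediction.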

Decision Tree

Of all the data mining algorithms, the decision tree, mentioned earlier, is probably the easiest to understand. A decision tree is essentially a flowchart of questions or data points that leads to a decision. For example, a decision tree for buying a car can start by asking whether the user wants a new 2012 model, then ask what type of car is wanted, then ask whether the user needs a performance car or an economy car, and so on until it determines what the user needs most. A decision tree system tries to create the optimal path and order the questions so that a decision can be reached in as few steps as possible.

According to statistics, in 2012 the three most frequently used algorithms in the data mining industry were decision trees, regression, and cluster analysis. And because decision trees are so intuitive, almost all professional data mining books start with some decision tree algorithm, such as ID3/C4.5/C5.0, CART, QUEST, or CHAID.
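To give a flavor of how such algorithms order their questions: ID3, the first algorithm named above, chooses the attribute to split on by information gain, that is, the reduction in entropy of the target labels. The sketch below computes that quantity on a few made-up rows; the attribute names and values are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training rows.
rows = [
    {"income": "high", "married": "yes", "rating": "good"},
    {"income": "high", "married": "no", "rating": "good"},
    {"income": "low", "married": "yes", "rating": "bad"},
    {"income": "low", "married": "no", "rating": "bad"},
]

gain_income = information_gain(rows, "income", "rating")
gain_married = information_gain(rows, "married", "rating")
print(gain_income, gain_married)
```

In this toy data, income separates the ratings perfectly while marital status tells us nothing, so a gain-based algorithm would ask about income first.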

Some decision trees are built in very fine detail and use most of the data attributes. This can be a trap, because one problem to avoid with decision tree algorithms is building a tree that is too large and too complex. Overly complex decision trees tend to overfit (over-fitting), are unstable, and are sometimes impossible to interpret.

At this point we can decompose a large decision tree into smaller decision trees to solve the problem.

Let's look at a commercial decision tree example. It is a decision tree built with the IBM SPSS Modeler data mining software, a model used by an American commercial bank to judge customers' credit ratings.

The tree is built on income, number of credit cards, and age, with 80% accuracy as the threshold for splitting. The first branch looks at income: two cut points divide the population into three groups, low-income, middle-income, and high-income. The low-income node becomes a leaf node directly: 82.0976% of this group has a poor (bad) credit rating, and neither the number of credit cards nor age helps classify it further. The second level of the tree is based on the number of credit cards already held, which splits the high-income group. Among high-income people with 5 or more cards, 82.4176% have a good (good) credit rating; among those with fewer than 5 cards, as many as 96.8944% do. The tree has 6 leaf nodes in total, so the population ends up in 6 groups. In one of these groups only 56.3147% have a good credit rating, so it cannot be judged reliably. The best result in the data is the group with high income and fewer than 5 credit cards, which is judged to have a good credit rating with 96.8944% accuracy.
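Read as rules, the branches described above amount to nested conditions. The sketch below encodes only what the text states. The income cut points are not given, so the income band is taken as an input, and the middle-income branches are not described, so the function returns `None` for them. The function name and return shape are assumptions for illustration, not SPSS Modeler output.

```python
def credit_rating(income_band, num_cards):
    """Return (predicted_rating, share_of_leaf_with_that_rating).

    Only the branches spelled out in the text are encoded; the
    middle-income branches are not described there, so they return None.
    """
    if income_band == "low":
        return "bad", 0.820976      # leaf: no further split helps
    if income_band == "high":
        if num_cards >= 5:
            return "good", 0.824176
        return "good", 0.968944     # best-performing leaf
    return None  # middle income: further splits not given in the text


print(credit_rating("low", 3))
print(credit_rating("high", 6))
print(credit_rating("high", 2))
```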

If we have other data in hand, such as whether the customer owns a home or a car, or is married, then through testing we can further improve the accuracy of the decision tree.


CRISP-DM

In 1999, the CRISP-DM Special Interest Group (SIG), funded by the European Commission and formed by SPSS, DaimlerChrysler, NCR, and OHRA, developed and refined CRISP-DM (Cross-Industry Standard Process for Data Mining) and trialed it in practice on large-scale data mining projects.

CRISP-DM provides an overview of the data mining life cycle. It covers the phases of a project, their respective tasks, and the relationships between these tasks. At this level of description it is not possible to identify all relationships. Relationships can exist between any data mining tasks, depending on the user's goals, background, and interests, and most importantly on the data. The SIG has released electronic versions of the CRISP-DM Process Guide and User Manual, and CRISP-DM also has an official website. Within the group, SPSS is a data mining software vendor, while the other founders are users of data mining. As a result, CRISP-DM fits very well with SPSS Modeler, which SPSS developed.

The life cycle of a data mining project consists of six phases. The order of the six phases is not fixed, and we often need to move back and forth between them. Which phase comes next depends on the outcome of each phase, or of a particular task within a phase, and on whether that outcome is the input the next phase requires. The arrows in the CRISP-DM diagram indicate the most important and most frequent dependencies between phases.

The outer circle in the CRISP-DM diagram represents the cyclic nature of data mining itself: after a solution has been deployed, another round of data mining begins. The knowledge gained along the way can trigger new, often more focused, business questions. Subsequent rounds benefit from the experience of the previous ones.

The phases of the CRISP-DM data mining life cycle are explained below:

Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan for achieving the objectives.

Data Understanding

The data understanding phase starts with initial data collection and continues with activities that help you become familiar with the data, identify data quality problems, gain first insights into the data, or detect interesting subsets from which to form hypotheses about hidden information.

Data Preparation

The data preparation phase covers all activities that construct the final data set from the initial raw data. This data set will be the input to the modeling tools. The tasks in this phase can be performed multiple times and in no prescribed order. They include selecting tables, records, and attributes, as well as transforming and cleaning data for the modeling tools.

Modeling

In this phase, various modeling techniques are selected and applied, and the model parameters are tuned to optimal values. Typically, several techniques can solve the same type of data mining problem. Some techniques have specific requirements on the form of the data, so it is often necessary to step back to the data preparation phase.

Evaluation

By this phase you have built a model that appears to be of high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly, review the steps taken to construct it, and make sure it achieves the business objectives. A key objective of this phase is to determine whether any important business issue has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Coming next: Evaluation of data mining results and visualization

Excerpt from Tan Lei's book "Big Data Mining". To be continued ...


When reprinting, please credit 36 Big Data.
