How can we fully understand "data mining"? What is the theoretical basis of data mining?
Figure 1 shows:
In reality, human social and economic activities can always be described and recorded as data (numbers or symbols). Analyzing this data yields information (knowledge); that information guides practice and supports decisions; and those decisions in turn trigger a new round of social and economic activity. The cycle repeats without end.
What are the differences between data warehouse (DW), business intelligence (BI), and knowledge discovery (KDD)?
The dotted line in Figure 2 has two meanings.
First, when these concepts were originally proposed, each focused on a different part of the DM value chain: data warehousing on "building the warehouse", data mining and knowledge discovery on "processing", and business intelligence on "applications". The dotted line marks these original boundaries.
Second, drawing the boundaries as solid lines would not be accepted by either academics or vendors, because the database vendors (IBM, Sybase, NCR, Oracle, Microsoft, etc.), the statistical analysis software vendors (SAS, Statistica, SPSS, etc.), and even the reporting tool vendors (BO, Brio, Cognos, etc.) are all working hard to extend their own value chains.
It is therefore simpler to call the whole thing Data Management (DM), a term that covers everything involving data.
As for ERP, CRM, and the like: put bluntly, they are still DM, only limited to specific social and economic activities.
Six mining weapons
Data warehouse construction and data mining modeling are the two major technical points in the DM value chain. In the narrow sense, data mining covers only the step from data to knowledge. The minimum requirement for a data mining practitioner is a thorough command of the capabilities, limitations, and applicable conditions of the various mining tools.
Broadly, data mining has six weapons: descriptive statistics, association and correlation, classification and clustering, prediction, optimization, and structural equation models. Each is described briefly below.
(1) Descriptive statistics
Descriptive statistics is the most accessible weapon in data mining. It is intuitive and simple, and in the hands of an expert even so plain a tool can be highly effective. Descriptive statistics include the mean, median, mode, quantiles, percentages, and sums. They are often paired with statistical charts such as histograms, bar charts, line charts, scatter plots, and stem-and-leaf plots. OLAP, now in widespread use, is in essence descriptive statistics computed over different groupings of the data.
Descriptive statistics are used everywhere, for example to report the company's total profit for the current month or the sales volume of different regions.
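As a minimal sketch of descriptive statistics computed per group, in the spirit of an OLAP roll-up, the following Python example summarizes sales by region. The records, region names, and function names are hypothetical and serve only as illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical sales records: (region, monthly sales amount).
sales = [
    ("North", 120.0),
    ("North", 80.0),
    ("South", 200.0),
    ("South", 100.0),
    ("South", 150.0),
]


def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def describe_by_group(records):
    """Group (key, value) records by key and describe each group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: describe(values) for key, values in groups.items()}


summary = describe_by_group(sales)
for region, stats in summary.items():
    print(region, stats)
```

Each region gets its own count, sum, mean, and median, which is the same kind of answer an OLAP query gives when it slices a measure by a dimension.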
(2) Association and correlation
In essence, an association rule is a conditional probability: when A occurs, what is the probability that B occurs as well? As long as that probability is far from 50%, the rule is informative.
A classic example of association rules is "beer and diapers". When applying association rules, you also need to ask how much support a rule has, that is, how many cases it covers. Suppose only one person in a supermarket ever buys diapers, and that person buys beer every time. The rule has 100% confidence, but it means little because it rests on a single customer.
Two points deserve attention when applying association rules: association does not imply causation, and association is directional (the rule from A to B is not the same as the rule from B to A).
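The two quantities discussed above are usually called support (the share of transactions a rule covers) and confidence (the conditional probability itself). A minimal sketch of both, using hypothetical shopping baskets and illustrative function names, might look like this:

```python
# Hypothetical shopping baskets; each basket is a set of items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread", "beer"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return count(antecedent | consequent, transactions) / n_antecedent


rule_support = support({"diapers", "beer"}, baskets)
diapers_to_beer = confidence({"diapers"}, {"beer"}, baskets)
beer_to_diapers = confidence({"beer"}, {"diapers"}, baskets)
print(rule_support, diapers_to_beer, beer_to_diapers)
```

In this toy data the rule "diapers, then beer" has a different confidence from "beer, then diapers", which illustrates why the direction of a rule matters.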
Correlation also examines the relationship between two things. Typical measures include the Pearson correlation coefficient and the Kendall correlation coefficient.
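For illustration, both coefficients can be computed from scratch in a few lines. The sketch below uses the tau-a variant of Kendall's coefficient, which counts tied pairs as neither concordant nor discordant (statistical packages often default to a tie-corrected variant instead). The data and variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # sign == 0 is a tie and counts toward neither total
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical paired observations, e.g. ad spend and sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson(ad_spend, sales)
tau = kendall_tau_a(ad_spend, sales)
print(round(r, 4), tau)
```

Pearson measures linear association on the raw values, while Kendall only looks at whether pairs are ordered the same way, so the two can differ noticeably on the same data.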
(3) Classification and clustering
Classification and clustering are the most commonly used techniques.
Broadly, there are three families of classification methods: regression, decision trees, and neural networks.
The biggest difference between the two is that classification is supervised and clustering is unsupervised. "Supervision" means there is a standard to learn from, namely a target variable. Clustering has no target: as the saying goes, "birds of a feather flock together". Clustering does not know in advance what features each group has; only after clustering can we summarize each group and discover what its members have in common.
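To make the unsupervised idea concrete, here is a minimal sketch of k-means on one-dimensional data. k-means is only one of many clustering algorithms and is chosen here for brevity; the customer spending values and starting centroids are hypothetical. Note that no target variable appears anywhere: the groups emerge from the data alone, and only afterward do we summarize them.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids with k-means."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are provided.
spend = [10, 12, 11, 50, 52, 49]
centers, groups = kmeans_1d(spend, centroids=[10, 50])
print(centers)
print(groups)
```

After clustering, the centroid of each group (its mean spend) is the kind of summary that lets an analyst describe what the members of a group have in common.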
(4) Prediction
Time series analysis is the most commonly used method of prediction; regression can also be used.
Common time series methods include ARMA, exponential smoothing, and trend extrapolation. Their defining feature is that they exploit the regularities in how a quantity changes over time. In the absence of unusual external factors, most quantities, such as a company's sales, follow some regular pattern.
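Of the methods listed, simple exponential smoothing is the easiest to sketch. The example below assumes the basic form in which each smoothed value is a weighted average of the newest observation and the previous smoothed value; the smoothing factor `alpha`, the function names, and the sales figures are illustrative assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; returns one smoothed value per point."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialize the level with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def forecast_next(series, alpha):
    """One-step-ahead forecast: the last smoothed level."""
    return exponential_smoothing(series, alpha)[-1]


# Hypothetical monthly sales figures.
monthly_sales = [100, 110, 120]
levels = exponential_smoothing(monthly_sales, alpha=0.5)
next_month = forecast_next(monthly_sales, alpha=0.5)
print(levels, next_month)
```

A larger `alpha` makes the forecast react faster to recent changes, while a smaller one smooths out noise. This simple form does not model trend or seasonality.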
(5) Optimization
Optimization is a concept from operations research. One of the main problems it solves is how to allocate resources sensibly under a set of constraints so as to maximize (or minimize) an objective.
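A toy resource-allocation problem shows the shape of such a task. The sketch below assumes two hypothetical products with made-up profits and capacity limits, and finds the best integer production plan by exhaustive search. Real problems of any size would use a linear or integer programming solver instead; brute force is used here only to keep the example self-contained.

```python
def best_plan(max_x=10, max_y=10):
    """Search integer plans (x, y) for the highest profit that is feasible.

    Hypothetical problem: maximize 3x + 5y subject to
    x <= 4, 2y <= 12, and 3x + 2y <= 18, with x, y >= 0.
    """
    best = None
    for x in range(max_x + 1):
        for y in range(max_y + 1):
            feasible = x <= 4 and 2 * y <= 12 and 3 * x + 2 * y <= 18
            if not feasible:
                continue
            profit = 3 * x + 5 * y
            if best is None or profit > best[0]:
                best = (profit, x, y)
    return best


profit, x, y = best_plan()
print(profit, x, y)
```

The three inequalities play the role of the constraints and the profit expression plays the role of the objective.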
(6) Structural equation model
Unlike the methods above, the structural equation model focuses on revealing the internal structure of things and how their parts interact. For example, how should customer satisfaction be measured? How is it related to customer expectations, products, prices, services, complaint handling, and customer loyalty, and through what mechanism? Only by clarifying these relationships can customer satisfaction and loyalty be improved continuously. This is the role the structural equation model plays.
How is the data fully presented?
From the application perspective, DM is not just the organization or presentation of data, nor just data analysis and statistical modeling. It is a complete process that runs from understanding business needs, through seeking solutions, to being tested in practice.
The industry has many methodologies for guiding project practice, CRISP-DM being a typical example.
CRISP-DM is divided into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Take a restaurant as an analogy. Business understanding is learning the customer's tastes. Data understanding is knowing which dishes can be made from each ingredient. Data preparation is choosing, sorting, and washing the ingredients according to the customer's tastes and the chef's experience. Modeling depends entirely on the chef's cooking skill. In the evaluation phase the customer tastes the food, and if the customer is satisfied, the final phase is to promote the dish more widely. The DM process is a complete service process that takes the customer from an empty stomach to satisfaction.
A successful DM project can target the operational level, increasing automation, and it can also target the decision-making level, improving decisions.
Detailed implementation plan
The NCR data mining methodology divides the implementation of a data mining project into five stages: defining the scope of the business problem, selection and sampling, exploratory data analysis, modeling, and implementation.
1. Define the scope of the business problem: In this initial stage, the project objectives and the customer's business needs should be stated clearly so that the data mining problem is well defined. Tasks include clarifying business objectives, defining response variables, and making any necessary adjustments to the project plan.
2. Selection and sampling: At this stage, the modeling team searches and reviews the customer's data to produce a shortlist of variables for later analysis and mining, then samples the data population to generate a training set, a validation set, and a test set. Tasks include identifying data sources, data mapping, assessing data readiness, any necessary data aggregation, and data sampling.
3. Exploratory data analysis (data exploration): At this stage, the modeling team examines the current data sources and tries to determine whether each candidate independent variable is related to the target variable. Numerical analysis is usually the first step toward understanding the data; the statistical analysis that follows gives a better picture of the data distribution. This is a key stage in the data mining process.
Tasks include checking data quality; cleaning the data where necessary; understanding the data through graphical tools and other statistical methods; analyzing the relationship between the candidate variables and the target variable; transforming data to assist analysis; deriving new variables in preparation for modeling; and organizing and presenting the findings.
4. Modeling: At this stage, the modeling team builds and validates the mining model. The team usually tries different modeling techniques or different datasets, measures the performance of each model, and selects the best one. Business domain knowledge from end users is critical here, because end users can evaluate and validate the model's results, interpret the findings, and put them into practice.
Tasks include preparing datasets for model training and validation; applying appropriate modeling techniques to build the model; testing model performance across the different techniques; refining the mining model as necessary; reviewing the mining model with subject-matter experts; and documenting the mining model and its results.
5. Implementation: At this stage, the model results are used to support business decisions, strategy design, and tactical execution. Feedback on the implementation results is collected to detect model degradation and to further improve model performance. A complex presentation-layer interface is usually unnecessary for using model results. Automating data mining is an indispensable part of a CRM (customer relationship management) solution, and it is therefore treated as a project separate from a typical data mining project.
Tasks include scoring customers with the model and storing the results, tracking performance, and integrating further with other business systems. Data mining automation and field testing of model results are each handled as separate projects.
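The sampling step in stage 2, which draws a training set, a validation (verification) set, and a test set from the data population, can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions for illustration; the methodology described here does not prescribe them.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    The test set receives whatever remains after the first two shares.
    """
    if train <= 0 or validation <= 0 or train + validation >= 1:
        raise ValueError("train and validation must be positive and sum to < 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]
    return training_set, validation_set, test_set


# Hypothetical customer IDs standing in for the data population.
customers = list(range(100))
training_set, validation_set, test_set = split_dataset(customers)
print(len(training_set), len(validation_set), len(test_set))
```

Keeping the three sets disjoint matters: the model is fitted on the training set, tuned and compared on the validation set, and judged once on the test set.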
The project plan for a data mining project covers all of the stages above, but the time needed to complete it depends on several key factors: the complexity of the mining topic, the customer's expectations for evaluating the mining results, the completeness and quality of the available data, and whether the project has enough people with the right skills. For example, Table 1 is a two-month data mining project plan (40 working days) that can serve as a reference for other mining projects.
The project plan in Table 1 shows that the roles involved in a data mining project include data mining experts, PDM (Physical Data Model) modelers, ETL developers, and application developers. In addition, people familiar with the business and people familiar with the data warehouse PDM should be available to support the project.