Data Mining with Me (19): What Is Data Mining (2)

Source: Internet
Author: User

What is a data warehouse?

A data warehouse is a subject-oriented, integrated, nonvolatile (relatively stable), and time-variant (reflecting historical change) collection of data that supports management decisions. The concept can be understood at two levels:

① A data warehouse supports decision-making and is oriented toward analytical data processing, which distinguishes it from an enterprise's existing operational databases.

② A data warehouse effectively integrates multiple heterogeneous data sources. The integrated data is reorganized by subject and includes historical data, and data stored in the warehouse is generally not modified.

An enterprise data warehouse is built on top of the existing business systems and the large volume of business data they have accumulated. A data warehouse is not a static concept: information is only useful, and only meaningful, when it reaches the users who need it in time for them to make decisions that improve their business operations. The basic task of a data warehouse is therefore to organize, summarize, and restructure information and deliver it promptly to the relevant management decision-makers.

Data cube and OLAP

A data cube models and views data in multiple dimensions.

Figure: a data cube for customers, products, and sales.

OLAP multidimensional analysis operations include drill-down, roll-up, slice, dice, and pivot (rotation).

Drill-down: move between levels of a dimension from a higher level to the next lower one, splitting aggregated data into more detailed data. For example, drilling down on the second quarter of 2010 shows the monthly consumption data for April, May, and June 2010.

Roll-up: the inverse of drill-down, aggregating fine-grained data to a higher level. For example, the sales data for Jiangsu Province, Shanghai, and Zhejiang Province can be rolled up to view sales for the Jiangsu-Zhejiang-Shanghai region.

Slice: select a specific value in one dimension for analysis, such as only the sales data for electronic products, or only the data for the second quarter of 2010.

Dice: select an interval or a set of specific values in a dimension for analysis, such as sales data from the first quarter of 2010 through the second quarter of 2010, or sales data for electronic products and commodities.

Pivot (rotation): swap the positions of dimensions, like transposing the rows and columns of a two-dimensional table. For example, a pivot can exchange the product dimension and the region dimension.
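The five operations above can be tried on a tiny in-memory fact table. The following is only a sketch using plain Python; the table `SALES`, its column names, and the helper functions `aggregate`, `slice_`, `dice`, and `pivot` are hypothetical names invented for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table (illustrative numbers only).
SALES = [
    {"region": "Jiangsu", "product": "electronics", "quarter": "2010Q2", "month": 4, "amount": 10},
    {"region": "Jiangsu", "product": "electronics", "quarter": "2010Q2", "month": 5, "amount": 20},
    {"region": "Shanghai", "product": "electronics", "quarter": "2010Q2", "month": 6, "amount": 30},
    {"region": "Zhejiang", "product": "clothing", "quarter": "2010Q1", "month": 3, "amount": 40},
    {"region": "Shanghai", "product": "clothing", "quarter": "2010Q2", "month": 4, "amount": 50},
]


def aggregate(rows, dims):
    """Sum `amount` grouped by the listed dimension names."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice(rows, allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def pivot(table):
    """Pivot: swap the two dimensions of a two-dimensional aggregate."""
    return {(b, a): v for (a, b), v in table.items()}


# Quarter-level view, then drill-down into the months of 2010Q2.
by_quarter = aggregate(SALES, ["quarter"])
by_month = aggregate(slice_(SALES, "quarter", "2010Q2"), ["month"])

# Roll-up: from per-region totals to one total for the whole area.
by_region = aggregate(SALES, ["region"])
region_total = aggregate(SALES, [])

# Dice: two quarters, electronics only.
diced = dice(SALES, {"quarter": {"2010Q1", "2010Q2"}, "product": {"electronics"}})

# Pivot: (region, product) becomes (product, region).
region_by_product = aggregate(SALES, ["region", "product"])
product_by_region = pivot(region_by_product)
```

Grouping by fewer dimensions is a roll-up and grouping by more is a drill-down, so a single `aggregate` helper covers both directions.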

Four kinds of problems that data mining solves

1. Classification

Classification is used in many fields. For example, a bank can build a classification model from customer categories to assess loan risk. Modern marketing places strong emphasis on customer segmentation, and this is where customer category analysis comes in: classification techniques from data mining divide customers into categories. A call center, for instance, might distinguish customers who call frequently, customers who occasionally place a large number of calls, customers with stable call volume, and others, and then identify the characteristics of each type. Such a classification model lets users understand how customers with different behavior patterns are distributed.

Other applications include automatic text classification in literature retrieval and search engines, and classification-based intrusion detection in the security field. Researchers in machine learning, expert systems, statistics, and neural networks have proposed many specific classification methods. The classification process can be described briefly as follows:

Training: training set --> feature selection --> training --> classifier

Classification: new sample --> feature selection --> classification --> decision

Figure: an example of a decision tree-based classifier.
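The training and classification flow can also be sketched in code. The example below is a minimal sketch of a one-level decision tree (a decision stump) that splits on a single numeric feature by minimizing Gini impurity. The feature (calls per month), the labels, and the function names `train_stump` and `classify` are assumptions made for illustration; feature selection is taken as already done, and real decision trees split recursively on many features.

```python
from collections import Counter

# Hypothetical training set: (calls per month, customer category).
TRAIN = [
    (2, "occasional"), (3, "occasional"), (5, "occasional"),
    (20, "frequent"), (25, "frequent"), (30, "frequent"),
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(samples):
    """Training: choose the threshold with the lowest weighted Gini impurity.

    Assumes the samples contain at least two distinct feature values.
    """
    values = sorted({x for x, _ in samples})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2  # candidate split between neighbours
        left = [y for x, y in samples if x <= threshold]
        right = [y for x, y in samples if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
        if best is None or score < best[0]:
            best = (score, threshold, majority(left), majority(right))
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label


def classify(stump, x):
    """Classification: route a new sample down the single split."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


stump = train_stump(TRAIN)
label_low = classify(stump, 4)
label_high = classify(stump, 22)
```

On this toy data the stump separates the two groups at a threshold between 5 and 20 calls, so a new customer with 4 calls lands in the "occasional" branch and one with 22 calls in the "frequent" branch.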

2. Clustering

Clustering divides data objects into several classes so that objects in the same class are highly similar and objects in different classes are dissimilar. From this simple description, the key to clustering is how to measure the similarity between objects. Common measures of similarity include distance and density.

The principle of cluster analysis can be illustrated by grouping playing cards, which can be divided by suit, by symbol, by color, or by closeness in value.

Figure: an example of clustering.
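Distance-based clustering can be sketched with k-means, one common algorithm among many. The following is a minimal sketch in plain Python, assuming two-dimensional points and Euclidean distance; the sample points, the deterministic initialization from the first k points, and the function names are illustrative choices, not something prescribed by the article.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centroids.

    Initial centroids are simply the first k points, which keeps the
    sketch deterministic; real implementations usually initialize randomly.
    """
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample: two visibly separate groups of points.
POINTS = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(POINTS, k=2)
```

Objects that are close together end up in the same cluster, which is the "high similarity within a class" property described above, with distance as the similarity measure.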

3. Prediction

Prediction in data mining has something in common with prediction in the Zhouyi (the Book of Changes). The Zhouyi builds on the duality of yin and yang and on a classification of all things by their characteristics (zodiac) to make predictions about how things will develop. Many scholars hold that the theoretical basis of the Zhouyi is the similarity, relevance, and holographic nature of all things, three principles that are said to be confirmed by modern science. "Holographic" means that a part of a thing contains information about the whole. For example, a forensic examiner can learn many of the physical characteristics of a victim or suspect by testing a single hair.

The Book of Changes predicts the future state of things by studying historical events, accumulating experience, and drawing on the similarities and correlations between things. Data mining prediction studies the input and output values of sample data (historical data) to obtain a prediction model, and then uses the model to predict the output for future inputs. In general, prediction models can be built with machine learning methods. DM (data mining) is based on artificial intelligence (machine learning), but DM uses only some of the mature algorithms and techniques from AI, so it is much less complex and difficult than AI as a whole.

Machine learning: Assume there is a functional relationship y = f(x, β) between the input and output of a thing, where x is the input variable and β is a parameter to be determined. Then y = f(x, β) is called the learning machine. Through data modeling, the machine learns from sample data (typically historical data containing both input and output values) to obtain the value of β. This fixes the specific form of y = f(x, β), so that y can be predicted for a new x. This process is called machine learning.
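As a concrete instance of learning β from sample data, the sketch below assumes the simplest possible form, f(x, β) = β0 + β1·x, and estimates β by ordinary least squares. The linear form, the sample values, and the function names `fit_linear` and `f` are assumptions chosen for illustration; the article does not commit to any particular f.

```python
def fit_linear(xs, ys):
    """Learn beta = (beta0, beta1) for y = beta0 + beta1 * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def f(x, beta):
    """The learning machine y = f(x, beta) with a linear form."""
    beta0, beta1 = beta
    return beta0 + beta1 * x


# Hypothetical historical data generated by y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

beta = fit_linear(xs, ys)   # learning: obtain beta from the samples
prediction = f(5, beta)     # prediction: apply the model to a new x
```

Once β is fixed, the model is just an ordinary function, which is why prediction for a new x is a single call.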

Data modeling differs from conventional mathematical modeling in that it builds a mathematical model from data, in contrast to building one from the basic principles of physics, chemistry, and other disciplines (that is, mechanism modeling). For prediction, if the object under study has a clear mechanism, a model based on that mechanism is of course the best choice. In practical problems, however, mechanism modeling is often not feasible, whereas historical data is usually readily available, so data modeling can be used instead.

Typical machine learning methods include decision trees, artificial neural networks, support vector machines, and regularization methods. Other common prediction methods include the nearest neighbor method and naive Bayes (which belongs to statistical learning).

Figure: the prediction model.

4. Association

Association analysis examines the probability that items or products appear together.

Among data mining algorithms, association rule mining is an important one. Popularized especially by market basket analysis, association rules are applied in many real businesses.

Like clustering, association rule mining is an unsupervised learning method. It describes the patterns by which items appear together in a single transaction. In real life, for example in supermarket shopping, customer purchase records often imply many association rules, such as "65% of customers who buy a ballpoint pen also buy a notebook". Using such rules, store staff can plan the placement of goods more effectively. On an e-commerce website, association rules can reveal which users prefer which kinds of goods; when similar customers are found, products bought by one can be recommended to the others, increasing the site's revenue.

Figure: an example of association analysis.
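A rule such as "customers who buy a pen also buy a notebook" is usually scored by its support and confidence. The following is a minimal sketch in plain Python; the transactions are made-up illustrative data (not the 65% figure quoted above), and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical purchase records, one set of items per transaction.
TRANSACTIONS = [
    {"pen", "notebook"},
    {"pen", "notebook", "eraser"},
    {"pen"},
    {"notebook", "eraser"},
    {"pen", "notebook"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All item pairs whose support reaches `min_support` (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result


pen_support = support({"pen"}, TRANSACTIONS)                       # 4 of 5 baskets
rule_confidence = confidence({"pen"}, {"notebook"}, TRANSACTIONS)  # 3 of those 4
pairs = frequent_pairs(TRANSACTIONS, min_support=0.5)
```

The brute-force pair search is only practical for tiny item sets; algorithms such as Apriori prune the candidates, but the support and confidence definitions stay the same.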


The CRISP-DM model provides a complete process description for a KDD (knowledge discovery in databases) project. The model divides a KDD project into six distinct phases, but their order is not strictly fixed.

1. Business Understanding: In the first phase we must understand the project's requirements and ultimate purpose from a business perspective, and tie these goals to the data mining problem definition and expected results.

2. Data Understanding: Collect the data, become familiar with it, and evaluate what is available.

3. Data Preparation: Organize and clean the available raw data so that it meets the modeling requirements.

4. Modeling: Apply data mining tools to build the model.

5. Evaluation: Evaluate the model that was built, focusing in particular on whether the results meet the business objectives set in the first phase.

6. Deployment: Put the solution into practice and organize the discovered results and the process into a readable text form (the data mining report).

Business Understanding: The business understanding phase should be considered the most important part of data mining. Here we need to identify the business objectives, assess the business environment, determine the mining objectives, and produce a project plan.
Data Understanding: Data is the "raw material" of the mining process. In this phase we need to know what data is available and what its characteristics are; these characteristics can be obtained through descriptive analysis of the data.
Data Preparation: In the data preparation phase we need to select, clean, reconstruct, and merge the data. We select the data to be analyzed and normalize data that does not meet the model's input requirements.
Modeling: Modeling is also one of the more important phases in data mining. We need to select modeling tools suited to the purpose of the analysis, build models from samples, and assess the models.
Evaluation: Not every model matches our goals. The evaluation phase assesses the modeling results; we need to analyze the reasons for poor results, and sometimes we need to return to earlier steps and redefine the mining process.
Deployment: This phase uses the established model to solve the actual problem. It also includes monitoring, maintenance, producing the final report, and re-evaluating the model.
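The cleaning and normalization mentioned under data preparation can be sketched as follows. This is a minimal sketch assuming numeric records, `None` as the marker for a missing value, and min-max scaling to the range [0, 1]; the sample records and the function names `clean` and `min_max_normalize` are hypothetical, and other cleaning or scaling strategies are equally valid.

```python
def clean(rows):
    """Cleaning: drop records that contain a missing (None) value."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_normalize(rows):
    """Normalization: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0  # constant column
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical raw records: (age, monthly spend); one spend value is missing.
RAW = [(20, 1000), (30, None), (40, 3000), (60, 5000)]

cleaned = clean(RAW)
prepared = min_max_normalize(cleaned)
```

Scaling puts columns with very different units on a comparable footing, which matters for distance-based models in the modeling phase.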


The sections above introduced the data warehouse and the data cube, and then the four kinds of problems that data mining solves. Any data mining problem can first be classified into one of these four kinds and then solved with the corresponding algorithm.

Finally, this article introduced the CRISP-DM model, a widely used standard process model developed by an industry consortium that included SPSS (now part of IBM), which provides theoretical guidance for the data mining process. The next article will explore how user profiles (user portraits) are built from user-generated data.
