Data Mining notes (2)

Last Update:2018-12-05 Source: Internet

Author: User

Common Data Mining Methods

Common data mining methods used for data analysis include classification, regression analysis, clustering, association rules, feature analysis, change and deviation analysis, and Web page mining. Each mines data from a different perspective.

① Classification.

Classification identifies the common characteristics of a group of data objects in a database and assigns them to different classes according to a classification model. The purpose is to use the model to map data items in the database to one of a given set of categories. It can be applied to customer classification, customer attribute and feature analysis, customer satisfaction analysis, and customer purchase trend prediction. For example, a car retailer can divide customers into categories based on their car preferences, so that marketers can mail brochures for new cars directly to the customers who prefer that type of car, which greatly increases business opportunities.

② Regression analysis.

Regression analysis reflects the temporal characteristics of attribute values in a transaction database. It produces a function that maps data items to a real-valued prediction variable in order to discover dependencies between variables or attributes. Its main research problems include the trend characteristics of data sequences, the prediction of data sequences, and the correlation between data sequences. It can be applied to all aspects of marketing, such as customer acquisition, customer retention and churn prevention, product lifecycle analysis, sales trend prediction, and targeted promotions.
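As a minimal sketch of a function that maps data items to a real-valued prediction variable, the following fits a straight line by ordinary least squares and uses it to forecast the next value. The sales figures and the name fit_line are made up for illustration and do not come from the notes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: month index versus units sold.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(slope, intercept, forecast)
```

Real sales data is rarely this linear, so the fitted line should be read as a trend estimate and not as an exact prediction.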

③ Clustering.

Clustering analysis divides a group of data into several categories based on similarity and difference. The purpose is to make the similarity between data of the same category as large as possible, and the similarity between data of different categories as small as possible. It can be applied to customer group classification, customer background analysis, customer purchase trend prediction, and market segmentation.
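One common way to group data by similarity is k-means, sketched below for 2-D points. The notes do not name a specific algorithm, so k-means is an assumption here, and the points, starting centroids, and the name kmeans are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers described by two numeric attributes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Points that end up in the same cluster are close to each other and far from the other cluster, which is the similarity goal described above.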

④ Association rules.

Association rules describe relationships between data items in the database. Based on the appearance of certain items in a transaction, they allow you to infer that other items also appear in the same transaction, which reveals associations or correlations hidden in the data. Example: 90% of the customers who bought bread bought milk at the same time.

In customer relationship management, mining the large amount of data in an enterprise's customer database can uncover interesting associations among a large number of records and identify the key factors that affect marketing results. This supports decisions on product positioning, pricing, and customization for customer groups; customer acquisition, segmentation, and retention; marketing and promotion; marketing risk assessment; and fraud prediction.
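A rule such as "customers who buy bread also buy milk" is usually measured by its support and confidence. The sketch below computes both for a small set of made-up transactions. The data and the function names support and confidence are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data three of the four bread transactions also contain milk, so the confidence of "bread implies milk" is about 0.75. The 90% figure in the bread and milk example is a confidence value of the same kind.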

⑤ Feature analysis.

Feature analysis extracts characteristic patterns from a set of data in the database. These patterns express the overall features of the dataset. For example, by extracting the characteristics of customer churn factors, marketing staff can identify the reasons and main features that lead to churn, and use them to prevent churn more effectively.

⑥ Change and deviation analysis.

Deviations contain a large amount of potentially interesting knowledge, such as abnormal instances in classification, exceptions to patterns, and deviations of observed results from expected values. The objective is to find meaningful differences between the observed results and a reference value. In enterprise crisis management and early warning, managers are more interested in unexpected rules. Mining unexpected rules can be applied to the discovery, analysis, identification, evaluation, and early warning of various kinds of abnormal information.

⑦ Web page mining.

With the rapid development of the Internet and the spread of the Web, the amount of information available online has become extremely rich. Web mining makes it possible to analyze massive amounts of Web data and to collect information about politics, the economy, policy, technology, finance, markets, competitors, supply and demand, and customers. The focus is on analyzing and processing external environment information and internal business information that has a major, or potentially major, impact on the enterprise. Based on the analysis results, an enterprise can find the problems in its management process that may lead to a crisis, and then identify, analyze, evaluate, and manage that crisis.

Functions of Data Mining

Data mining predicts future trends and behaviors so that decisions can be proactive and knowledge-based. Its goal is to discover hidden, meaningful knowledge in the database, and it mainly provides the following five functions.

1. Automatic prediction of trends and behaviors (Prediction)

Data mining automatically searches for predictive information in large databases. Problems that used to require a large amount of manual analysis can now be answered quickly from the data itself. A typical example is market prediction: data mining uses data from past promotions to find the users likely to yield the greatest return on future investment. Other predictable problems include forecasting bankruptcy and identifying the groups most likely to respond to a specified event.

2. Association analysis (Association)

Data association is an important kind of discoverable knowledge in databases. If there is a regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, time series associations, and causal associations. The purpose of association analysis is to find the hidden network of associations in the database. Sometimes the association function of the data is unknown, or known only with uncertainty, so the rules produced by association analysis carry a measure of credibility (confidence).

3. Clustering and classification (Classification & Clustering)

The records in the database can be divided into a series of meaningful subsets, that is, clustering. Clustering enhances people's understanding of objective reality and is a prerequisite for conceptual description and Deviation Analysis. Clustering technology mainly includes traditional pattern recognition methods and mathematical taxonomy.

4. Concept description (Concept Description)

Concept description describes the connotation of a class of objects and summarizes the relevant features of that class. Concept descriptions are divided into characteristic descriptions and discriminative descriptions. The former describes the features shared by objects of one class, and the latter describes the differences between objects of different classes. Generating the characteristic description of a class only involves the commonality of all objects in that class. Many methods can be used to generate a discriminative description, such as decision trees and genetic algorithms.

5. Deviation detection (Deviation)

The data in a database often contains some abnormal records, and it is worthwhile to detect these deviations. Deviations contain much potentially useful knowledge, such as abnormal instances in classification, exceptions that do not satisfy the rules, deviations between observed results and model predictions, and changes in a value over time. The basic method for deviation detection is to find meaningful differences between the observed results and the reference values.
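The idea of comparing observed results against reference values can be sketched as follows. This example treats a historical sample as the reference and flags new observations that lie more than a chosen number of standard deviations from its mean. The threshold of 3, the data, and the name find_deviations are assumptions for illustration, and this is only one simple way to define a "meaningful" difference.

```python
from statistics import mean, pstdev


def find_deviations(observed, reference, threshold=3.0):
    """Return observed values that lie far from the reference mean.

    "Far" means more than threshold reference standard deviations away.
    """
    ref_mean = mean(reference)
    ref_std = pstdev(reference)
    return [x for x in observed if abs(x - ref_mean) > threshold * ref_std]


# Hypothetical daily order counts: a historical baseline and new observations.
history = [100, 102, 98, 101, 99]
new_values = [101, 180, 97]

print(find_deviations(new_values, history))
```

With this baseline only the value 180 is reported, because 101 and 97 stay within three standard deviations of the historical mean of 100.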

Common data mining technologies

1. Artificial Neural Networks

Neural networks have received growing attention because they offer a relatively effective and simple way to tackle highly complex problems. They can handle problems with hundreds of parameters and are commonly used for classification and regression.

In terms of structure, a neural network can be divided into an input layer, an output layer, and a hidden layer (see figure 4). Each node in the input layer corresponds to one predictor variable.

Except for the nodes in the input layer, each node in the neural network is connected to many nodes before it (the input nodes of this node). Each connection has a weight Wxy. The value of a node is obtained by multiplying the value of each input node by the weight of its connection, summing these products, and passing the sum through a function. This function is called the activation function or squashing function.
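The computation of a single node can be sketched as follows. The sigmoid is used as the activation function here as an assumption, since the notes do not say which function is meant, and the weights, the network shape, and the names node_value and forward are made up. No bias term is included because the notes do not mention one.

```python
import math


def sigmoid(z):
    """A common squashing function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_value(inputs, weights):
    """Weighted sum of the input node values, passed through the activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(weighted_sum)


def forward(inputs, hidden_weights, output_weights):
    """One hidden layer followed by a single output node."""
    hidden = [node_value(inputs, w) for w in hidden_weights]
    return node_value(hidden, output_weights)


# Hypothetical network: 2 input nodes, 2 hidden nodes, 1 output node.
hidden_weights = [[0.5, -0.25], [0.1, 0.3]]
output_weights = [1.0, -1.0]
print(forward([1.0, 2.0], hidden_weights, output_weights))
```

Training a network means adjusting the weights so that the output matches known examples. That step is outside the scope of this sketch.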

2. Decision Tree

A decision tree presents rules of the form "under these conditions, this value is obtained". For example, in a loan application, we need to determine the risk of the application. Figure 7 is a decision tree built to solve this problem. It shows the basic components of a decision tree: decision nodes, branches, and leaves.
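The three components can be illustrated with a small hand-built tree for the loan example. The figure is not reproduced in these notes, so the questions, thresholds, and field names below are hypothetical and are not taken from Figure 7.

```python
# Each dict is a decision node, its "yes" and "no" entries are branches,
# and plain strings are leaves. All questions and thresholds are hypothetical.
tree = {
    "question": lambda a: a["income"] > 40000,
    "yes": {
        "question": lambda a: a["high_debt"],
        "yes": "high risk",
        "no": "low risk",
    },
    "no": {
        "question": lambda a: a["years_in_job"] > 5,
        "yes": "low risk",
        "no": "high risk",
    },
}


def classify(node, applicant):
    """Follow branches from the root until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](applicant) else node["no"]
    return node


applicant = {"income": 50000, "high_debt": False, "years_in_job": 1}
print(classify(tree, applicant))
```

In practice the tree is learned from historical loan data and not written by hand. The sketch only shows how a finished tree is read.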

3. Genetic Algorithms

Genetic algorithms are optimization techniques based on evolutionary theory. Their design borrows from genetic recombination, genetic mutation, and natural selection.
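A minimal sketch of these three ingredients is shown below on a toy problem: evolving a bit string with as many 1 bits as possible. The objective, the parameter values, and the name evolve are assumptions chosen only to keep the example short.

```python
import random


def evolve(bits=20, population_size=30, generations=60, mutation_rate=0.02, seed=0):
    """Toy genetic algorithm that maximizes the number of 1 bits in a list."""
    rng = random.Random(seed)
    fitness = sum  # hypothetical objective: count of 1 bits
    population = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(population_size)]
    for _ in range(generations):
        # Natural selection: only the fitter half survives and reproduces.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            # Recombination: splice two parents at a random cut point.
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, bits)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(sum(best), best)
```

Because the surviving parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.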

4. Nearest Neighbor Algorithm

A method that classifies each record in a dataset according to the records most similar to it.
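A minimal sketch of k-nearest-neighbor classification follows: a new record receives the majority label of the k labeled records closest to it. The training data, the choice of k = 3, squared Euclidean distance, and the name knn_classify are assumptions for illustration.

```python
from collections import Counter


def knn_classify(record, labeled_records, k=3):
    """Return the majority label among the k labeled records nearest to record."""
    by_distance = sorted(
        labeled_records,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(record, item[0])),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled records: (features, class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.5), "B"), ((8.5, 9.0), "B"),
]

print(knn_classify((1.2, 1.4), training))
print(knn_classify((8.2, 8.1), training))
```

The method has no training step. All the work happens at classification time, when distances to the stored records are computed.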

5. Rule Derivation

Rule derivation searches the data for "if-then" rules that hold in a statistical sense and derives them.

Data Mining and Online Analytical Processing (OLAP)

A frequently asked question is what the difference is between data mining and OLAP. As explained below, they are entirely different tools based on quite different technologies.

OLAP is part of the decision support field. Traditional query and report tools tell you what is in the database (What happened), while OLAP goes further and tells you what will happen next (What next) and what will happen if you take a certain measure (What if). The user first establishes a hypothesis and then uses OLAP to query the database to verify whether the hypothesis is correct. For example, an analyst who wants to find out why loans fall into arrears may start with the assumption that low-income people have poor credit, and then use OLAP to verify it. If this assumption is not confirmed, he may look at high-debt accounts. If that does not work either, he may consider income and debt together, and keep going until he finds the result he wants or gives up.

That is to say, OLAP analysts establish a series of assumptions, and then use OLAP to confirm or overturn these assumptions to finally reach their own conclusions. OLAP analysis is essentially a process of deductive reasoning. However, if dozens or hundreds of variables are analyzed, it is very difficult and painful to use OLAP to manually analyze and verify these assumptions.

Data mining differs from OLAP in that it is not used to verify whether a hypothesized model is correct, but to find a model in the database. In essence, it is an inductive process. For example, an analyst using data mining tools wants to find the risk factors that cause loan defaults. The tools may help him find that high debt and low income are causes, and may even find factors the analyst never thought of or tried, such as age.

Data mining and OLAP are complementary. Before acting on the conclusions of data mining, you may want to verify what impact the action would have on the company, and OLAP tools can answer that kind of question.

OLAP tools are also useful in the early stages of knowledge discovery. They help you explore the data, find the variables that matter for a problem, and discover abnormal data and variables that affect each other. This helps you understand your data better and speeds up the process of knowledge discovery.

Considerations

Specifically, the following eight issues should be considered:
1. Very large databases and high-dimensional data;
2. Missing data;
3. Changing data and knowledge;
4. Understandability of the discovered patterns;
5. Processing of data in non-standard formats, multimedia data, and object-oriented data;
6. Integration with other systems;
7. KDD in networked and distributed environments;
8. Privacy.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
