Differences Between Data Mining and Statistics (Study Notes on Guide to Intelligent Data Analysis)

Source: Internet
Author: User

When it comes to data mining, we tend to focus on the modeling algorithms and neglect the other steps. In real-world data mining projects, however, those other steps are the key to whether a project succeeds or fails. Guide to Intelligent Data Analysis, a book recommended on the KNIME website (http://tech.knime.org/guide-to-intelligent-data-analysis), describes the data mining process following CRISP-DM.

Let's start with data mining. To understand what data mining is, you must first distinguish between data and knowledge.

Comparison of the features of data and knowledge:

Data | Knowledge
-----|----------
Involves single instances (a single thing, person, event, point in time, etc.) | Involves classes of instances (sets of things, people, events, points in time)
Describes individual properties | Describes general patterns, structures, laws, rules, etc.
Is usually available in large quantities (databases, archives) | Consists of as few statements as possible
Is usually relatively easy to collect (for example, supermarket receipts or data on the Internet) | Is usually difficult and time-consuming to find and acquire
Does not allow us to make predictions | Allows us to make predictions and forecasts

We also need to explore the differences and relationships between statistics and KDD (knowledge discovery in databases).

Statistics:

Statistics is a discipline with a long history. It originated from collecting and analyzing data about populations and states. It can be divided into descriptive statistics and inferential statistics.

Descriptive statistics generally summarizes data with characteristic values such as the mean, or with charts such as histograms. It generally makes no specific assumptions about the data.
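The kind of summary described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample values, the 10-year bin width, and the helper name render_histogram are all made up for this example and do not come from the book.

```python
import statistics
from collections import Counter

# Hypothetical sample: ages of ten customers (made up for illustration).
ages = [23, 25, 31, 35, 35, 38, 41, 44, 52, 58]

# Characteristic values that summarize the sample.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
stdev_age = statistics.stdev(ages)  # sample standard deviation

# A text histogram: count how many values fall into each 10-year bin.
bins = Counter((age // 10) * 10 for age in ages)


def render_histogram(counts):
    """Return one line per bin, e.g. '20-29 | ##' (hypothetical helper)."""
    return [
        f"{start}-{start + 9} | {'#' * counts[start]}"
        for start in sorted(counts)
    ]


print(f"mean={mean_age:.1f} median={median_age} stdev={stdev_age:.1f}")
for line in render_histogram(bins):
    print(line)
```

Nothing here assumes how the data were generated. The code only describes the values that are present, which is the point of descriptive statistics.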

Inferential statistics, compared with descriptive statistics, provides more rigorous methods that rest on assumptions about the random process that generated the data. Its conclusions are valid only when those assumptions are satisfied.

Generally, the first step of a statistical data analysis is to design an experiment that defines how the data are collected, so that a reliable analysis can be based on those data. In an experimental study, we can control and manipulate the data generation process. In an observational study, by contrast, we cannot control the data generation process.

However, whether the study is experimental or observational, it usually involves independence assumptions, and the data we collect should also be representative. One of the main reasons is that the collected data are used in inferential statistics for hypothesis testing, where we want to confirm or reject a hypothesis about the domain under study.
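As a rough illustration of hypothesis testing, the sketch below runs an exact permutation test on the difference of two group means. The permutation test is chosen here only because it needs nothing beyond the standard library; the book does not prescribe it, and the measurements and the function name permutation_test are made up for this example. The null hypothesis is that both groups come from the same distribution, so the group labels can be exchanged freely.

```python
import statistics
from itertools import combinations

# Hypothetical measurements from two groups (made up for illustration).
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.1]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 6.0]


def permutation_test(a, b):
    """Exact two-sided permutation test on the difference of means.

    Null hypothesis: both groups come from the same distribution, so the
    group labels are exchangeable. Returns the p-value.
    """
    pooled = a + b
    observed = abs(statistics.mean(a) - statistics.mean(b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling len(a) pooled values as "group a".
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(statistics.mean(first) - statistics.mean(second))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


p_value = permutation_test(group_a, group_b)
print(f"p-value = {p_value:.4f}")
# Reject the null hypothesis at the 5% level only if the p-value is below 0.05.
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

For this made-up data, only 2 of the 924 possible relabellings give a difference as large as the observed one, so the null hypothesis is rejected. The conclusion still depends on the assumptions: if the measurements were not independent or not representative, the p-value would say little about the domain.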

 

In terms of data mining:

In contrast to hypothesis testing, exploratory data analysis is concerned with generating hypotheses from the collected data. In exploratory data analysis, there are no assumptions, or only weak model assumptions, about the data generation process. A typical scenario is that we already have the data, and they may not have been collected in the best possible way. It is therefore difficult to make specific assumptions about the data generation process. We are goal-oriented: we ask questions such as "Which customers will bring the highest profit?" and look for methods that can help us answer them.
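A goal-oriented question like the one above can be explored with a simple grouping, as in the sketch below. The customer records, the segment names, and the function mean_profit_by_segment are all hypothetical and serve only to show the idea of looking at existing data first and forming a hypothesis afterwards.

```python
import statistics
from collections import defaultdict

# Hypothetical customer records: (segment, yearly profit). Made up for illustration.
customers = [
    ("student", 120.0), ("student", 90.0), ("family", 410.0),
    ("family", 380.0), ("family", 450.0), ("senior", 260.0),
    ("senior", 300.0), ("student", 150.0),
]


def mean_profit_by_segment(records):
    """Group records by segment and return {segment: mean profit}."""
    groups = defaultdict(list)
    for segment, profit in records:
        groups[segment].append(profit)
    return {segment: statistics.mean(values) for segment, values in groups.items()}


summary = mean_profit_by_segment(customers)
# Rank segments from highest to lowest mean profit.
ranking = sorted(summary, key=summary.get, reverse=True)
for segment in ranking:
    print(f"{segment}: {summary[segment]:.2f}")
```

The ranking that comes out is a hypothesis suggested by the data, not a confirmed result. Whether the top segment really is the most profitable would still have to be checked, for example on newly collected data.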

Data mining techniques: one definition of data mining is the use of powerful tools and techniques to analyze the large amounts of data held in commercial databases, data that were often collected for other purposes. Some people once believed that the right data mining tool could extract any desired knowledge automatically, or with only a little manual intervention. Practical experience shows, however, that every problem is different and that fully automating the data analysis process is almost impossible.

This is how we understand KDD (knowledge discovery in databases): it is an interactive process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining and modeling are only one step in this process.

From the analysis above, we can see the differences between KDD, data mining, and statistics. The essential difference lies in their attitude toward assumptions about the data. What we should do is carry out a sound analysis project based on a standard data analysis process.
