Data Mining Method Series (i) Data exploration

Last Update:2018-09-18 Source: Internet

Author: User


Why do data exploration?
Understanding the type of your data matters in the same way that knowing who you are talking to matters: people adjust how they communicate once they know the other person's gender, and in the same way different types of data call for different handling.
What should you explore? The type of the data and the quality of the data.
Data types fall into two groups: qualitative and quantitative.
Qualitative data, also called categorical data, includes nominal and ordinal types. Nominal is easy to understand: a user ID or a user name is nominal. Names can repeat, but a value generally identifies an individual. Ordinal values, such as {good, very good, super good}, can be ranked: "super good" is a higher degree of good than "good". A scale such as {high, higher, very high} is also ordinal.
Quantitative data, often described as continuous, includes interval and ratio types. Interval values support subtraction. For example, dates are interval data: you can compute the gap between two dates, and this year and last year differ by one year. Ratio values support differences and also ratios. For example, age is a ratio value: a 20-year-old is 10 years younger than a 30-year-old, and you can also compute the mean age.
There are other ways to classify data types, but this one is the basic classification; once you have mastered it, you can handle most situations.
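
To make the four types concrete, here is a minimal pandas sketch. The column names and values are invented for illustration and do not come from any real dataset; the point is only which operations make sense for each type.

```python
import pandas as pd

# Hypothetical toy records, invented for illustration only.
df = pd.DataFrame({
    "user_id": ["u01", "u02", "u03"],                # nominal: identifies an individual
    "rating": ["good", "super good", "very good"],   # ordinal: can be ranked
    "signup": pd.to_datetime(["2017-09-18", "2018-09-18", "2018-03-01"]),  # interval
    "age": [20, 30, 25],                             # ratio
})

# Ordinal: declare the order explicitly so that ranking is meaningful.
levels = ["good", "very good", "super good"]
df["rating"] = pd.Categorical(df["rating"], categories=levels, ordered=True)
best_rating = df["rating"].max()

# Interval: the difference between two dates is meaningful.
days_between = (df.loc[1, "signup"] - df.loc[0, "signup"]).days

# Ratio: differences, ratios and means are all meaningful.
age_gap = df.loc[1, "age"] - df.loc[0, "age"]
age_ratio = df.loc[1, "age"] / df.loc[0, "age"]
mean_age = df["age"].mean()

print(best_rating, days_between, age_gap, age_ratio, mean_age)
```

Note that nothing comparable is computed for user_id: a nominal value can only be tested for equality or counted.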

The main data quality problems are missing attribute values, duplicate objects, outliers, inconsistent data, and erroneous data. These problems have many causes, such as mistakes in manual entry by operators, clerical errors and precision deviations in user input (from misunderstanding a question or from a poorly designed questionnaire), and other problems such as failed sensor collection. At present, very few enterprises start collecting large amounts of data with mining in mind. Usually the data accumulates to a certain volume and only then does the need for mining arise, whether it is driven by the data or by the business. As a result the data may be scattered across various business systems, and missing and inconsistent values are inevitable, so a variety of preprocessing steps are needed to raise the quality of the data to an acceptable level.
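
As a rough sketch of how some of these problems can be detected, the pandas snippet below counts missing values, duplicate rows, and inconsistent codes. The table, the column names, and the allowed code list {"F", "M"} are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical records with deliberately planted problems.
records = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "gender": ["F", "M", "M", "female", None],
    "age": [25, 31, 31, 28, np.nan],
})

# Missing attribute values, counted per column.
missing_per_column = records.isna().sum()

# Duplicate objects: rows that repeat an earlier row exactly.
duplicate_rows = int(records.duplicated().sum())

# Inconsistent data: non-missing values outside the assumed code list.
allowed = {"F", "M"}
bad_code = records["gender"].notna() & ~records["gender"].isin(allowed)
inconsistent = records.loc[bad_code, "gender"].tolist()

print(missing_per_column)
print(duplicate_rows, inconsistent)
```

Detection is only the first step; whether to drop, fill, or recode the flagged rows depends on the business meaning of the field.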

So the question is: how do you do data exploration?
As I said above, you need to explore data types and data quality. Below I use two tools to explore the data: IBM SPSS Modeler, a commercial data mining package, and the Python language.
IBM SPSS Modeler is IBM's data mining tool; it lets you build data mining models by drag and drop. I will not describe how to use it here and will only present the results of the exploration.
This is the exploration of field data types: whether a field is continuous, its value range, and whether it has missing values.
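
For readers without Modeler, a similar per-field overview can be approximated in pandas. This is only a sketch on an invented table, not a reproduction of Modeler's output; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table, invented for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 6.3],
    "petal_count": [3, 3, 3, 3],
})

# One row per field: storage type, value range, and number of missing values.
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "min": df.min(),
    "max": df.max(),
    "missing": df.isna().sum(),
})
print(audit)
```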

Next is the exploration of data quality, divided into descriptive statistics and quality assessment.
Descriptive statistics include indicators such as the graph, data type, minimum, maximum, mean, standard deviation, skewness, whether the values are unique, and the number of valid values.

Quality assessment includes outliers, extremes, completion rate, valid records, null values, empty strings, white space, blank values, and so on.
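
Most of these indicators can also be computed by hand. The sketch below does so in pandas for one numeric and one text column, both invented for this example. The 1.5 * IQR rule used to flag outliers is my own choice for the illustration; it is not necessarily the rule Modeler applies.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, invented for illustration only.
amount = pd.Series([10, 12, 11, 13, 12, 95, np.nan, 11], name="amount")
note = pd.Series(["a", "", "  ", None, "b", "c", "d", "e"], name="note")

# Descriptive statistics for the numeric field (missing values are skipped).
stats = {
    "min": amount.min(),
    "max": amount.max(),
    "mean": amount.mean(),
    "std": amount.std(),
    "skewness": amount.skew(),
    "is_unique": amount.is_unique,
    "valid": int(amount.count()),
}

# Quality assessment: outliers flagged with the 1.5 * IQR rule (an assumption).
q1, q3 = amount.quantile(0.25), amount.quantile(0.75)
iqr = q3 - q1
is_outlier = (amount < q1 - 1.5 * iqr) | (amount > q3 + 1.5 * iqr)
outliers = amount[is_outlier].tolist()

# Completion rate: share of records that are not missing.
completion_rate = amount.notna().mean()

# Text field: null values, empty strings, and white-space-only strings.
null_count = int(note.isna().sum())
non_null = note.dropna()
empty_count = int((non_null == "").sum())
whitespace_count = int(((non_null != "") & (non_null.str.strip() == "")).sum())

print(stats)
print(outliers, completion_rate, null_count, empty_count, whitespace_count)
```

On very small samples a standard-deviation rule can never flag anything, which is why the interquartile rule is used here.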

Modeler is the easiest mining tool to get started with of those I have used so far. Its data processing and the mining algorithms it supports are not the most extensive, and its execution efficiency is not the highest, but it is easy to understand. If licensing is a risk inside your company, or if you have a lot of data and little budget, then use Python.
Python is an open source programming language to which many experts have contributed a large number of modules. We import a module directly and can then use its functions. Although it is a programming language, the cost of learning it is really low, because many functions are ready to use.
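# Import the modules
from sklearn import datasets  # sample datasets from the machine learning library
import pandas as pd  # pandas, used to process the data

iris = datasets.load_iris()
iris_x = iris.data
iris_y = iris.target

# The exploration functions used below exist only on pandas DataFrame and Series
x1 = pd.DataFrame(iris_x)
y1 = pd.Series(iris_y)

# Call the data exploration functions: summary statistics, first rows,
# correlations between features, and correlation of each feature with the target
print(x1.describe(), x1.head(), x1.corr(), x1.corrwith(y1))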

Tools are only tools: they help us work, but they cannot replace our thinking. Keep thinking about what needs to be done and how to improve.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
