Data Mining Method Series (I): Data Exploration

Source: Internet
Author: User

Why explore the data?
Understanding the type of your data is as important as knowing the gender of the person you are talking to. People adjust the way they communicate depending on who they are talking to, and in the same way different types of data call for different treatment.
What should you explore? The types of the data and the quality of the data.
Data types fall into two groups: qualitative and quantitative.
Qualitative data, also called categorical data, include nominal and ordinal types. Nominal data are easy to understand: a user ID or a user name is nominal. Names can repeat, but such a value generally identifies an individual. Ordinal data take values such as {good, very good, super good} that can be ranked: "super good" expresses a higher degree of good than "good". A scale such as {high, very high} is ordinal as well.
Quantitative data, which can also be described as continuous, include interval and ratio types. Interval data support subtraction: with dates you can compute the gap between two dates, so this year and last year differ by one year. Ratio data support both differences and ratios. Age is a ratio type: a 20-year-old is 10 years younger than a 30-year-old, and you can also compute the mean age.
There are other ways to classify data types, but this one is the basic scheme. Once you have mastered it, you can cope with the variations.
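The four types can be illustrated with a minimal pandas sketch. The records, column names, and rating levels below are hypothetical, invented only to show which operations are meaningful for each type.

```python
import pandas as pd

# Hypothetical toy records, invented purely for illustration.
df = pd.DataFrame({
    "user_id": ["u01", "u02", "u03"],                        # nominal
    "rating": ["good", "super good", "very good"],           # ordinal
    "signup_date": pd.to_datetime(
        ["2015-03-01", "2016-03-01", "2016-03-11"]),         # interval
    "age": [20, 30, 40],                                     # ratio
})

# Nominal: only equality checks and counting make sense.
n_users = df["user_id"].nunique()

# Ordinal: declare the order explicitly so that comparisons work.
levels = ["good", "very good", "super good"]
df["rating"] = pd.Categorical(df["rating"], categories=levels, ordered=True)
best = df["rating"].max()
better_than_good = int((df["rating"] > "good").sum())

# Interval: differences are meaningful, ratios are not.
gap_days = (df["signup_date"].iloc[2] - df["signup_date"].iloc[1]).days

# Ratio: differences, ratios, and means are all meaningful.
age_diff = df["age"].iloc[1] - df["age"].iloc[0]
age_ratio = df["age"].iloc[1] / df["age"].iloc[0]
mean_age = df["age"].mean()

print(n_users, best, better_than_good, gap_days, age_diff, age_ratio, mean_age)
```

Declaring the rating as an ordered categorical is one possible design choice: it makes pandas reject comparisons against labels that are not on the scale, which is usually what you want for ordinal data.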

Data quality problems mainly include missing attribute values, duplicate objects, outliers, inconsistent data, and erroneous data. These problems have many causes: mistakes in manual entry by operators, slips and precision deviations in user input (caused by misunderstanding a question or by poor questionnaire design), and failures in sensor collection, among others. Very few enterprises start collecting large amounts of data with mining in mind. Usually the data accumulate to a certain volume first, and only then does the need for mining arise, whether it is driven by the data or by the business. As a result the data may be scattered across various business systems, and missing or inconsistent values are unavoidable. A range of preprocessing steps is needed to raise the quality of the data to an acceptable level.
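The quality problems named above can each be checked with a few lines of pandas. This is a minimal sketch on hypothetical records with deliberately planted problems; the column names, the allowed gender codes, and the 1.5 * IQR outlier rule are assumptions chosen for illustration, not a prescribed method.

```python
import pandas as pd

# Hypothetical records with deliberately planted quality problems.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "M", "female", "F", None, "M"],
    "age": [25, 31, 31, 28, 230, 27, 30],
})

# Missing attribute values: number of nulls in each column.
missing_per_column = df.isna().sum()

# Object duplication: rows that repeat an earlier row exactly.
duplicate_rows = int(df.duplicated().sum())

# Outliers: the common 1.5 * IQR rule, one convention among several.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outlier_ages = df.loc[is_outlier, "age"].tolist()

# Inconsistent data: labels outside the agreed coding scheme (assumed here).
allowed = {"F", "M"}
bad_label = df["gender"].notna() & ~df["gender"].isin(allowed)
inconsistent = df.loc[bad_label, "gender"].tolist()

print(missing_per_column, duplicate_rows, outlier_ages, inconsistent, sep="\n")
```

Whether a flagged value is a genuine error or a rare but valid observation still has to be decided with business knowledge; the checks only point at candidates.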

So the question is: how do you explore the data?
As noted above, you need to explore data types and data quality. Two tools are used here: IBM SPSS Modeler, a commercial data mining package, and the Python language.
IBM SPSS Modeler is IBM's data mining tool; it lets you build mining models by dragging and dropping. How to use it is not covered here; only the exploration results are presented.
The first result shows, for each field, its data type (for example, continuous), its value range, and whether it has missing values.

The next result explores data quality, in two parts: descriptive statistics and quality assessment.
Descriptive statistics include the graph, data type, minimum, maximum, mean, standard deviation, skewness, whether values are unique, the number of valid values, and similar indicators.

Quality assessment includes outliers, extremes, completion rate, valid records, null values, empty strings, white space, blank values, and so on.
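For readers without Modeler, a rough approximation of several of these indicators can be computed with pandas. The `audit` helper below is hypothetical (it is not part of pandas or of Modeler), its column names are assumptions, and it covers only a subset of the indicators listed above.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary loosely modelled on the indicators listed above."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "valid": df.notna().sum(),                 # valid (non-null) records
        "null": df.isna().sum(),                   # null values
        "pct_complete": df.notna().mean() * 100,   # completion rate
        "unique": df.nunique(),                    # distinct non-null values
    })
    # Strings that are empty or contain only white space.
    report["blank_strings"] = df.apply(
        lambda col: col.map(
            lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    # Numeric-only statistics; non-numeric columns are left as NaN.
    report["min"] = numeric.min()
    report["max"] = numeric.max()
    report["mean"] = numeric.mean()
    report["std"] = numeric.std()
    report["skewness"] = numeric.skew()
    return report

# Hypothetical sample data, invented for illustration.
sample = pd.DataFrame({
    "age": [20, 30, None, 40],
    "city": ["Hangzhou", " ", None, "Beijing"],
})
report = audit(sample)
print(report)
```

Each row of the report describes one field of the input, which makes it easy to scan for fields with a low completion rate or suspicious ranges before modeling.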

Of the mining tools I have used so far, Modeler is the best one for getting started. Its data processing and supported mining algorithms are not the strongest, and its execution efficiency is not the highest, but it is easy to understand. If licensing is a risk inside the company, or if the data are big and the budget is small, use Python instead.
Python is an open-source programming language for which many experts have contributed a large number of modules. You import a module and can use its functions directly. Although it is a programming language, the learning cost is really low, and many functions are ready to use.
# Import the modules
from sklearn import datasets  # sample datasets bundled with scikit-learn
import pandas as pd  # pandas, used to process the data

iris = datasets.load_iris()
iris_x = iris.data
iris_y = iris.target

# The exploration functions used below exist only on pandas DataFrame and Series
iris_x1 = pd.DataFrame(iris_x)
iris_y1 = pd.Series(iris_y)

# Call the data exploration functions
print(iris_x1.describe(), iris_x1.head(), iris_x1.corr(), iris_x1.corrwith(iris_y1))

Tools are only tools: they help us work, but they cannot replace our thinking. Keep thinking about what needs to be done and how to improve.
