Input data and ARFF files - Data mining learning and WEKA usage (2)

Source: Internet
Author: User

I personally think that jumping straight into data mining algorithms and WEKA usage is too hasty. When I started out, I studied data mining methods directly. Some of them were difficult and dry, and what I often wondered about was not the method itself, but "What is this?".

Once I started using WEKA, things gradually became clearer, because seeing the input and output gives a very intuitive feel, and learning is far more efficient when this is combined with the techniques themselves.

The input takes three forms: concepts, instances, and attributes.

Concept

Simply put, a concept is the thing to be learned. In classification learning, for example, it is what the set of classified samples illustrates.

The things you need to learn may differ widely, but all of them can be called concepts, and the output is a description of them, that is, the concept description.

Instance

The term "instance" may be unfamiliar, but you can generally think of it as a sample.

The input is usually a set of instances. Each instance is a single, independent example of the concept.

Of course, the most common way to represent instances is a table.

Because of this, some people say that data mining should really be called file mining.

It is true that relational databases can express more complex relationships, but a finite set of finite relations can generally be converted into a single table. If you want to dig deeper, look into denormalization and the spurious regularities it can introduce.
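As a minimal sketch of what converting relations into a single table means, the Python below joins two small relations into flat rows. The `customers` and `orders` data and the `denormalize` helper are invented for illustration and do not come from WEKA or the original article.

```python
# Hypothetical relations, invented for illustration only.
customers = {
    1: {"name": "Ann", "city": "Hamilton"},
    2: {"name": "Bob", "city": "Auckland"},
}
orders = [
    {"customer_id": 1, "item": "tea"},
    {"customer_id": 1, "item": "milk"},
    {"customer_id": 2, "item": "bread"},
]


def denormalize(orders, customers):
    """Join each order with its customer so every row is one flat instance."""
    rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        rows.append({**order, **customer})  # merge both records into one row
    return rows


flat = denormalize(orders, customers)
for row in flat:
    print(row)
```

Note that Ann's city now appears in two rows. A learner could "discover" that the name determines the city, which is a by-product of the join rather than a real finding.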

Attribute

If an instance is a row in the table, then an attribute is a column.

The value of an attribute for a particular instance is a measurement or observation of the quantity that the attribute refers to.

ARFF format

ARFF is WEKA's own file format. The name stands for Attribute-Relation File Format.

It is an ASCII text file that describes a list of instances sharing a set of attributes. The ARFF format was developed by the Computer Science Department of the University of Waikato.

An ARFF file consists of two parts: a header section and a data section.

The header contains the relation name, the attributes, and their types, for example:

 
% 1. Title: Iris Plants Database
%
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%    (c) Date: July, 1988
%
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

Lines starting with % are comments. numeric declares a numeric attribute. The class attribute is nominal: its value can only be one of Iris-setosa, Iris-versicolor, and Iris-virginica. An attribute can also be of type string or date. The data section starts with @data, for example:

@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

A complete ARFF file is as follows:

% 1. Title: Iris Plants Database
%
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%    (c) Date: July, 1988
%
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
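To make the header/data structure concrete, here is a minimal Python sketch of a reader for small dense files like the one above. `parse_arff` is a hypothetical helper written for this article, not part of WEKA, and it deliberately ignores quoted values, sparse rows, and date formats.

```python
def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type) pairs; a nominal type becomes a list of values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True  # everything after this line is instance data
    return relation, attributes, rows


SAMPLE = """\
% Shortened Iris sample
@relation iris
@attribute sepallength numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,Iris-setosa
4.9,Iris-setosa
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Values are kept as strings here. Converting them according to the declared types is a separate step.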

When the dataset is opened in WEKA, the window shows three areas:

Area 1 lists all the attributes.

Area 2 is a simple visual preview.

Area 3 shows descriptive statistics for the data.

Sparse ARFF format

If most of the input instance values are 0, you can consider the sparse format. Its header definition is the same as the ARFF format.

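The only difference is in the data section: values of 0 do not need to be written out.

For example:

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

can be abbreviated as follows, where each entry is an attribute index (counting from 0) followed by its value:

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}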

In practice I do not recommend this format, because it is easy to confuse omitted values with unknown values.

An omitted value means 0, not missing. To mark a value as unknown, write ? explicitly.
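The difference between an omitted value and an unknown one is easier to see when a sparse row is expanded back into a dense one. The sketch below is a simplified illustration: `sparse_to_dense` is a hypothetical helper, it assumes the number of attributes is known from the header, and it does not handle commas inside quoted values.

```python
def sparse_to_dense(line, num_attributes, default="0"):
    """Expand one sparse ARFF row such as '{1 X, 3 Y}' into a full list of values."""
    dense = [default] * num_attributes  # every omitted position is the default
    body = line.strip().strip("{}")
    if not body:
        return dense  # '{}' means nothing was written out
    for entry in body.split(","):
        index, value = entry.strip().split(None, 1)  # "<index> <value>"
        dense[int(index)] = value.strip('"')
    return dense


row_a = sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5)
row_b = sparse_to_dense('{2 W, 4 "class B"}', 5)
row_c = sparse_to_dense('{1 ?, 3 Y}', 5)
print(row_a)
print(row_b)
print(row_c)
```

In the last row, position 1 stays `?` (unknown) while positions 0, 2, and 4 become `0`, which is exactly the distinction that is easy to lose in the sparse format.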

Missing values

Textbook examples usually contain no missing values, but real data does. A missing value is often recorded as a value outside the normal range, for example a negative number where only positive numbers are expected.

What a missing value means needs to be investigated to find its cause, for example a failure of the equipment that collects the data, or survey respondents refusing to answer questions they consider private.

Most algorithms treat a missing value as carrying no particular meaning: it is simply unknown. Yet the fact that a value is missing can often tell us something useful.
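As a rough sketch of handling both conventions mentioned here, the Python below turns `?` entries and out-of-range entries in one column into `None`. The `mark_missing` helper, the survey rows, and the rule that ages must be non-negative are all assumptions made for illustration.

```python
def mark_missing(rows, column, is_valid):
    """Replace '?' and out-of-range entries in one numeric column with None."""
    cleaned = []
    for row in rows:
        row = list(row)  # copy so the input rows are left untouched
        value = row[column]
        if value == "?" or not is_valid(float(value)):
            row[column] = None
        cleaned.append(row)
    return cleaned


# Hypothetical survey rows: [age, answered]
survey = [["34", "yes"], ["-1", "no"], ["?", "yes"]]
cleaned = mark_missing(survey, 0, lambda age: age >= 0)
missing = sum(1 for row in cleaned if row[0] is None)
print(cleaned)
print(missing)
```

Collapsing both cases to `None` matches how most algorithms see missing values, but it throws away the reason a value is missing, so keep the raw column if the cause might matter.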

Inaccurate values

The data used for data mining was usually not collected for data mining, so some bad records and attribute values are to be expected. Some fields did not matter when the data was first collected, but errors in them can have a large impact on data mining.

Because the ARFF format declares a type for each attribute, the data can be validated to some extent, but the most reliable approach is to check it yourself.
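A type check of this kind can be sketched in a few lines of Python. `validate_row` is a hypothetical helper, and it assumes attributes are given as (name, type) pairs where a nominal type is a list of allowed values. Only numeric and nominal attributes are checked.

```python
def validate_row(row, attributes):
    """Return a list of problems found in one row, given (name, type) declarations."""
    if len(row) != len(attributes):
        return [f"expected {len(attributes)} values, got {len(row)}"]
    problems = []
    for value, (name, spec) in zip(row, attributes):
        if value == "?":
            continue  # an unknown value is allowed for any type
        if spec == "numeric":
            try:
                float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not numeric")
        elif isinstance(spec, list) and value not in spec:
            problems.append(f"{name}: {value!r} is not a declared value")
    return problems


attributes = [
    ("sepallength", "numeric"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
]

good = validate_row(["5.1", "Iris-setosa"], attributes)
bad = validate_row(["abc", "iris-setosa"], attributes)
print(good)
print(bad)
```

The second row fails twice: `abc` is not a number, and `iris-setosa` differs from the declared `Iris-setosa` only in case. A check like this catches type errors but not a plausible value that is simply wrong, which is why checking by hand is still needed.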

Duplicate data also deserves attention, because it can sometimes distort the results in unexpected ways.
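Finding exact duplicates is straightforward once the instances are in rows. The following is a minimal sketch using Python's standard library. The rows are made up for illustration, and `find_duplicates` is a hypothetical helper.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each row that occurs more than once, together with its count."""
    counts = Counter(tuple(row) for row in rows)  # tuples are hashable, lists are not
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical rows, invented for illustration.
rows = [
    ["5.1", "3.5", "A"],
    ["4.9", "3.0", "A"],
    ["5.1", "3.5", "A"],
    ["6.0", "2.2", "B"],
]

duplicates = find_duplicates(rows)
print(duplicates)
```

Whether a duplicate is an error or a genuine repeated observation cannot be decided by code. That still requires knowing how the data was collected.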

Take data seriously

Do not just grab a copy of some data and start mining and exploring it. You need to understand the data you are going to process and get a rough picture of it first. For abnormal or suspicious data, talk to the relevant people, repeatedly if necessary, and ask them to explain the anomalies, the missing values, and how the data was classified.

Data pre-processing is tedious and time-consuming, but it is necessary for successful data mining. Some even hold that preparing the input data accounts for about 60% of the effort in a data mining project.
