Weka: classification

Source: Internet
Author: User

1. Weka Introduction

1) Weka is an open-source machine learning and data mining software package developed by the Weka team at the University of Waikato in New Zealand.

2) Related resources

http://sourceforge.net/projects/weka/files/

http://www.cs.waikato.ac.nz/ml/weka/

3) Main Features

    • An integrated data mining tool that combines data preprocessing, learning algorithms (classification, regression, clustering, association analysis), and evaluation methods
    • Has an interactive visual interface
    • Provides an environment for comparing learning algorithms
    • Lets users implement their own data mining algorithms through its interfaces

2. Data set (.arff file)


A data set is presented as a two-dimensional table, where:

    • A row in the table is called an instance (Instance), which is equivalent to a sample in statistics or a record in a database
    • A column in the table is called an attribute (Attribute), which is equivalent to a variable in statistics or a field in a database

A data set is stored as an ASCII text file in ARFF format, which can be divided into two parts:

    • The first part is the header information, which contains the relation declaration and the attribute declarations
    • The second part is the data information, that is, the data in the data set. It starts with the "@data" tag, followed by the instances

Note: Lines that start with "%" are comments, and Weka ignores them.

If a relation name, attribute name, or data string contains spaces, it must be enclosed in quotation marks.

The last declared attribute is called the class attribute. It is the default target variable in classification and regression tasks.
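The structure described above can be illustrated with a small parser. This is only a sketch, not Weka's own reader: the sample relation and attribute names are made up for illustration, and the simple comma split does not handle quoted values that contain commas.

```python
SAMPLE_ARFF = """\
% Comment lines start with a percent sign and are skipped
@relation 'weather demo'
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, 83, yes
"""


def split_name(rest):
    """Split 'name spec' into (name, spec), honouring a quoted name."""
    rest = rest.strip()
    if rest[:1] in ("'", '"'):
        end = rest.index(rest[0], 1)  # position of the closing quote
        return rest[1:end], rest[end + 1:].strip()
    name, _, spec = rest.partition(" ")
    return name, spec.strip()


def parse_arff(text):
    """Return (relation name, [(attribute name, type spec)], data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comments carry no information
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = split_name(rest)[0]
        elif keyword == "@attribute":
            attributes.append(split_name(rest))
        elif keyword == "@data":
            in_data = True  # everything after this tag is data
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
class_attribute = attributes[-1][0]  # the last declared attribute is the class
print(relation, class_attribute, len(rows))
```

The header and data sections are handled separately, and the class attribute is simply taken to be the last declaration, matching the default described above.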

3. Data type

1) Weka supports four data types:

    • Numeric

Numeric values can be integers or real numbers; Weka treats them all as real numbers.

    • Nominal

A nominal attribute is declared by listing its category names inside curly braces.

    • String

A string attribute can contain arbitrary text.

    • Date and time

Date and time attributes are uniformly declared with the "date" type. The default format string is the ISO-8601 combined date and time format, "yyyy-MM-dd'T'HH:mm:ss". A different format string can be given after the type:

e.g. @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

@DATA

"2015-06-23 20:05:40"
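As a rough illustration of how the four types differ, the sketch below converts one raw ARFF cell according to its declared type. The function name and the pattern table are assumptions made for this example; only the single Java-style pattern "yyyy-MM-dd HH:mm:ss" is mapped to Python's strptime syntax, and "?" is treated as a missing value.

```python
from datetime import datetime

# Assumed mapping from one Java-style date pattern to Python's strptime syntax.
DATE_PATTERNS = {"yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S"}


def convert_value(attr_type, raw):
    """Interpret one raw ARFF cell according to its declared attribute type."""
    raw = raw.strip()
    if raw == "?":
        return None  # ARFF marks a missing value with a question mark
    kind = attr_type.strip()
    lower = kind.lower()
    if lower in ("numeric", "real", "integer"):
        return float(raw)  # every numeric value is handled as a real number
    if kind.startswith("{"):
        allowed = [v.strip() for v in kind.strip("{}").split(",")]
        if raw not in allowed:
            raise ValueError(f"{raw!r} is not one of {allowed}")
        return raw
    if lower == "string":
        return raw.strip('"')
    if lower.startswith("date"):
        pattern = kind[4:].strip().strip('"')
        return datetime.strptime(raw.strip('"'), DATE_PATTERNS[pattern])
    raise ValueError(f"unsupported attribute type: {attr_type}")


print(convert_value("numeric", "3"))
print(convert_value("{yes, no}", "yes"))
print(convert_value('DATE "yyyy-MM-dd HH:mm:ss"', '"2015-06-23 20:05:40"'))
```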

2) Sparse data

When a data set contains a large number of 0 values, storing it in sparse format saves space. The sparse format only changes how each instance is written in the data section; the rest of the ARFF file does not need to be modified. Each non-zero value is written as its attribute index (counting from 0) followed by the value. For example:

Standard format:

@data

0, X, 0, Y, "class A"

0, 0, W, 0, "class B"

Sparse format:

@data

{1 X, 3 Y, 4 "class A"}

{2 W, 4 "class B"}
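The conversion from a standard row to a sparse row can be sketched in a few lines. The function name is hypothetical; the two rows are the ones from the example above, with values kept as strings.

```python
def to_sparse(values, zero="0"):
    """Return the sparse ARFF form of one dense row (indices start at 0)."""
    pairs = [f"{index} {value}" for index, value in enumerate(values) if value != zero]
    return "{" + ", ".join(pairs) + "}"


dense_rows = [
    ["0", "X", "0", "Y", '"class A"'],
    ["0", "0", "W", "0", '"class B"'],
]
for row in dense_rows:
    print(to_sparse(row))
```

Only the positions that differ from 0 are written out, which is why the format pays off when most cells are 0.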

4. Data preparation

Convert the source data step by step: .xls to .csv, then .csv to .arff.
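Weka ships its own converters (for example weka.core.converters.CSVLoader), so the following is only a sketch of what the .csv to .arff step involves. It assumes a header row, treats a column as numeric when every value parses as a number, and otherwise declares it nominal; the function names are hypothetical.

```python
import csv
import io


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def quote(value):
    # Names and values that contain spaces must be quoted in ARFF
    return f"'{value}'" if " " in value else value


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds the column names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {quote(relation)}", ""]
    for col, name in enumerate(header):
        column = [row[col] for row in data]
        if all(is_number(v) for v in column):
            spec = "numeric"
        else:
            # Nominal attribute: list the distinct category names in braces
            spec = "{" + ",".join(sorted({quote(v) for v in column})) + "}"
        lines.append(f"@attribute {quote(name)} {spec}")
    lines += ["", "@data"]
    lines += [",".join(quote(v) for v in row) for row in data]
    return "\n".join(lines) + "\n"


print(csv_to_arff("age,city,label\n30,New York,yes\n25,Paris,no\n", "demo"))
```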

5. Classification (Classify)

1) Classification Process

Classification predicts the target attribute of an instance from a set of feature attributes (input variables). To do this, we need a training data set in which both the inputs and the output of every instance are known. By observing the instances in the training set, we can build a classification or regression model. With this model, new, unknown instances can be classified or predicted. A model is judged mainly by the accuracy of its predictions.

2) Prediction example
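The train, predict, and evaluate cycle can be sketched without Weka. The example below deliberately uses a 1-nearest-neighbour classifier rather than a decision tree, because it fits in a few lines; the data points and labels are invented for illustration.

```python
def train(training_set):
    """1-nearest-neighbour 'training' just memorises the labelled instances."""
    return list(training_set)


def predict(model, features):
    """Return the label of the stored instance closest to the given features."""
    def squared_distance(instance):
        return sum((a - b) ** 2 for a, b in zip(instance[0], features))
    return min(model, key=squared_distance)[1]


def accuracy(model, labelled_set):
    """Fraction of labelled instances whose prediction matches the true label."""
    hits = sum(1 for features, label in labelled_set if predict(model, features) == label)
    return hits / len(labelled_set)


# Training set: inputs and output are known for every instance
training_set = [((1.0, 1.0), "j"), ((1.2, 0.8), "j"), ((5.0, 5.0), "m"), ((5.5, 4.5), "m")]
# Held-out set used to measure how accurate the model's predictions are
held_out = [((0.9, 1.1), "j"), ((5.2, 5.1), "m"), ((3.0, 3.2), "j")]

model = train(training_set)
print(predict(model, (0.9, 1.1)))
print(accuracy(model, held_out))
```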

A. Note that the attribute declarations of the test data set and the training data set must be identical. Even if the test data set has no values for the class attribute, you still need to declare this attribute and set its value to the missing value on every instance.
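A quick consistency check before running a model can catch header mismatches early. The sketch below compares the attribute declarations of two ARFF texts; the sample files are invented, and "?" is used as the missing value for the class attribute in the test data.

```python
def attribute_declarations(arff_text):
    """Collect the attribute declarations that appear before the data section."""
    declarations = []
    for raw in arff_text.splitlines():
        line = raw.strip()
        if line.lower().startswith("@data"):
            break
        if line.lower().startswith("@attribute"):
            # Drop the keyword and normalise whitespace before comparing
            declarations.append(" ".join(line.split()[1:]))
    return declarations


def headers_match(train_text, test_text):
    return attribute_declarations(train_text) == attribute_declarations(test_text)


TRAIN = """@relation train
@attribute length numeric
@attribute kind {j, m, c}
@data
3.5, j
"""

# Same declarations; the class value of each instance is missing
TEST = """@relation test
@attribute length   numeric
@attribute kind {j, m, c}
@data
4.1, ?
"""

print(headers_match(TRAIN, TEST))
```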

B. Open the "Simple CLI" module. The command for training with the "J48" algorithm is:

java weka.classifiers.trees.J48 -C 0.25 -M 2 -t "c:\\users\\administrator\\desktop\\project\\data file\\Test Data\\2.3 Reference-correlation analysis + Data transformation.csv.arff" -d "c:\\users\\administrator\\desktop\\project\\data file\\Test Data\\2.3 Reference-correlation analysis + Data transformation.model"

Here "2.3 Reference-correlation analysis + Data transformation.csv.arff" is the training data set. The parameter "-C 0.25" sets the confidence factor used for pruning, and "-M 2" sets the minimum number of instances per leaf. "-t" is followed by the full path of the training data set, and "-d" by the full path where the model is saved.

C. The command format to apply this model to the test data set is:

java weka.classifiers.trees.J48 -p 11 -l "c:\\users\\administrator\\desktop\\project\\data file\\Test Data\\2.3 Reference-correlation analysis + Data transformation.model" -T "c:\\users\\administrator\\desktop\\project\\data file\\Test Data\\3.3 the reference document extracted from the dissertation.csv.arff"

Here "-p 11" tells Weka to output the individual predictions; 11 refers to the 11th attribute, which holds the actual value of the predicted attribute. "-l" is followed by the full path of the saved model, and "-T" by the full path of the test data set.
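When these commands are generated by a script, building them as argument lists avoids most quoting mistakes with paths that contain spaces. The sketch below only assembles the arguments and does not run Java; the file names are hypothetical placeholders, and launching the command outside the Simple CLI would also require Weka on the Java classpath.

```python
import shlex


def j48_train_args(train_arff, model_path, confidence=0.25, min_instances=2):
    """Arguments that train J48 on train_arff and save the model to model_path."""
    return [
        "java", "weka.classifiers.trees.J48",
        "-C", str(confidence),      # confidence factor used for pruning
        "-M", str(min_instances),   # minimum number of instances per leaf
        "-t", train_arff,           # training data set
        "-d", model_path,           # where the trained model is saved
    ]


def j48_predict_args(model_path, test_arff, attribute_range="11"):
    """Arguments that load a saved model and print predictions for test_arff."""
    return [
        "java", "weka.classifiers.trees.J48",
        "-p", attribute_range,      # print the individual predictions
        "-l", model_path,           # previously saved model
        "-T", test_arff,            # test data set
    ]


# Hypothetical file names, used only to show the resulting command lines
print(shlex.join(j48_train_args("train data.arff", "j48.model")))
print(shlex.join(j48_predict_args("j48.model", "test data.arff")))
```

Note that the lower-case "-t" and "-d" belong to training, while the upper-case "-T" and lower-case "-l" belong to prediction; mixing up the case changes the meaning of the option.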

D. After entering the above command, the result appears:

=== Predictions on test data ===

 inst#     actual  predicted  error  prediction ()

     1        1:?        1:j         1

     2        1:?        1:j         1

     3        1:?        2:m         0.667

     4        1:?        2:m         0.667

     5        1:?        3:c         1

     6        1:?        2:m         0.667

The first column is the instance number, the second is the original value of the class attribute in the test data set ("?" means missing), and the third is the predicted result. The error column is empty here because the actual values are unknown. The last column is the confidence of the prediction: for example, instance 1 is predicted to be j with a confidence of 1, that is, 100% certainty.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
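If the predictions need further processing, the text output can be read back into records. This is a sketch that assumes the column layout shown above (instance number, actual, predicted, an optional error mark, and the confidence as the last field); the sample text repeats the first three result rows.

```python
SAMPLE_OUTPUT = """\
=== Predictions on test data ===

 inst#     actual  predicted  error  prediction ()
     1        1:?        1:j         1
     2        1:?        1:j         1
     3        1:?        2:m         0.667
"""


def parse_predictions(text):
    """Turn the prediction listing into a list of dictionaries."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip the banner, the column header, and blank lines
        results.append({
            "instance": int(parts[0]),
            # Values look like "index:label"; keep only the label
            "actual": parts[1].split(":", 1)[1],
            "predicted": parts[2].split(":", 1)[1],
            # The confidence is the last field, whether or not an error mark is present
            "confidence": float(parts[-1]),
        })
    return results


for record in parse_predictions(SAMPLE_OUTPUT):
    print(record)
```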
