Use WEKA for Data Mining


1. Introduction

The terms "data mining" and "machine learning" sound intimidating to many people. It is true that implementing or optimizing algorithms requires a lot of background knowledge. In practice, however, the vast majority of data mining engineers do not work at the algorithm level; their focus is on feature extraction, algorithm selection, and parameter tuning. A tool that provides these functions is therefore very useful, and WEKA is one of the best data mining tools available.

WEKA stands for the Waikato Environment for Knowledge Analysis. It is a free, non-commercial, open-source machine learning and data mining tool based on Java; both the software and its source code can be downloaded from the official website. Incidentally, the weka is also a flightless bird found only in New Zealand, and WEKA's main developers come from the University of Waikato in New Zealand. (This paragraph is adapted from the Baidu encyclopedia.)

WEKA provides data processing, feature selection, classification, regression, clustering, association rules, and visualization. This article gives a brief introduction to WEKA and walks through the workflow with simple examples. It covers only the graphical interface; the command line and the Java API are not discussed.

2. Installation

The official WEKA site is http://www.cs.waikato.ac.nz/ml/weka/. Click the Download link on the left to reach the download page, which offers versions for Windows, Mac OS, Linux, and other platforms. This article uses Windows as the example; the current stable version is 3.6.

If Java is not installed on your machine, choose a version bundled with a JRE. The download is an EXE installer; double-click it to install.

After installation completes, launch WEKA from the shortcut. If you see the following interface, congratulations, the installation succeeded.

Figure 2.1 WEKA startup page

There are four applications listed on the right side of the window:

1) Explorer

An environment for interactive data exploration and mining. It provides classification, clustering, association rules, feature selection, and data visualization. (In WEKA's own words: "An environment for exploring data with WEKA.")

2) Experimenter

An environment for comparing different learning schemes on data sets. ("An environment for performing experiments and conducting statistical tests between learning schemes.")

3) KnowledgeFlow

Offers much the same functionality as the Explorer, but through a different interface: experiments are laid out by drag and drop. It also supports incremental learning. ("This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.")

4) Simple CLI

A simple command-line interface. ("Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.")

3. Data Format

WEKA supports many file formats, including ARFF, XRFF, CSV, and even LibSVM's format. Of these, ARFF is the most common.

ARFF stands for Attribute-Relation File Format. The following is an example of an ARFF file.

%
% ARFF file example
%
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2.5,?,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'

This example comes from the labor.arff file in the data directory of the WEKA installation. The data set describes Canadian labor negotiations: given attributes of a proposed contract, the task is to predict whether the final negotiation outcome is good or bad.

In the file, comments start with "%". The rest is divided into two parts: the header and the data.

In the header, the line starting with "@relation" gives the relation name and must be the first (non-comment) line of the file. Its format is

@relation <relation-name>

Lines starting with "@attribute" declare features, in the format

@attribute <attribute-name> <datatype>

attribute-name is the feature's name, followed by its data type. Common data types include:

1) numeric: a number, either integer or real.

2) nominal: effectively an enumerated type; the feature takes values from a finite set, whose members may be strings or numbers.

3) string: the value can be an arbitrary string.

Everything after "@data" is the actual data. Each row is one instance and can be viewed as a feature vector; the values are comma-separated, in the same order as the attribute declarations in the header. In supervised classification, the last column is the class label.

A missing feature value is written as "?".
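ARFF is simple enough to parse by hand. As a rough illustration of the structure described above (this is not WEKA's actual parser, and quoting rules are simplified), a minimal sketch:

```python
def parse_arff(text):
    """Minimal ARFF reader: returns (relation, attributes, rows).
    Treats '?' as a missing value (None). For illustration only:
    quoting and escaping are handled naively."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1].strip("'")
        elif lower.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1].strip("'"))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            values = [v.strip().strip("'") for v in line.split(',')]
            rows.append([None if v == '?' else v for v in values])
    return relation, attributes, rows

example = """
% comment
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'class' {'bad','good'}
@data
1,'good'
?,'bad'
"""
rel, attrs, rows = parse_arff(example)
print(rel, attrs, rows)
```

Running this on the two-attribute excerpt yields the relation name, the attribute list, and the data rows with the "?" turned into a missing-value marker.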

4. Data Mining Process

The process of using WEKA for data mining is as follows:

Figure 4.1 Data Mining Flowchart

Data preprocessing, training, and verification are performed in WEKA.

1) data preprocessing

Data preprocessing includes Feature Selection, feature value processing (such as normalization), and sample selection.

2) Training

Training includes algorithm selection, parameter adjustment, and model training.

3) Verification

Verify the model results.

The rest of this article follows this process, using classification as the running example of how to mine data with WEKA.

5. Data Preprocessing

Open the Explorer and click "Open file". Under the WEKA installation directory, select the labor.arff file in the data directory; the following page is displayed. The interface is divided into seven areas, whose functions are described below.

Figure 5.1 Explorer Page

1) Area 1 contains six tabs for selecting the different data mining function panels. From left to right they are Preprocess, Classify, Cluster, Associate, Select attributes, and Visualize.

2) Area 2 provides functions for opening, saving, and editing files. A file can be opened not only locally but also from a URL or a database. The Generate button synthesizes data using one of several built-in generators. Clicking Edit opens the following page:

Figure 5.2 ARFF Viewer

On this page you can inspect the value of every cell. Right-clicking a column name reveals a number of quite practical data editing functions.

3) Area 3 is labeled Filter. The name may suggest the filter methods of feature selection, but WEKA's filters are broader: they provide a large and powerful set of operations on both features and instances.

4) Area 4 shows the current features and sample information, and lets you select and delete features.

5) When a feature is selected in area 4, area 5 displays its summary statistics, including the minimum, maximum, mean, and standard deviation.

6) Area 6 provides visualization. After a feature is selected, this area shows the distribution of its values across intervals, with the class labels drawn in different colors.

7) Area 7 is the status bar. When no task is running, the Weka bird sits down; while a task runs, it stands up and moves about. If the bird is standing but not moving, the task has failed.

The following cases illustrate what filters can do.

Click the Choose button under Filter to see the following interface:

Figure 5.3 filter method selection page

Filters fall into two categories: supervised and unsupervised. The supervised methods require class labels; the unsupervised ones do not. Within each category, attribute means the filter operates on features, while instance means it selects samples.

1) Case 1: feature value normalization

This function is independent of the class label and operates on attributes, so select Normalize under unsupervised -> attribute. Click the text field showing the filter name; the following page appears. Several parameters can be set in the left window, and clicking More opens the window on the right, which documents the filter in detail.

Figure 5.4 normalization parameter setting page

Keep the default parameters and click OK to return to the main window. Select one or more features to normalize in area 4 and click Apply. In the visualization area you can see the feature values rescaled from the range [1, 3] to [0, 1].
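By default, Normalize performs linear min-max scaling into [0, 1]. The same arithmetic in a short sketch (plain Python, not WEKA code; the sample values are hypothetical apart from the 1-to-3 range mentioned above):

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into [0, 1], passing missing
    values (None) through unchanged."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if v is None else ((v - lo) / span if span else 0.0)
            for v in values]

# 'duration' ranges over 1..3, so 1 -> 0.0, 2 -> 0.5, 3 -> 1.0.
print(min_max_normalize([1, 2, None, 3]))  # -> [0.0, 0.5, None, 1.0]
```

The `span == 0` guard handles a constant column, which would otherwise divide by zero.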

2) Case 2: supervised feature selection

This function depends on the class label. Select AttributeSelection under supervised -> attribute. It has two options: evaluator, the method for scoring feature sets, and search, the method for searching through them. Here we use InfoGainAttributeEval as the evaluator and Ranker as the search, which ranks the features by their information gain with respect to the class. In Ranker you can set a threshold; features scoring below it are discarded.

Figure 5.7 feature selection parameters

Click Apply. The features in area 4 are reordered by rank, and those below the threshold are deleted.
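InfoGainAttributeEval scores each feature by how much it reduces the entropy of the class. The underlying formula can be sketched for a nominal feature as follows (toy data, not WEKA's implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """H(class) - H(class | feature) for one nominal feature."""
    n = len(labels)
    cond = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# A perfectly predictive feature gains the full class entropy (1 bit
# here); an uninformative one gains nothing.
labels = ['bad', 'bad', 'good', 'good']
print(info_gain(['a', 'a', 'b', 'b'], labels))  # -> 1.0
print(info_gain(['a', 'b', 'a', 'b'], labels))  # -> 0.0
```

Ranker simply sorts the features by this score and cuts at the threshold.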

3) Case 3: selecting misclassified samples with a classifier

Select RemoveMisclassified under unsupervised -> instance. It has six parameters: classIndex sets the class attribute, and classifier selects the classifier; here we choose the J48 decision tree. Setting invert to true keeps the misclassified samples instead of removing them. numFolds sets the cross-validation used to obtain the predictions. After setting the parameters, click Apply; the number of samples drops from 57 to 7.

Figure 5.10 parameter settings
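The idea behind RemoveMisclassified, validating a classifier on the data and dropping (or, with invert, keeping) the instances it gets wrong, can be sketched with a deliberately simplified stand-in classifier. Here a leave-one-out majority-class predictor replaces the J48 that WEKA would use, and the data is made up:

```python
from collections import Counter

def majority_class(labels):
    """The most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def keep_misclassified(samples, labels):
    """Leave-one-out evaluation with a majority-class predictor:
    keep only the instances the predictor misclassifies
    (the invert=true behavior)."""
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = labels[:i] + labels[i + 1:]          # train on the others
        if majority_class(rest) != y:               # misclassified -> keep
            kept.append((x, y))
    return kept

data = ['s1', 's2', 's3', 's4', 's5']
labels = ['good', 'good', 'good', 'good', 'bad']
print(keep_misclassified(data, labels))  # -> [('s5', 'bad')]
```

The lone 'bad' instance is the only one a majority-class predictor gets wrong, so it is the only one retained.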

6. Classification

Open the Classify tab in the Explorer. The interface is divided into several areas:

1) Classifier

Click the Choose button to pick one of the classifiers WEKA provides. Common classifiers include:

a) NaiveBayes and BayesNet under bayes.

b) LibLINEAR and LibSVM (both require extension packages), logistic regression, and linear regression under functions.

c) IB1 (1-NN) and IBk (kNN) under lazy.

d) Many boosting and bagging classifiers under meta, such as AdaBoostM1.

e) J48 (WEKA's C4.5 implementation) and RandomForest under trees.

2) Test options

Four options are available for evaluating the model's performance.

a) Use training set: evaluate on the training data itself, i.e. train and test on the same data. This method is generally not used.

b) Supplied test set: evaluate on a test set loaded from a local file or URL. The test file must have the same format as the training file.

c) Cross-validation: the most common verification method. n-fold cross-validation splits the training set into n parts; each round trains on n-1 parts and tests on the remaining one. After n rounds, every instance has been tested exactly once, and the results are aggregated into an overall score.

d) Percentage split: divide the data into two parts by a given ratio, one for training and one for testing.

Below these verification methods is a More options button for configuring the model output and validation parameters.
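Option c) above can be sketched as an index-splitting exercise: every instance lands in exactly one test fold, so each one is predicted exactly once. A minimal sketch (WEKA additionally randomizes and stratifies the folds by class):

```python
def kfold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds disjoint test folds
    and return the (train, test) index pairs."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)        # round-robin assignment
    # Each pair: test = one fold, train = all remaining indices.
    return [(sorted(set(range(n_samples)) - set(f)), f) for f in folds]

splits = kfold_indices(6, 3)
print(splits)
```

With 6 samples and 3 folds, each pair trains on 4 indices and tests on the other 2, and the three test folds together cover every index once.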

3) Result List

This area stores the history of classification experiments. Right-clicking a record reveals many options, commonly including saving or loading a model and various visualizations.

4) Classifier output

The classifier's output. With the default options it contains: Run information, summarizing the features, samples, and verification setup; Classifier model, showing the learned model's parameters (what is displayed depends on the classifier); and then the verification results under common criteria such as precision, recall, true positive rate, false positive rate, F-measure, and ROC area. Finally, the Confusion matrix shows how the test samples were classified, making it easy to see how many were classified correctly or incorrectly.
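All of these criteria derive from the confusion matrix. For a two-class problem the relationships are as follows (illustrative Python with an invented matrix, not WEKA output):

```python
def binary_metrics(tp, fn, fp, tn):
    """Precision, recall (= true positive rate), false positive rate,
    and F-measure for the positive class of a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # a.k.a. TP rate
    fp_rate = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fp_rate, f_measure

# Hypothetical counts: 8 true positives, 2 false negatives,
# 4 false positives, 6 true negatives.
p, r, fpr, f1 = binary_metrics(8, 2, 4, 6)
print(p, r, fpr, f1)
```

Here precision is 8/12, recall 8/10, false positive rate 4/10, and F-measure the harmonic mean of precision and recall.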

Case 1: classifying the labor data with J48

1) Open the labor.arff file and switch to the Classify panel.

2) Select trees -> J48 as the classifier, keeping the default parameters.

3) Keep the default cross-validation test option, click More options, and check Output predictions.

4) Click Start to run the experiment.

5) The results appear in the Classifier output area on the right.

Figure 6.1 run information

This lists the classifier used and its parameters, the experiment name, the number of samples and features, the features used, and the test mode.

Figure 6.2 Model Information

This shows the generated decision tree along with its number of leaves, total number of nodes, and the model training time. If the text form is not intuitive, right-click the experiment in the result list and choose Visualize tree to view the decision tree graphically.

Figure 6.3 Decision Tree

Next comes the prediction output: for each test sample, its actual class, predicted class, whether it was misclassified, and the prediction probability.

Figure 6.4 prediction result

Then come the verification results. The overall accuracy is 73.68%; for the bad class, precision is 60.9% and recall 70.0%; for the good class, precision is 82.4% and recall 75.7%.

Figure 6.5 model performance evaluation results
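As a sanity check, those percentages are mutually consistent: with 57 instances (20 bad, 37 good), the stated recalls imply a confusion matrix of about [[14, 6], [9, 28]], from which every reported figure can be recomputed. Note that this matrix is reconstructed from the text, not copied from WEKA's output:

```python
# Reconstructed confusion matrix for the labor run described above:
# rows = actual class, columns = predicted class, order = [bad, good].
cm = [[14, 6],    # 14 bad classified correctly, 6 bad predicted as good
      [9, 28]]    # 9 good predicted as bad, 28 good classified correctly

correct = cm[0][0] + cm[1][1]
total = sum(sum(row) for row in cm)
accuracy = correct / total                          # 42 / 57 ~ 73.68%
precision_bad = cm[0][0] / (cm[0][0] + cm[1][0])    # 14 / 23 ~ 60.9%
recall_bad = cm[0][0] / (cm[0][0] + cm[0][1])       # 14 / 20 = 70.0%
precision_good = cm[1][1] / (cm[1][1] + cm[0][1])   # 28 / 34 ~ 82.4%
recall_good = cm[1][1] / (cm[1][1] + cm[1][0])      # 28 / 37 ~ 75.7%
print(accuracy, precision_bad, recall_bad, precision_good, recall_good)
```

The arithmetic reproduces all five percentages quoted above.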

7. Visualization

Open the Visualize panel of the Explorer to see a matrix of two-dimensional plots. Its rows and columns are the features (including the class label); the cell at row i, column j shows the joint distribution of features i and j in the plane. Each point is a sample, with the classes drawn in different colors.

Below the matrix are several options: PlotSize adjusts the image size, PointSize the size of the sample points, and Jitter the spread between points; when points overlap too heavily, increasing the jitter separates them.

Figure 7.1 plot matrix Two-Dimensional diagram

This plot shows duration against class. Duration is evidently not a good feature: the distributions of good and bad are similar in every interval.

Clicking a cell opens another window showing the same two-feature plot, with one difference: here, clicking a sample point displays that sample's details.

Visualization can also be used to inspect misclassified samples, which is very practical. After running a classification, right-click the record in the result list and choose Visualize classifier errors; the following window appears.

Figure 7.2 visualization of mis-segmentation Samples

In this window, crosses are correctly classified samples and squares are misclassified ones; the X axis is the actual class and the Y axis the predicted class, with blue for actual bad and red for actual good. A blue square is therefore a sample that is actually bad but was misclassified as good, and a red square one that is actually good but was misclassified as bad. Clicking these points shows the sample's feature values, which helps in analyzing why it was misclassified.

Next is an even more practical function: right-click the record in the result list, choose Visualize threshold curve, and select a class.

Figure 7.3 threshold curve

This figure shows how the classification performance criteria trade off as the threshold on the classifier's confidence varies. With false positive rate on the X axis and true positive rate on the Y axis, it is exactly the ROC curve, and the color scale makes it easy to read off the threshold at each point. If the X and Y axes are instead set to precision and recall, the graph can be used to trade the two off against each other and choose a suitable threshold.
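The ROC points on the threshold curve arise as follows: sweep a threshold over the classifier's confidence scores and, at each threshold, record the false positive and true positive rates. A minimal sketch with invented scores:

```python
def roc_points(scores, labels, thresholds):
    """For each threshold t, classify score >= t as positive and
    record (false positive rate, true positive rate)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical confidence scores for 4 instances (label 1 = positive).
scores = [0.9, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 0]
print(roc_points(scores, labels, [0.0, 0.5, 1.0]))
# -> [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

Lowering the threshold moves a point up and to the right along the curve: more true positives, but eventually more false positives as well.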

Other visualization functions are not described in detail.

8. Summary

This article has introduced only part of WEKA's Explorer interface. For the Explorer's other functions, such as clustering, association rules, and feature selection, and for the Experimenter and KnowledgeFlow interfaces, see the official WEKA documentation.

In addition, WEKA supports extension packages, which make it easy to pull in open-source tools such as LibLINEAR and LibSVM.

Under Linux, experiments can also be run from the WEKA command line; for details, again see the official WEKA documentation.
