Weka Usage Introduction
1. Introduction
Data mining and machine learning strike some people as fields with a very high barrier to entry. Admittedly, implementing or even optimizing the algorithms does require a lot of background knowledge. The fact is, though, that the vast majority of data mining engineers never need to work at the algorithm level; their energy is concentrated on feature extraction, algorithm selection, and parameter tuning. What they need is a tool that conveniently provides these functions, and Weka is a leader among data mining tools.
Weka's full name is the Waikato Environment for Knowledge Analysis: a free, non-commercial, open-source machine learning and data mining package based on Java. Both the software and its source code can be downloaded from its official website. Interestingly, WEKA is also the name of a bird unique to New Zealand, while Weka's main developers happen to come from the University of Waikato in New Zealand. (This paragraph is adapted from Baidu Encyclopedia.)
Weka provides data processing, feature selection, classification, regression, clustering, association rules, visualization, and more. This article gives a brief introduction to Weka and, through a simple example, walks you through the process of using it. Only the graphical interface is covered here, not the command line or the code level.
2. Installation
Weka's official address is http://www.cs.waikato.ac.nz/ml/weka/. Click the download link in the left-hand bar to reach the download page, which offers builds for Windows, Mac OS, Linux, and other platforms; we take the Windows version as an example. The current stable version is 3.6.
If the machine does not have Java installed, choose the version bundled with the JRE. The download is an EXE installer; double-click it to install.
After installation, open the Weka shortcut from the Start menu. If you see the following interface, then congratulations, the installation succeeded.
Figure 2.1 Weka Start-up interface
There are four applications on the right side of the window:
1) Explorer
An environment for exploring data with Weka; it provides classification, clustering, association rules, feature selection, and data visualization.
2) Experimenter
An environment for performing experiments and conducting statistical tests between learning schemes.
3) KnowledgeFlow
Supports essentially the same functions as the Explorer, but with a drag-and-drop interface for laying out an experiment. One advantage is that it supports incremental learning.
4) SimpleCLI
A simple command-line interface that allows direct execution of Weka commands, for operating systems that do not provide their own command line.
3. Data format
Weka supports a wide range of file formats, including ARFF, XRFF, CSV, and even LIBSVM's format. Of these, ARFF is the most common, and it is the only one introduced here.
The full name of ARFF is Attribute-Relation File Format. The following is an example of an ARFF file.
%
% Arff File Example
%
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'
This example comes from the labor.arff file in the data directory under the Weka installation directory. It is derived from Canadian labor negotiation cases, and the task is to predict the final outcome of a negotiation from the attributes of the settlement.
In the file, lines beginning with "%" are comments. The rest can be divided into two parts: the header information and the data information.
In the header, the line beginning with "@relation" gives the relation name; it must be the first line of the file (not counting comments). Its format is
@relation <relation-name>
Lines beginning with "@attribute" declare features, with the format
@attribute <attribute-name> <datatype>
attribute-name is the name of the feature, followed by its data type. The common data types are:
1) numeric: a numeric type, covering both integer and real values.
2) nominal: can be considered an enumeration type, i.e. the feature values form a finite set; they may be strings or numbers.
3) string: the value can be any string.
The section starting with "@data" is the actual data. Each row represents one instance and can be thought of as a feature vector. The order of the features corresponds to the attributes in the header, and the values are separated by commas. In supervised classification, the last column is the label.
If a feature value is missing, use "?" in its place.
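Although this article covers only the graphical interface, it may help to see how the same file is read programmatically. Below is a minimal, illustrative sketch using Weka's Java API (it assumes weka.jar is on the classpath and labor.arff is in the working directory; the class name is made up for the example):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class LoadArff {
    public static void main(String[] args) throws Exception {
        // DataSource understands ARFF as well as CSV, XRFF and other formats
        Instances data = DataSource.read("labor.arff");
        // For supervised classification, the last attribute is the class label
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
    }
}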
4. Data mining process
The process of using Weka for data mining is as follows
Figure 4.1 Data Mining flowchart
Of these steps, the three performed in Weka are data preprocessing, training, and validation.
1) Data preprocessing
Data preprocessing includes feature selection, feature value processing (such as normalization), sample selection, and other operations.
2) Training
Training includes algorithm selection, parameter adjustment, model training.
3) Verification
Validates the results of the model.
The rest of this article follows this process as its main line, takes classification as the example, and describes the steps of data mining with Weka.
5. Data preprocessing
Open the Explorer interface and click "Open file". In the Weka installation directory, select the labor.arff file in the data directory, and you will see the following interface. We divide the whole window into 7 regions, each described below.
Figure 5.1 Explorer Interface
1) Region 1 has six tabs for selecting the different data mining panels. From left to right, they are Preprocess (preprocessing), Classify (classification), Cluster (clustering), Associate (association rules), Select attributes (feature selection), and Visualize (visualization).
2) Region 2 provides the ability to open, save, and edit files. Files can be opened not only from local disk but also from URLs and databases as data sources. The Generate button provides data generation; Weka ships with several data generators. Click Edit to see the following interface:
Figure 5.2 Arff Viewer
In this interface you can see the value of every column in every row. Right-clicking a column name brings up a number of data editing functions, which are quite practical.
3) Region 3 is the Filter area. Some readers may associate it with the filter methods of feature selection; in fact, Filter provides many operations on both features (attributes) and samples (instances), and is very powerful.
4) In Region 4 you can see the current features and sample information, and it provides feature selection and deletion.
5) After you select a single feature in Region 4 with the mouse, Region 5 displays information about that feature, including the minimum, maximum, mean, and standard deviation.
6) Region 6 provides visualization: for the selected feature it displays the distribution of values over intervals, with the different class labels shown in different colors.
7) Region 7 is the status bar. When no task is running, the bird sits; while a task runs, the bird stands up and sways back and forth. If the bird is standing but not moving, the task has a problem.
The following introduces the filter functions through a few examples.
Click the Choose button under Filter to see the following screen:
Figure 5.3 Filter Method selection interface
Filters fall into two broad categories: supervised and unsupervised. Methods under supervised require a class label, while those under unsupervised do not. The attribute subcategory filters features, and the instance subcategory selects samples.
Case 1: Normalizing feature values
This operation is unrelated to the class and works on attributes, so we choose normalize under unsupervised -> attribute. Open the Normalize parameter area and you will see the following interface. On the left side of the window are several parameters to set; click More to open the window on the right, which documents the function in detail.
Figure 5.4 Normalization of the parameter settings interface
Using the default parameters, click OK to return to the main window. In Region 4, select the features to be normalized (one or more), then click Apply. In the visualization area we can see that feature values ranging from 1 to 3 have been normalized to the range 0 to 1.
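For reference, the same normalization can be performed through the Java API. A minimal sketch, with the file path assumed (note that, applied this way, the Normalize filter rescales all numeric attributes):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        Normalize norm = new Normalize(); // scales each numeric attribute into [0, 1]
        norm.setInputFormat(data);        // must be called before filtering
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.toSummaryString());
    }
}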
Case 2: Supervised feature selection
This operation depends on the class, so select AttributeSelection under supervised -> attribute. Its interface has two options: evaluator, the method for evaluating the usefulness of features, and search, the method for searching the feature set. Here we use InfoGainAttributeEval as the evaluator and Ranker as the search, which means the features will be ranked by their information gain. A threshold can be set in Ranker; features scoring below the threshold will be discarded.
Figure 5.7 Feature Selection parameters
Click Apply, and you can see in Region 4 that the features have been reordered; those below the threshold have been deleted.
Case 3: Selecting the samples a classifier misclassifies
Select RemoveMisclassified under unsupervised -> instance and you will see 6 parameters: classIndex sets the class label; classifier selects the classifier, here we choose the J48 decision tree; for invert we choose true, which keeps the misclassified samples; numFolds sets the number of folds for cross-validation. After setting the parameters, click Apply, and you can see the number of samples drop from 57 to 7.
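The equivalent of this feature selection in the Java API might look like the sketch below (the threshold 0.0 is an arbitrary example value; Ranker's default threshold keeps all features):
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
public class InfoGainDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        data.setClassIndex(data.numAttributes() - 1); // supervised: the class label is required

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval()); // score features by information gain
        Ranker search = new Ranker();
        search.setThreshold(0.0); // discard features whose score falls below this value
        filter.setSearch(search);

        filter.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Features kept: " + (reduced.numAttributes() - 1));
    }
}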
Figure 5.10 Parameter settings
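A sketch of the same operation in the Java API, with parameter values mirroring the GUI settings described above (numFolds is left at 0 here, which, as I understand the filter's documentation, disables the cross-validation cleansing):
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveMisclassified;
public class MisclassifiedDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RemoveMisclassified rm = new RemoveMisclassified();
        rm.setClassifier(new J48()); // the classifier whose errors we want to inspect
        rm.setClassIndex(-1);        // -1 = use the last attribute as the class
        rm.setNumFolds(0);           // 0 = no cross-validation; classify the training data directly
        rm.setInvert(true);          // keep only the misclassified samples
        rm.setInputFormat(data);

        Instances errors = Filter.useFilter(data, rm);
        System.out.println("Misclassified samples: " + errors.numInstances());
    }
}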
6. Classification
In the Explorer, open the Classify tab. The interface is divided into several regions:
1) Classifier
Click the Choose button to select one of the classifiers Weka provides. Commonly used classifiers include:
a) NaiveBayes (naive Bayes) and BayesNet (Bayesian belief network) under bayes.
b) LibLINEAR and LibSVM (these two require extension packages), Logistic (logistic regression), and LinearRegression (linear regression) under functions.
c) IB1 (1-NN) and IBk (kNN) under lazy.
d) Many boosting and bagging classifiers under meta, such as AdaBoostM1.
e) J48 (Weka's version of C4.5) and RandomForest under trees.
2) Test options
There are four options for evaluating a model's performance:
a) Use training set: test on the same data used for training, i.e. the training set and test set are identical. This method is generally not used.
b) Supplied test set: specify a separate test set, either a local file or a URL. The format of the test file must be consistent with that of the training file.
c) Cross-validation: a very common validation method. N-fold cross-validation divides the data into N parts, trains on N-1 of them, tests on the remaining one, repeats this N times in rotation, and aggregates the results.
d) Percentage split: split the data into two parts by a given percentage, one for training and one for testing.
Below these validation methods there is a More options button for configuring the model output and validation parameters.
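These test options also have counterparts in the Java API. As an illustration only, here is a sketch of a percentage split (66% training, the remainder for testing; the seed and ratio are arbitrary example values):
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class SplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        data.setClassIndex(data.numAttributes() - 1);

        data.randomize(new Random(1)); // shuffle before splitting
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train); // train only on the training portion

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test); // evaluate on the held-out portion
        System.out.println(eval.toSummaryString());
    }
}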
3) Result list
This area holds the history of classification experiments. Right-clicking a record brings up many options, including saving or loading models and various visualizations.
4) Classifier output
The output of the classifier. With the default output options, Run information gives summary information about the features, the samples, and the validation mode; Classifier model gives the model and its parameters, with different classifiers outputting different information. At the bottom are the model validation results, using common criteria such as precision, recall, true positive rate, false positive rate, F-measure, and ROC area. The confusion matrix shows how the test samples were classified, making it easy to see how many samples of each class were classified correctly or incorrectly.
Case 1: Classifying the labor file with J48
1) Open the labor.arff file and switch to the Classify panel.
2) Select the trees -> J48 classifier and use the default parameters.
3) For Test options, keep the default 10-fold cross-validation; open More options and tick Output predictions.
4) Click the Start button to start the experiment.
5) In the Classifier output area on the right, we see the results of the experiment.
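As an aside, the experiment in steps 1 to 5 could be reproduced in code roughly as follows. This is a sketch rather than part of the GUI workflow, and the random seed is an arbitrary choice:
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class J48LaborDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48(); // default parameters, as in the GUI
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold cross-validation

        System.out.println(eval.toSummaryString());      // overall accuracy and error measures
        System.out.println(eval.toClassDetailsString()); // per-class precision, recall, F-measure
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}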
Figure 6.1 Run Information
The run information gives the classifier used in the experiment and its parameters, the relation name, the number of samples, the number of features, the feature list, and the test mode.
Figure 6.2 Model information
The model information gives the decision tree in text form, along with the number of leaves, the size of the tree, and the time taken to train the model. If this text form is not intuitive, right-click the experiment's record in the result list and click Visualize tree to see the decision tree rendered graphically, which is very intuitive.
Figure 6.3 Decision Tree
Further down are the prediction results, where you can see each sample's actual class, predicted class, whether it was misclassified, and the prediction probability.
Figure 6.4 Prediction Results
At the bottom are the validation results: the overall accuracy is 73.68%; the bad class has a precision of 60.9% and a recall of 70.0%, and the good class has a precision of 82.4% and a recall of 75.7%.
Figure 6.5 Model effect evaluation results
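These figures fit together. The labor data has 57 samples, 20 bad and 37 good; a confusion matrix of 14 bad predicted as bad, 6 bad as good, 9 good as bad, and 28 good as good (inferred here from the reported rates, not read from the screenshot) yields bad precision 14/(14+9) = 60.9%, bad recall 14/20 = 70.0%, good precision 28/(28+6) = 82.4%, good recall 28/37 = 75.7%, and overall accuracy (14+28)/57 = 73.68%.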
7. Visualization
Open the Explorer's Visualize panel. At the top is a matrix of two-dimensional plots whose rows and columns are all the features (including the class label); the cell at row i, column j shows how features i and j are distributed on a two-dimensional plane. Each point on a plot is a sample, and different classes are drawn in different colors.
Below it are a few options: PlotSize adjusts the size of each plot, PointSize adjusts the size of the sample points, and Jitter adjusts the spread between points; when points are too concentrated, increasing the jitter disperses them.
Figure 7.1 Plot matrix two-dimensional diagram
Figure 7.1 shows the plot for the duration and class features. It can be seen that duration is not a good feature: within each of its value intervals, good and bad are distributed similarly.
Clicking a plot in the matrix pops up another window giving the distribution of the same two features; the difference is that here you can click a sample point to pop up that sample's details.
Visualization can also be used to inspect misclassified samples, which is a very useful feature. After classification, right-click the record in the result list and select Visualize classifier errors, and the following window pops up.
Figure 7.2 Visualization of misclassified samples
In this window, the crosses are correctly classified samples and the squares are misclassified ones; the x-axis is the actual class, the y-axis is the predicted class, blue means actually bad, and red means actually good. Thus the blue squares are samples that are actually bad but classified as good, and the red squares are samples that are actually good but misclassified as bad. Clicking these points lets you inspect each feature value of a sample and analyze why it was misclassified.
One more practical function: right-click the record in the result list, choose Visualize threshold curve, and select a class, and you will see the following.
Figure 7.3 Threshold Curve
The plot compares classification performance criteria under different thresholds on the classification confidence. By default it plots the false positive rate against the true positive rate, i.e. it gives the ROC curve. By choosing the color dimension we can easily observe how the different criteria are distributed. If the x-axis and y-axis are set to precision and recall, we can use this plot to trade off the two and choose a suitable threshold.
Other visualization functions are not covered here.
8. Summary
This article has introduced only some functions of Weka's Explorer interface. For the Explorer's other functions, such as clustering, association rules, and feature selection, and for the Experimenter and KnowledgeFlow interfaces, refer to Weka's official documentation.
In addition, Weka supports extension packages, which makes it easy to plug in open source tools such as LibLINEAR and LibSVM.
Under Linux, you can run Weka experiments from the command line; for specific usage, refer to the official Weka documentation.
With such an open-source, free, and powerful data mining tool, what are you waiting for? Data mining engineers who haven't used Weka yet, hurry up and get started.
by Weizheng
This article is from the "Baidu Technology Blog"; please keep this source: http://baidutech.blog.51cto.com/4114344/1033714