Weka Algorithm Introduction


RWeka (http://cran.r-project.org/web/packages/RWeka/index.html) provides, among others, the following functions (a short usage sketch follows the list):
1) Data input and output
WOW(): lists the parameters (options) of a Weka function.
Weka_control(): sets the parameters of a Weka function.
read.arff(): reads data in Weka's Attribute-Relation File Format (ARFF).
write.arff(): writes data to Weka's Attribute-Relation File Format (ARFF).
2) Data preprocessing
Normalize(): unsupervised normalization of continuous data.
Discretize(): supervised discretization of continuous numeric data using the MDL (Minimum Description Length) method.
3) Classification and regression
IBk(): k-nearest-neighbor classification.
LBR(): naive Bayes classification (Lazy Bayesian Rules).
J48(): the C4.5 decision tree algorithm (the tree considers each attribute independently when splitting).
LMT(): combines a tree structure with logistic regression models; each leaf node is a logistic regression model, and its accuracy is often better than a single decision tree or logistic regression alone.
M5P(): the M5 model tree algorithm, which combines a tree structure with linear regression models; each leaf node is a linear regression model, so it can be used for regression on continuous data.
DecisionStump(): a single-level decision tree, often used as the base learner for boosting.
SMO(): support vector machine classification.
AdaBoostM1(): the AdaBoost M1 method; the -W parameter specifies the weak learner's algorithm.
Bagging(): creates multiple models by sampling from the original data with replacement.
LogitBoost(): boosting in which the weak learners fit real values by (additive) logistic regression.
MultiBoostAB(): an improvement of the AdaBoost method that can be seen as a combination of AdaBoost and "wagging".
Stacking(): an algorithm for combining different base classifiers into an ensemble.
LinearRegression(): builds a suitable linear regression model.
Logistic(): builds a logistic regression model.
JRip(): a rule-learning method (RIPPER).
M5Rules(): uses the M5 method to produce decision rules for regression problems.
OneR(): the simple 1-R classifier.
PART(): produces PART decision rules.
4) Clustering
Cobweb(): a model-based method that assumes a model for each cluster and finds the data that best fit the model; it is not appropriate for clustering large databases.
FarthestFirst(): a fast approximate k-means clustering algorithm.
SimpleKMeans(): the k-means clustering algorithm.
XMeans(): an improved k-means that can determine the number of clusters automatically.
DBScan(): a density-based clustering method that grows clusters according to the density around each object; it can find clusters of arbitrary shape in a spatial database containing noise. This method defines a cluster as a maximal set of density-connected points.
5) Association rules
Apriori(): Apriori is the most influential basic algorithm in the field of association rules. It is a breadth-first algorithm that obtains the frequent itemsets whose support exceeds the minimum support by scanning the database multiple times. Its theoretical basis is the pair of monotonicity principles for frequent itemsets: any subset of a frequent itemset must be frequent, and any superset of an infrequent itemset must be infrequent. On massive data, the time and space costs of the Apriori algorithm are very high.
Tertius(): the Tertius rule-discovery algorithm.
6) Prediction and evaluation
predict(): predicts the class (or cluster) of new data from a fitted classification or clustering model.
table(): cross-tabulates two factor objects, e.g. predicted versus actual classes.
evaluate_Weka_classifier(): evaluates a model's performance, e.g. TP rate, FP rate, precision, recall, and F-measure.
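As a quick orientation, here is a minimal sketch of a typical RWeka workflow using the functions above. The iris dataset merely stands in for real data, and the option values are only illustrative:

library(RWeka)

## List the options J48 accepts; Weka_control() passes them through.
WOW("J48")

## Fit a C4.5 tree; C (pruning confidence) and M (minimum instances per
## leaf) mirror Weka's command-line options -C and -M.
m <- J48(Species ~ ., data = iris, control = Weka_control(C = 0.25, M = 2))

## 10-fold cross-validated evaluation with per-class TP/FP rate,
## precision, recall, and F-measure.
evaluate_Weka_classifier(m, numFolds = 10, class = TRUE)

## Predicted classes and class probabilities for new instances.
predict(m, newdata = iris[1:5, ])
predict(m, newdata = iris[1:5, ], type = "probability")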

3. Classification and regression 
Background knowledge
Weka puts classification (Classification) and regression (Regression) together under the "Classify" tab, and for good reason.
In both tasks there is a target attribute (the output variable). We want to predict the target from a set of features (input variables) of a sample (called an instance in Weka). To do this, we need a training dataset in which both the inputs and the output of every instance are known. A predictive model is built by observing the instances in the training set, and with this model we can predict the output of new, unknown instances. The measure of a model is how accurate its predictions are.
In Weka, the target (output) attribute to be predicted is called the class attribute; the name comes from the classification task. In general, our task is called classification when the class attribute is categorical, and regression when the class attribute is numeric.
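The type of the class attribute thus determines the task and the applicable algorithms. A minimal sketch (again on iris, purely for illustration):

library(RWeka)
J48(Species ~ ., data = iris)      # "Species" is a factor: classification
M5P(Petal.Width ~ ., data = iris)  # "Petal.Width" is numeric: regression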

Selection algorithm
In this section, we use the C4.5 decision tree algorithm to build a classification model for the bank data.
Let's look at the original "bank-data.csv" file. The "id" attribute is clearly not needed. Since the C4.5 algorithm can handle numeric attributes, we do not have to discretize every variable into a categorical type as we did for association rules. Nonetheless, we will convert the "children" attribute into a categorical type with the two values "YES" and "NO". In addition, our training set takes only half of the instances of the original dataset, and several instances are drawn from the other half to serve as the data to be predicted, with their "pep" attribute set to missing. The training set after this processing can be downloaded here, and the prediction set here.

We use "Explorer" to open the training set "Bank.arff" and see if it is handled according to the previous requirements. Switch to the "Classify" tab and click on the "Choose" button to see a number of categories or regression algorithms grouped in a tree-shaped box. In version 3.5 of the Weka, there is a "filter ..." button underneath the tree box, and clicking can filter out inappropriate algorithms based on the characteristics of the dataset. The input attributes of our dataset are "binary" (that is, only two classes of categorical type) and numeric attributes, and the class variable is "binary", so we tick "binary attributes" "Numeric attributes" and " Binary class ". Click "OK" back to the tree diagram, you can find some algorithm names turn red, indicating that they can not be used. Choose "Trees" under the "J48", this is the C4.5 algorithm we need, fortunately it does not turn red. 
Click on "Choose" to the right of the text box, pop up a new window for the algorithm to set various parameters. Click "More" to view the parameter description, and the point "capabilities" is to view the algorithm's scope of application. Here we leave the parameters to the default. 
Now look at the "Test Option" in the left. We do not specifically set up the inspection data set, in order to ensure the accuracy of the resulting model is not over-fitting (overfitting) phenomenon, we need to use 10 percent cross-validation (10-fold validation) to select and evaluate the model. If you do not understand the meaning of cross-validation can Google a bit. 

Modeling Results
OK, select "Cross-validation" and fill "10" in the "Folds" box. Click the "Start" button to start the algorithm to generate a decision tree model. Soon, a decision tree expressed in text, and an error analysis of the decision tree, are shown in the "Classifier output" on the right. At the same time, the lower left "Results list" appears with an item showing just the time and algorithm name. If you change a model or change a parameter and start again, the Results list will have one more item.

We see that one line of the cross-validation result for the "J48" algorithm reads:
Correctly Classified Instances 206 68.6667 %
This means that the accuracy of this model is only about 69%. Perhaps we would need to rework the original attributes or adjust the algorithm's parameters to improve the accuracy, but here we will not worry about that and will continue to use this model.

Right-click on the "Results list" that just appeared, the pop-up menu select "Visualize Tree", the new window can see the graphical mode of the decision tree. It is recommended to maximize this new window, then right-click on "Fit to Screen", you can see this tree clearly. After reading or turning off

Here we explain the meaning of the "Confusion Matrix".
=== Confusion Matrix ===
  a   b   <-- classified as
 74  64 |  a = YES
 30 132 |  b = NO
This matrix says that of the instances whose original "pep" is "YES", 74 were correctly predicted as "YES" and 64 were wrongly predicted as "NO"; of the instances whose original "pep" is "NO", 30 were wrongly predicted as "YES" and 132 were correctly predicted as "NO". 74 + 64 + 30 + 132 = 300 is the total number of instances, and (74 + 132) / 300 = 0.68667 is exactly the proportion of correctly classified instances. The larger the numbers on the diagonal of the matrix, the better the prediction.
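In R, a confusion matrix is just a cross-tabulation of predicted against actual classes, which is exactly what the table() function from the catalog above provides. A sketch, assuming a J48 model fit on the training data (note that these are training-set predictions, not cross-validated ones):

library(RWeka)
bank <- read.arff("bank.arff")     # hypothetical path, as above
fit  <- J48(pep ~ ., data = bank)
cm   <- table(Predicted = predict(fit, bank), Actual = bank$pep)
cm                                 # the confusion matrix
sum(diag(cm)) / sum(cm)            # proportion of correctly classified instances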

Model application
Now we want to use the resulting model to predict the dataset awaiting prediction. Note that the attributes of this dataset must be set up exactly as in the training dataset. Weka does not directly provide a way to apply a model to a dataset to be predicted, so we need an indirect approach.
Select "Supplied test set" under "Test Options" and "Set" it to the "bank-new.arff" file. Click "Start" again. Note that the resulting model is no longer selected by cross-validation, and since the class values of the new data are missing, the error analysis given in "Classifier output" does not mean much. This is a disadvantage of this indirect prediction method.
Now right-click the first item in the "Result list" and select "Visualize classifier errors". We will not worry about what the diagram in the new window means; click the "Save" button and save the result as "bank-predicted.arff". This ARFF file contains the prediction results we need. Opening the new file in the "Preprocess" tab of Explorer, you can see two additional attributes, "Instance_number" and "predictedpep". "Instance_number" is the position of an instance in the original "bank-new.arff" file, and "predictedpep" is the result predicted by the model. Click the "Edit" button, or open the file in the "ArffViewer" module, to view the contents of this dataset. For example, our "pep" prediction for instance 0 is "YES" and for instance 4 is "NO".
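For comparison, RWeka needs no such indirection: a fitted model can be applied directly to new data with predict(). A sketch with hypothetical file names, reusing the fit from the confusion-matrix sketch above:

bank_new <- read.arff("bank-new.arff")                          # "pep" is missing here
pred <- predict(fit, newdata = bank_new)                        # predicted "pep" values
prob <- predict(fit, newdata = bank_new, type = "probability")  # class probabilities
head(data.frame(Instance_number = seq_along(pred) - 1, predictedpep = pred))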

Using the command line (recommended)
While the graphical interface is convenient for viewing results and setting parameters, the most direct and flexible way to build and apply models is to use the command line.
Open the "Simple CLI" module. The command to run the "J48" algorithm as above has the format:
java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model
The parameters "-C 0.25" and "-M 2" are the same as those set in the graphical interface. "-t" is followed by the full path of the training dataset (directory and file name), and "-d" by the full path under which to save the model. Note that here we are able to save the model!
After entering the command above, the resulting tree model and error analysis are displayed in the upper part of "Simple CLI" and can be copied into a text file for safekeeping. The error analysis here comes from applying the model to the training set itself.
The command used to apply this model to "bank-new.arff" has the format:
java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff
Here "-p 9" says that the class attribute of the model is the 9th attribute (that is, "pep"), "-l" is followed by the full path of the model, and "-T" by the full path of the dataset to be predicted.
After entering the above command, some results appear in "Simple CLI":
0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
...
Here the first column is the "Instance_number" we mentioned earlier, the second column is "predictedpep", and the fourth column is the original "pep" value from "bank-new.arff" (here "?", a missing value). The third column is the confidence of the prediction. For instance 0, we are 75% certain that its "pep" value is "YES"; for instance 4, we are 84.2% certain that its value is "NO".
As we can see, using the command line has at least two advantages. One is that the model can be saved: when new data to be predicted appear, we do not have to rebuild the model each time but can directly apply the saved one. The other is that each prediction comes with a confidence, so we can choose to adopt only the predictions we are sufficiently sure of, for example only those with confidence above 85%.
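This confidence-based filtering is easy to script; a sketch reusing pred and prob from the RWeka prediction sketch above:

conf <- apply(prob, 1, max)   # confidence = probability of the predicted class
keep <- conf > 0.85           # adopt only predictions with confidence above 85%
data.frame(predictedpep = pred, confidence = conf)[keep, ]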
Unfortunately, the command line still cannot save a model selected by a procedure such as cross-validation, nor apply such a model to the data to be predicted. For that purpose, the "PredictionAppender" of the "KnowledgeFlow" module must be used.

---- Organized from http://maya.cs.depaul.edu/~classes/ect584/WEKA/classify.html



4. Cluster analysis

Principle and implementation
The "Class" (cluster) in cluster analysis is different from the "class" in the preceding category, and a more accurate translation of cluster should be "cluster". The task of clustering is to assign all instances to a number of clusters, so that instances of the same cluster are clustered around a cluster center, and the distances between them are relatively close, while the distances between different cluster instances are far away. For instances characterized by numeric attributes, this distance usually refers to Euclidean distance.
We now use the most common K-means algorithm to perform a cluster analysis of the bank data from before. Let us briefly describe the steps of K-means clustering.
The K-means algorithm first assigns K cluster centers at random. Then it alternates two steps: 1) assign each instance to the cluster center closest to it, yielding K clusters; 2) recompute each cluster center as the mean of all instances in that cluster. Steps 1) and 2) are repeated until the positions of the K cluster centers no longer change, at which point the cluster assignment is final.
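As an illustration of these two steps, here is a minimal K-means sketch in base R. It uses Euclidean distance on a numeric matrix and, being a sketch, handles neither empty clusters nor categorical attributes:

kmeans_sketch <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # K random initial centers
  for (i in seq_len(iters)) {
    ## 1) assign each instance to the nearest cluster center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d)
    ## 2) recompute each center as the mean of its cluster's instances
    newc <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    if (all(newc == centers)) break                 # centers fixed: done
    centers <- newc
  }
  list(cluster = cluster, centers = centers)
}
## e.g. kmeans_sketch(iris[, 1:4], k = 3)$centers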

The K-means algorithm described above can only handle numeric attributes; when it encounters a categorical attribute, the attribute is converted into several attributes taking the values 0 and 1. Weka performs this categorical-to-numeric transformation automatically, and it also standardizes numeric data automatically. Therefore, for the raw data "bank-data.csv", the only preprocessing we do is to delete the attribute "id", change the attribute "children" to a categorical type, and save the result in ARFF format. This yields the data file "bank.arff", with 600 instances.
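This preprocessing can also be done from R; a sketch with hypothetical file and column names matching the description above:

library(RWeka)
bank <- read.csv("bank-data.csv")
bank$id <- NULL                            # delete the "id" attribute
bank$children <- as.factor(bank$children)  # make "children" categorical
write.arff(bank, "bank.arff")              # save in ARFF format for Weka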

Use "explorer" to open the "Bank.arff" you just got and switch to "Cluster". Click the "Choose" button to select "Simplekmeans", which is the algorithm to implement K-means in Weka. Click the next text box, modify "Numclusters" to 6, indicating that we want to put these 600 instances into 6 classes, that is, k=6. The following "seed" parameter is to set a random seed, which produces a random number, which is used to get the location of the K-cluster center given in the K-means algorithm for the first time. We might as well make it 10 for the time being. 
Select "Use training set" in "Cluster Mode" and click "Start" button to see the cluster result given by "clusterer output" on the right. You can also right-click on the result from the "result list" in the lower left corner and "View in separate window" to browse the results in a new pane. 

Results explained
First we notice that the result contains the line:
Within cluster sum of squared errors: 1604.7416693522332
This is a criterion for evaluating the quality of the clustering: the smaller the number, the closer the instances within the same cluster are to one another. You may well get a different value, and if you change the "seed" parameter the value will usually change too. We should try several seeds and adopt the result with the smallest value. For example, letting seed be 100, I get
Within cluster sum of squared errors: 1555.6241507629218
so I adopt this result instead. Of course, trying a few more seeds might produce an even smaller value.
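This seed search is easy to script in R. The sketch below reuses bank and the RWeka SimpleKMeans interface from above; reading the within-cluster sum of squared errors through the underlying Java object's getSquaredError() method via rJava is an assumption of this sketch:

## Assumption: the fitted object stores the Java clusterer in $clusterer and
## Weka's SimpleKMeans exposes getSquaredError(), called here through rJava.
sse  <- function(cl) rJava::.jcall(cl$clusterer, "D", "getSquaredError")
fits <- lapply(c(10, 42, 100, 2019), function(s)
  SimpleKMeans(bank, control = Weka_control(N = 6, S = s)))
best <- fits[[which.min(sapply(fits, sse))]]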

The next "Cluster centroids:" Lists the locations of each cluster center. For a numeric attribute, the cluster Center is its mean (Mean), and the number of types is its majority (Mode), which means that the attribute has the most value of the number of instances. For the properties of the numerical type, the standard deviation (Std Devs) in each cluster is also given.

The last "Clustered Instances" is the number and percentage of instances in each cluster.

To see the clustering result visually, right-click the result in the lower-left "Result list" and click "Visualize cluster assignments". A pop-up window shows a scatter plot of the instances. The two boxes at the top select the horizontal and vertical axes; the "Colour" box in the second row selects the basis for coloring the points, and by default instances are given different colors according to the cluster ("Cluster") they belong to.
You can save the clustering result as an ARFF file here by clicking "Save". In this new ARFF file, the "Instance_number" attribute is the number of an instance, and the "Cluster" attribute is the cluster assigned to that instance by the clustering algorithm.
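From R, the same artifact can be produced with write.arff; a sketch reusing the clustering cl and data bank from above (the output file name is hypothetical):

bank$Cluster <- factor(predict(cl, bank))  # attach each instance's cluster
write.arff(bank, "bank-clustered.arff")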
