Python algorithm walkthrough: the One Rule (OneR) algorithm
Suppose a feature takes only the values 0 and 1, and the dataset has three classes. Among the individuals whose value for this feature is 0, say 20 belong to class A, 60 to class B, and 20 to class C. Class B is then the most likely class when the feature is 0, so the rule predicts B for that value. The other 40 individuals are not in class B, so predicting B for feature value 0 has an error rate of 40%. The same count is made for every value of every feature, the errors of each feature are totaled, and the feature with the lowest total error is chosen as the single classification rule. This is OneR.
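The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the label list is made up to match the 20/60/20 split described above, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical class labels of the 100 individuals whose feature value is 0
labels = ["A"] * 20 + ["B"] * 60 + ["C"] * 20

# The rule for this feature value predicts the most frequent class
counts = Counter(labels)
most_frequent_class, hits = counts.most_common(1)[0]

# Every individual outside that class is an error
errors = len(labels) - hits
error_rate = errors / float(len(labels))

print(most_frequent_class, errors, error_rate)
```

OneR repeats this count for each value of each feature and keeps the feature whose total error is smallest.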
Now let's implement the algorithm in code.
# OneR algorithm implementation
from collections import defaultdict
from operator import itemgetter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load the iris dataset
dataset = load_iris()
# Feature array of the dataset
X = dataset.data
# Target array (class labels) of the dataset
y_true = dataset.target

# Compute the mean of each feature
attribute_means = X.mean(axis=0)
# Turn the continuous features into discrete ones: values greater than
# or equal to the feature mean become 1, smaller values become 0
X = np.array(X >= attribute_means, dtype="int")

x_train, x_test, y_train, y_test = train_test_split(X, y_true, random_state=14)


# For one value of one feature, find the most frequent class and count
# the samples that belong to any other class (the errors).
def train_feature_class(x, y_true, feature_index, feature_value):
    num_class = defaultdict(int)
    for sample, y in zip(x, y_true):
        if sample[feature_index] == feature_value:
            num_class[y] += 1
    # Sort the class counts in descending order to find the largest class
    sorted_num_class = sorted(num_class.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_num_class[0][0]
    error = sum(value_num for class_num, value_num in sorted_num_class
                if class_num != most_frequent_class)
    return most_frequent_class, error

# print(train_feature_class(x_train, y_train, 0, 1))


# For one feature, find the predicted class of each feature value and
# the total number of errors the feature makes on the training data.
def train_feature(x, y_true, feature_index):
    n_sample, n_feature = x.shape
    assert 0 <= feature_index < n_feature
    values = set(x[:, feature_index])
    predictors = {}
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_class(x, y_true, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error


# Train every feature. The result has the form {0: ({0: 0, 1: 2}, 41)}:
# a dictionary keyed by feature index whose values are tuples of
# (dictionary mapping feature value to class, total error count).
all_predictors = {feature: train_feature(x_train, y_train, feature)
                  for feature in range(x_train.shape[1])}
# print(all_predictors)

# Extract the error count of each feature
errors = {feature: error for feature, (mapping, error) in all_predictors.items()}
# Sort by error count to get the best feature and the lowest error.
# This selection is the One Rule (OneR) algorithm.
best_feature, best_error = sorted(errors.items(), key=itemgetter(1), reverse=False)[0]
# print("The best model is based on feature {0} and has error {1:.2f}".format(best_feature, best_error))
# print(all_predictors[best_feature][0])

# Build the model
model = {"feature": best_feature, "predictor": all_predictors[best_feature][0]}
# print(model)


# Test: classify each sample by its value of the best feature
def predict(x_test, model):
    feature = model["feature"]
    predictor = model["predictor"]
    y_predictor = np.array([predictor[int(sample[feature])] for sample in x_test])
    return y_predictor

y_predictor = predict(x_test, model)
# print(y_predictor)

# Compare the predictions with the test labels to get the accuracy
accuracy = np.mean(y_predictor == y_test) * 100
print("The test accuracy is {0:.2f}%".format(accuracy))

print(classification_report(y_test, y_predictor))
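The thresholding step in the code above is easy to misread, so here is a minimal sketch of it on a small made-up array. The values and the names `X_toy`, `means` and `X_binary` are hypothetical and not taken from iris. Each column is compared with its own mean, and values at or above the mean become 1.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 continuous features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [6.0, 40.0]])

# Per-column means: 3.0 for the first feature, 25.0 for the second
means = X_toy.mean(axis=0)

# Broadcasting compares every row against the per-column means;
# the boolean result is then stored as 0/1 integers
X_binary = np.array(X_toy >= means, dtype="int")

print(X_binary)
```

Note that a value exactly equal to the mean (3.0 in the first column) is mapped to 1, because the comparison is `>=`.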
Conclusion: When I first looked at OneR, I thought the feature with the lowest error rate could somehow decide the classification for all features. In fact, it only classifies samples by their value of that one feature, so it clearly has limitations. Here the chosen feature is binary, so the rule can predict at most two of the three classes, which is why class 1 scores 0 in the report below. Its advantage is that it is quick and simple, but you still have to decide whether it suits your problem.
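To make the limitation concrete, the sketch below applies a one-feature rule of the form shown earlier, {0: 0, 1: 2}, to a few made-up samples. The mapping and the samples are illustrative assumptions, not the trained model. The point is only that a rule over a binary feature can never output more than two distinct classes.

```python
# Hypothetical rule: feature value 0 -> class 0, feature value 1 -> class 2
predictor = {0: 0, 1: 2}

# Made-up (feature_value, true_class) pairs covering all three classes
samples = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 2), (0, 1)]

predictions = [predictor[value] for value, _ in samples]
predicted_classes = set(predictions)

# Class 1 is never predicted, so none of its samples can be classified correctly
class_1_hits = sum(1 for (_, true_class), predicted in zip(samples, predictions)
                   if true_class == 1 and predicted == 1)

print(predicted_classes, class_1_hits)
```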
Classification report on the test set:
Class        precision    recall    f1-score    support
0            0.94         1.00      0.97        17
1            0.00         0.00      0.00        13
2            0.40         1.00      0.57        8
avg / total  0.51         0.66      0.55        38
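The last row of the report appears to be a support-weighted average of the per-class rows. That is an inference from the numbers rather than something stated in the post. The sketch below recomputes it from the rounded per-class values, so the results only agree with the report to within rounding.

```python
# Per-class rows copied from the report: (precision, recall, f1-score, support)
rows = {
    0: (0.94, 1.00, 0.97, 17),
    1: (0.00, 0.00, 0.00, 13),
    2: (0.40, 1.00, 0.57, 8),
}

total_support = sum(row[3] for row in rows.values())


def weighted_average(column):
    # Weight each class's score by its number of test samples
    return sum(row[column] * row[3] for row in rows.values()) / total_support


avg_precision = weighted_average(0)
avg_recall = weighted_average(1)
avg_f1 = weighted_average(2)

print(total_support, avg_precision, avg_recall, avg_f1)
```

The weighted recall is simply the share of test samples classified correctly, 25 of 38, which is why it matches the overall accuracy of roughly 66%.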
Note:
# In the code above, the loop
for sample in x_test:
    print(sample[0])
# prints the first column of x_test, one value per sample. The following line prints the first row of x_test instead.
print(x_test[0])
# Note the difference between the two.
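The difference is easier to see on a small array. The sketch below uses a made-up stand-in for x_test (the name `x_demo` and its values are hypothetical) and also shows NumPy's slice syntax for taking a column directly.

```python
import numpy as np

# Hypothetical stand-in for x_test: 3 samples, 4 binary features
x_demo = np.array([[0, 1, 0, 0],
                   [1, 1, 0, 1],
                   [1, 0, 1, 1]])

# Iterating over the array yields rows, so sample[0] is the first
# feature of each sample: together these values form the first column
first_column = [int(sample[0]) for sample in x_demo]

# Indexing with a single integer selects a whole row (one sample)
first_row = x_demo[0]

# The same column, selected without a loop
first_column_sliced = x_demo[:, 0]

print(first_column, first_row, first_column_sliced)
```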