Contents
- Iris Dataset
- KNN: k-Nearest Neighbors Algorithm
- Training Data and Prediction
- Evaluation
- Python Implementation
This series of articles describes how to use the Go language for data analysis and machine learning.
There are not many machine learning libraries for Go, and their features are not as rich as Python's; hopefully more full-featured libraries will appear in the next few years.
This article uses the GoLearn library to analyze the iris dataset with the KNN algorithm.
Iris Dataset
The Iris dataset, also known as Fisher's Iris dataset or the Anderson Iris dataset, is a classic dataset for multivariate analysis. Edgar Anderson originally collected the iris measurements on the Gaspé Peninsula in Canada, and Ronald Fisher introduced the data into statistics as an example of discriminant analysis.
Other popular datasets include Adult, Wine, and Car Evaluation.
The Iris dataset contains 150 samples, which belong to three species of Iris: Iris setosa, Iris versicolor, and Iris virginica. Four features are measured for each sample: the length and width of the sepals and the petals. Based on these four features, Fisher developed a linear discriminant model to determine the species of a sample.
Here are the three species of iris, which are quite beautiful:
Below is a scatter plot of the iris dataset. The first species is linearly separable from the other two; the latter two are not linearly separable from each other:
The above introduction is based mainly on the Wikipedia and Baidu Encyclopedia entries for the iris dataset.
The dataset is easy to find on the web, and it can also be downloaded from the GoLearn project.
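For reference, each row of the CSV file holds the four measurements followed by the species label. The exact header line depends on the copy you download (the one below is illustrative), but the rows look like this:

```
Sepal length,Sepal width,Petal length,Petal width,Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
```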
KNN: k-Nearest Neighbors Algorithm
KNN is one of the simplest classification methods in data mining. "k nearest neighbors" means just that: each sample can be represented by its k closest neighbors.
The core idea of KNN is this: if the majority of the k training samples nearest to a given sample in feature space belong to one category, then the sample belongs to that category as well and takes on the characteristics of that category. The method decides the category of a sample based only on the category of its nearest neighbor or neighbors. Because KNN relies on a limited number of surrounding samples, rather than on a discriminant function over class domains, it is better suited than other methods to sample sets whose class domains overlap or intersect heavily.
Simply put: if you live in an upscale neighborhood surrounded by "high-end" residents, you will be judged "high-end" too, whether you actually are or not...
Here `k` is the number of nearest neighbors to consider.
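To make the idea concrete, here is a minimal from-scratch sketch of the algorithm in Go: compute the distance from the query point to every training sample, take the k closest, and take a majority vote on their labels. This is illustrative only (GoLearn, used below, does not work this way internally), and all names in it are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sample pairs a feature vector with its class label.
type sample struct {
	features []float64
	label    string
}

// euclidean returns the Euclidean distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// classify predicts the label of x by majority vote among the k
// training samples closest to x.
func classify(train []sample, x []float64, k int) string {
	// sort a copy of the training set by distance to x
	sorted := make([]sample, len(train))
	copy(sorted, train)
	sort.Slice(sorted, func(i, j int) bool {
		return euclidean(sorted[i].features, x) < euclidean(sorted[j].features, x)
	})

	// count the labels of the k nearest neighbors
	votes := map[string]int{}
	for _, s := range sorted[:k] {
		votes[s.label]++
	}

	// return the most common label among them
	best, bestCount := "", -1
	for label, count := range votes {
		if count > bestCount {
			best, bestCount = label, count
		}
	}
	return best
}

func main() {
	// two sepal measurements per flower, labels from the iris dataset
	train := []sample{
		{[]float64{5.1, 3.5}, "Iris-setosa"},
		{[]float64{4.9, 3.0}, "Iris-setosa"},
		{[]float64{7.0, 3.2}, "Iris-versicolor"},
		{[]float64{6.4, 3.2}, "Iris-versicolor"},
	}
	fmt.Println(classify(train, []float64{5.0, 3.4}, 3)) // Iris-setosa
}
```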
More detailed introductions to the algorithm can be found on Baidu Encyclopedia and Wikipedia.
Training Data and Prediction
Let's look at an example that uses GoLearn's KNN classifier to analyze the iris dataset.
```go
package main

import (
	"fmt"

	"github.com/sjwhitworth/golearn/base"
	"github.com/sjwhitworth/golearn/evaluation"
	"github.com/sjwhitworth/golearn/knn"
)

func main() {
	rawData, err := base.ParseCSVToInstances("../datasets/iris_headers.csv", true)
	if err != nil {
		panic(err)
	}

	// initialises a new KNN classifier
	cls := knn.NewKnnClassifier("euclidean", "linear", 2)

	// do a training-test split
	trainData, testData := base.InstancesTrainTestSplit(rawData, 0.50)
	cls.Fit(trainData)

	// calculates the Euclidean distance and returns the most popular label
	predictions, err := cls.Predict(testData)
	if err != nil {
		panic(err)
	}
	fmt.Println(predictions)

	// prints precision/recall metrics
	confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
	if err != nil {
		panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
	}
	fmt.Println(evaluation.GetSummary(confusionMat))
}
```
The call to `base.ParseCSVToInstances` loads the iris dataset; the `base` package provides functions for reading CSV text files.
`knn.NewKnnClassifier` creates a KNN classifier. The first argument selects the distance function; `euclidean` is used here, and `manhattan` and `cosine` are also supported. The second argument selects the neighbor-search algorithm, either `linear` or `kdtree`. The third argument sets `k`, here 2.
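For example, to try Manhattan distance with a kd-tree search and a larger `k`, you could replace the constructor call in the program above (a small variation, not part of the original example):

```go
// Manhattan distance, kd-tree neighbor search, k = 3
cls := knn.NewKnnClassifier("manhattan", "kdtree", 3)
```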
`base.InstancesTrainTestSplit` divides the iris dataset into two parts according to the given ratio. A random number is compared against this parameter for each row, so the resulting split only roughly matches the requested proportion: one part is used for training, the other for testing. `cls.Fit` then trains the classifier on the training data.
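To illustrate why the split is only approximate, here is a minimal sketch of such a randomized split (illustrative only, not GoLearn's actual implementation; it reuses the `sample` type from the KNN sketch above and needs `math/rand`):

```go
// trainTestSplit assigns each row to the test set with probability p,
// so the resulting split only approximates the requested proportion.
func trainTestSplit(rows []sample, p float64) (train, test []sample) {
	for _, row := range rows {
		if rand.Float64() < p {
			test = append(test, row)
		} else {
			train = append(train, row)
		}
	}
	return train, test
}
```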
`cls.Predict` makes predictions on the test data, and the prediction results are printed.
Finally, `evaluation.GetConfusionMatrix` and `evaluation.GetSummary` evaluate the predictive model and output the evaluation results.
Evaluation
First, let's look at the evaluation output:
```
Reference Class    True Positives  False Positives  True Negatives  Precision  Recall  F1 Score
---------------    --------------  ---------------  --------------  ---------  ------  --------
Iris-setosa        …               0                …               1.0000     1.0000  1.0000
Iris-virginica     …               1                …               0.9643     0.9310  0.9474
Iris-versicolor    …               2                …               0.9333     0.9655  0.9492
```
Here are a few concepts to explain.
- Confusion matrix: a table relating the true classes of the sample data to the predicted classes; it is a common way to evaluate classifier performance in supervised learning.
- True positives (TP): positive samples correctly predicted as positive by the model.
- False positives (FP): negative samples incorrectly predicted as positive; these are false alarms.
- True negatives (TN): negative samples correctly predicted as negative by the model.
- False negatives (FN): positive samples incorrectly predicted as negative; these are misses.
- Precision: the proportion of samples predicted as positive that really are positive: $$P = \frac{TP}{TP+FP}$$
- Recall: the proportion of actual positive samples that are predicted correctly: $$R = TPR = \frac{TP}{TP+FN}$$
- F1 score: to compare the merits of different algorithms with a single number, the F1 score was proposed on top of precision and recall as an overall measure of both. It is defined as follows:
$$F_1 = \frac{2 \times P \times R}{P + R}$$
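As a quick sanity check against the evaluation table above: for Iris-virginica, $F_1 = 2 \times 0.9643 \times 0.9310 / (0.9643 + 0.9310) \approx 0.9474$, which matches the reported value.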
Python Implementation
The same logic is easy to implement in Python with `sklearn`:
```python
from sklearn import neighbors, datasets, metrics
from sklearn.model_selection import train_test_split

# import some data to play with
iris = datasets.load_iris()

# prepare data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

# we create an instance of a neighbors classifier and fit the data
knn = neighbors.KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X_train, y_train)

# make predictions
predicted = knn.predict(X_test)

# evaluate
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
```