Data Analysis Using Go Machine Learning Libraries, Part 1: KNN


Contents

    1. Iris Data Set
    2. The KNN (k-Nearest Neighbors) Algorithm
    3. Training and Prediction
    4. Evaluation
    5. Python Implementation

This series of articles describes how to use the Go language for data analysis and machine learning.

There are not many machine learning libraries for Go, and their features are far less rich than Python's. Hopefully more full-featured libraries will appear over the next few years.

This article uses the GoLearn library to analyze the iris data set with the KNN method.

Iris Data Set

The iris data set, also known as Fisher's iris data set or Anderson's iris data set, is a multivariate data set. Edgar Anderson originally collected the measurements from irises on the Gaspé Peninsula in Canada, and Ronald Fisher used the data in statistics as an example of discriminant analysis.

Other popular data sets include Adult, Wine, Car Evaluation, etc.

The iris data set contains 150 samples, all belonging to three species of Iris: Iris setosa, Iris versicolor, and Iris virginica. Four features were measured on each sample: the length and the width of the sepals and petals. Based on these four features, Fisher developed a linear discriminant model to determine a sample's species.

Here are the three species of iris, which are quite beautiful:

[Photos of Iris setosa, Iris versicolor, and Iris virginica]
Below is a scatter plot of the iris data set. The first species is linearly separable from the other two, while the latter two are not linearly separable from each other:

[Scatter plot of the iris data set]
The content above mainly draws on the Wikipedia and Baidu Encyclopedia introductions to the iris data set.

The data set is easy to find on the web and can also be downloaded from the GoLearn project.

The KNN (k-Nearest Neighbors) Algorithm

KNN is one of the simplest classification methods in data mining. "k nearest neighbors" means exactly what it says: each sample can be represented by its k closest neighbors.

The core idea of the KNN algorithm is that if most of the k samples nearest to a given sample in feature space belong to some category, then that sample belongs to the category too and shares the characteristics of samples in it. In making a classification decision, the method looks only at the categories of the one or few nearest samples. Because KNN relies on a limited number of surrounding samples rather than on discriminating between class domains, it is better suited than other methods to sample sets whose class domains overlap or intersect heavily.

Simply put: if you live in an upscale neighborhood surrounded by "high-end" people, you will be classified as "high-end" too, whether you like it or not...

The k stands for the number of nearest neighbors considered.

For a more detailed introduction to the algorithm, consult Baidu Encyclopedia or Wikipedia.
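To make the idea concrete, here is a minimal, self-contained KNN sketch in Go. The point type, the classify helper, and the toy training data are all invented for illustration; they are not part of GoLearn:

package main

import (
    "fmt"
    "math"
    "sort"
)

// point is a labeled sample with two features.
type point struct {
    x, y  float64
    label string
}

// classify returns the majority label among the k training points
// nearest to (x, y), measured by Euclidean distance.
func classify(train []point, x, y float64, k int) string {
    // Sort a copy of the training points by distance to the query point.
    sorted := append([]point(nil), train...)
    sort.Slice(sorted, func(i, j int) bool {
        return math.Hypot(sorted[i].x-x, sorted[i].y-y) <
            math.Hypot(sorted[j].x-x, sorted[j].y-y)
    })

    // Tally the labels of the k nearest neighbors and keep the majority.
    votes := map[string]int{}
    best, bestCount := "", 0
    for _, p := range sorted[:k] {
        votes[p.label]++
        if votes[p.label] > bestCount {
            best, bestCount = p.label, votes[p.label]
        }
    }
    return best
}

func main() {
    train := []point{
        {1.0, 1.1, "A"}, {1.2, 0.9, "A"}, {0.8, 1.0, "A"},
        {3.0, 3.2, "B"}, {3.1, 2.9, "B"}, {2.9, 3.0, "B"},
    }
    fmt.Println(classify(train, 1.1, 1.0, 3)) // A
    fmt.Println(classify(train, 3.0, 3.0, 3)) // B
}

Sorting every training point for each query costs O(n log n); real libraries avoid this with data structures such as kd-trees, which is exactly what the kdtree option in GoLearn below refers to.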

Training and Prediction

Let's look at a GoLearn example that uses the KNN algorithm to analyze the iris data set.

 1  package main
 2
 3  import (
 4      "fmt"
 5
 6      "github.com/sjwhitworth/golearn/base"
 7      "github.com/sjwhitworth/golearn/evaluation"
 8      "github.com/sjwhitworth/golearn/knn"
 9  )
10
11  func main() {
12      rawData, err := base.ParseCSVToInstances("../datasets/iris_headers.csv", true)
13      if err != nil {
14          panic(err)
15      }
16
17      // Initialises a new KNN classifier
18      cls := knn.NewKnnClassifier("euclidean", "linear", 2)
19
20      // Do a training-test split
21      trainData, testData := base.InstancesTrainTestSplit(rawData, 0.50)
22      cls.Fit(trainData)
23
24      // Calculates the Euclidean distance and returns the most popular label
25      predictions, err := cls.Predict(testData)
26      if err != nil {
27          panic(err)
28      }
29      fmt.Println(predictions)
30
31      // Prints precision/recall metrics
32      confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
33      if err != nil {
34          panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
35      }
36      fmt.Println(evaluation.GetSummary(confusionMat))
37  }

Line 12 loads the iris data set; the base package provides functions for reading CSV text files.

Line 18 creates a KNN classifier. Distances are calculated with the euclidean metric; manhattan and cosine are also supported. The second parameter selects the neighbor-search strategy and can be linear or kdtree.

Line 18 also sets k to 2.
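For example, to switch to Manhattan distance with a kd-tree neighbor search and k set to 3, line 18 would become (same constructor, different arguments):

cls := knn.NewKnnClassifier("manhattan", "kdtree", 3)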

Line 21 splits the iris data set into two parts according to the given ratio. For each sample, a random number is compared against this parameter, so the resulting split only roughly matches the requested proportion: one part is used for training and the other for testing. Line 22 then trains the classifier on the training data.
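To see why the proportion is only approximate, here is a conceptual sketch of a random train/test split in Go; it illustrates the mechanism described above but is not GoLearn's actual implementation:

package main

import (
    "fmt"
    "math/rand"
)

// split sends each row index to the test set with probability p and to
// the training set otherwise, so the final sizes only roughly match p.
func split(rows []int, p float64) (train, test []int) {
    for _, r := range rows {
        if rand.Float64() < p {
            test = append(test, r)
        } else {
            train = append(train, r)
        }
    }
    return train, test
}

func main() {
    rows := make([]int, 150)
    for i := range rows {
        rows[i] = i
    }
    train, test := split(rows, 0.50)
    fmt.Printf("train: %d rows, test: %d rows\n", len(train), len(test))
}

Running it a few times will typically print sizes near, but not exactly, 75/75.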

Line 25 makes predictions on the test data, and the results are printed on line 29.

Lines 32 through 36 evaluate the predictive model and output the evaluation results.
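One more evaluation call can be handy. Assuming your GoLearn version provides the evaluation.GetAccuracy helper (worth verifying against the version you have installed), a single overall accuracy figure can be printed by adding one line after line 36:

// Overall fraction of correctly classified test samples.
// Assumes evaluation.GetAccuracy exists in your GoLearn version.
fmt.Println(evaluation.GetAccuracy(confusionMat))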

Evaluation

First, look at the evaluation results:

Reference class    True positives  False positives  True negatives  Precision  Recall  F1 Score
---------------    --------------  ---------------  --------------  ---------  ------  --------
Iris-setosa        -               0                -               1.0000     1.0000  1.0000
Iris-virginica     -               1                -               0.9643     0.9310  0.9474
Iris-versicolor    -               2                -               0.9333     0.9655  0.9492

Here are a few concepts to explain.

    • Confusion matrix: describes the relationship between the true classes of the sample data and the predicted classes; it is a common way to evaluate classifier performance in supervised learning.
    • True positives (TP): positive samples that the model predicts as positive.
    • False positives (FP): negative samples that the model predicts as positive; also known as false alarms.
    • True negatives (TN): negative samples that the model predicts as negative.
    • False negatives (FN): positive samples that the model predicts as negative; also known as misses.
    • Precision: of the samples predicted as positive, the fraction that are truly positive: $$P = \frac{TP}{TP+FP}$$
    • Recall: of the truly positive samples, the fraction that are predicted as positive: $$R = TPR = \frac{TP}{TP+FN}$$
    • F1 score: to compare the merits of different algorithms, the F1 value builds on precision and recall to give a single overall measure of both: $$F1 = \frac{2 \times P \times R}{P + R}$$ (a worked example follows below).
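As a quick sanity check of these formulas against the table above, take the Iris-virginica row and assume TP = 27 and FN = 2 (counts consistent with the printed FP = 1, precision, and recall; the assumption is needed because the raw counts are not shown):

$$P = \frac{27}{27+1} \approx 0.9643,\qquad R = \frac{27}{27+2} \approx 0.9310,\qquad F1 = \frac{2 \times 0.9643 \times 0.9310}{0.9643 + 0.9310} \approx 0.9474$$

These match the precision, recall, and F1 columns of the Iris-virginica row.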

Python Implementation

With sklearn, it is easy to implement the same logic:

from sklearn import neighbors, datasets, metrics
from sklearn.model_selection import train_test_split

# import some data to play with
iris = datasets.load_iris()

# prepare data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

# we create an instance of a neighbors classifier and fit the data
knn = neighbors.KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X_train, y_train)

# make predictions
predicted = knn.predict(X_test)

# evaluate
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))