I have recently been working through some basic algorithms in the book "Machine Learning in Action". As a complete beginner, I also looked up material online, so what follows is a summary of what I read in the book plus content from other blogs. For the blog, please refer to http://www.cnblogs.com/Baiyishaonian/p/4567446.html
K-Nearest Neighbor algorithm
The K-Nearest Neighbor algorithm classifies data by measuring the distance between feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity.
Applicable data types: numeric and nominal.
Working principle:
We start with a collection of sample data, also known as the training sample set. Every item in the sample set has a label, so we know which class each item belongs to. When new, unlabeled data comes in, each feature of the new data is compared with the corresponding feature of the data in the sample set, and the algorithm extracts the class labels of the most similar (nearest) samples. In general, we only look at the k most similar items in the sample set, which is where the k in the K-Nearest Neighbor algorithm comes from. The class that appears most often among those k items becomes the class of the new data.
Importing data using Python
From the working principle of the K-Nearest Neighbor algorithm, we can see that classifying data with it requires sample data on hand; without sample data there is nothing to build the classification function on. So our first step is to import the sample data set.
Create a module named knn.py and write the code:
from numpy import *
import operator

def createDataSet():
    # Four sample points, one per row
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # One class label per row of group
    labels = ['A', 'A', 'B', 'B']
    return group, labels
In this code we import two Python modules: the scientific computing package NumPy and the operator module. NumPy is a separate library that is not part of the Python standard library, and most Python installations do not include it by default, so it needs to be installed separately.
We create the data set in the createDataSet() function. group and labels together form the training samples. In keeping with the working principle, every item in the data set has a label, so the number of elements in labels equals the number of rows in the group matrix. Here we define the data point (1, 1.1) as class A and the data point (0, 0.1) as class B. The data in this example is chosen arbitrarily, and no axis labels are given.
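As a quick sanity check, the following sketch rebuilds the same sample data and confirms that there is one label per row. It uses an explicit `import numpy as np` instead of `from numpy import *`, and the name `create_dataset` is my own choice for illustration, not part of the original module.

```python
import numpy as np

def create_dataset():
    # Same four sample points and labels as in the text.
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = create_dataset()
print(group.shape)   # (rows, columns) of the sample matrix
print(len(labels))   # number of labels

# There must be exactly one label per sample row.
assert group.shape[0] == len(labels)
```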
Implementation of K-nearest neighbor algorithm
The K-Nearest Neighbor algorithm works as follows:
(1) Calculate the distance between every point in the data set of known categories and the current point
(2) Sort the points in ascending order of distance
(3) Select the k points closest to the current point
(4) Count how often each category occurs among those k points
(5) Return the most frequent category among those k points as the predicted classification of the current point
The Python implementation of the K-Nearest Neighbor algorithm is as follows:
# coding:utf-8

from numpy import *
import operator
import knn

group, labels = knn.createDataSet()

def classify(inX, dataSet, labels, k):
    # Number of training samples (rows of dataSet)
    dataSetSize = dataSet.shape[0]
    # Difference between inX and every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    # Sum the squared differences along each row
    sqDistances = sqDiffMat.sum(axis=1)
    # Euclidean distances
    distances = sqDistances ** 0.5
    # Indices of the samples, nearest first
    sortedDistances = distances.argsort()
    classCount = {}
    # Count the labels of the k nearest samples
    for i in range(k):
        numOfLabel = labels[sortedDistances[i]]
        classCount[numOfLabel] = classCount.get(numOfLabel, 0) + 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

my = classify([0, 0], group, labels, 3)
print(my)
Running the code gives the following result:
The output is B, which shows that our new data point ([0,0]) belongs to class B.
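To see why the answer is B, it helps to print the distance from [0,0] to each sample point. This is a small sketch that recomputes the distances with NumPy broadcasting instead of tile; the variable names are illustrative only.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0, 0])

# Euclidean distance from the query point to every sample row.
distances = np.sqrt(((group - query) ** 2).sum(axis=1))
for label, dist in zip(labels, distances):
    print(label, round(float(dist), 3))

# Labels of the 3 nearest samples, nearest first.
nearest = [labels[i] for i in distances.argsort()[:3]]
print(nearest)
```

Two of the three nearest samples are labeled B, so the majority vote with k = 3 is B.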
Code explanation
I suspect many readers will have questions about this code, so next I'll go over some of the key points of this function, to make the algorithm easier to review later, both for readers and for myself.
Parameters of the classify function:
- inX: the input vector to classify
- dataSet: the training sample set
- labels: the label vector
- k: the k in the K-Nearest Neighbor algorithm, that is, the number of nearest neighbors to use
shape: an attribute of an array that describes the size of each dimension of a multidimensional array. dataSet.shape[0] is the number of rows, that is, the number of training samples.
tile(inX, (dataSetSize, 1)): repeats inX to build a two-dimensional array. dataSetSize is the number of times inX is repeated along the rows, and 1 is the number of times it is repeated along the columns, so the result has dataSetSize rows, each a copy of inX. The whole line then subtracts each element of dataSet from the corresponding element of this array, which performs the subtraction against all sample points in one step.
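The effect of tile is easier to see on the sample data. The sketch below uses illustrative variable names and also shows that NumPy broadcasting produces the same differences without tile.

```python
import numpy as np

in_x = [0, 0]
data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

# Repeat in_x once per sample along the rows, and once along the columns.
tiled = np.tile(in_x, (data_set.shape[0], 1))
print(tiled.shape)

# Element-wise subtraction: one difference vector per sample.
diff_mat = tiled - data_set
print(diff_mat)

# Broadcasting stretches in_x across the rows and gives the same result.
assert np.array_equal(diff_mat, np.array(in_x) - data_set)
```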
axis=1: when this parameter equals 1, the elements in each row of the matrix are summed, giving one sum per row; when it equals 0, the elements in each column are summed, giving one sum per column.
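A tiny example makes the difference between the two axis values concrete. The matrix here is made up purely for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

row_sums = m.sum(axis=1)     # one sum per row
column_sums = m.sum(axis=0)  # one sum per column
print(row_sums)
print(column_sums)
```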
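argsort(): returns the indices that would sort the array in non-descending order. It does not sort the array itself, which is why the result can be used to look up the matching entries in labels.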
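The following sketch shows argsort on a made-up distance array (the values are illustrative, rounded versions of the distances in the example above).

```python
import numpy as np

distances = np.array([1.49, 1.41, 0.0, 0.1])

order = distances.argsort()
print(order)             # indices of the elements, smallest distance first
print(distances[order])  # the distances in ascending order
print(distances)         # the original array is unchanged
```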
classCount.get(numOfLabel, 0) + 1: get() is a dictionary method for reading an item. Here it reads the item whose key is numOfLabel and, if that key does not exist yet, returns the default value 0. The code then adds 1 and stores the result back under the same key. In Python, this counting step takes just one line of code.
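The counting and voting steps can be tried on their own. In this sketch the list of neighbor labels is made up for illustration; in the real function it comes from the k nearest samples.

```python
import operator

neighbour_labels = ['B', 'B', 'A']  # example labels of the k nearest points

class_count = {}
for label in neighbour_labels:
    # get() returns 0 the first time a label is seen.
    class_count[label] = class_count.get(label, 0) + 1
print(class_count)

# Sort the (label, count) pairs by count, largest first.
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[0][0])  # the winning label
```

The standard library class collections.Counter can do the same counting, but the get() version stays closest to the code above.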
Machine Learning Algorithm One: K-Nearest Neighbor Algorithm