Overview of the k-nearest neighbor algorithm
The k-nearest neighbor (kNN) algorithm classifies data by measuring the distances between feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
Disadvantages: high computational complexity, high space complexity
Applicable data types: numeric and nominal
How it works: There is a collection of sample data (also known as the training sample set), and every item in it has a label, that is, we know which category each item in the sample set belongs to. When new, unlabeled data is entered, each feature of the new data is compared with the corresponding feature of the data in the sample set, and the algorithm extracts the category labels of the most similar data (the nearest neighbors) in the sample set. In general, only the k most similar items in the sample data set are considered, which is where the k in k-nearest neighbor comes from; k is usually a small integer, typically no greater than 20. Finally, the category that occurs most often among the k most similar items is chosen as the classification of the new data.
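The idea can be sketched in a few lines of plain Python. This is only a minimal illustration and not the implementation analyzed in these notes: the function name knn_predict is hypothetical, Euclidean distance is assumed as the similarity measure, and the four sample points mirror the small data set used in knn.py.

```python
import math
from collections import Counter

def knn_predict(new_point, samples, labels, k):
    """Hypothetical helper: classify new_point by majority vote of its k nearest samples."""
    # Euclidean distance from the new point to every labeled sample
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(new_point, sample)))
        for sample in samples
    ]
    # Indices of the samples, ordered from nearest to farthest, cut to the first k
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # The most frequent label among the k nearest samples wins
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two points near (1, 1) labeled 'A' and two points near (0, 0) labeled 'B'
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

print(knn_predict((0.2, 0.1), samples, labels, k=3))  # B
```

With k=3 the three nearest samples carry the labels B, B and A, so the vote goes to B.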
k-nearest neighbor algorithm code analysis:
For each point in the data set whose category is unknown, do the following:
(1) Calculate the distance between every point in the data set of known categories and the current point;
(2) Sort the points in ascending order of distance;
(3) Select the k points with the smallest distance from the current point;
(4) Determine how often each category occurs among these k points;
(5) Return the category that occurs most often among these k points as the predicted classification of the current point.
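Steps (2) and (3) depend on sorting by distance while still remembering which point each distance belongs to. A small sketch of that idea, assuming NumPy is available and using made-up distance values:

```python
import numpy as np

# Illustrative distances from the current point to four known points
distances = np.array([1.49, 1.41, 0.0, 0.1])
k = 3

# argsort returns the indices that would sort the array in ascending order,
# so the positions of the original points are preserved
sortedDistIndicies = distances.argsort()
print(sortedDistIndicies.tolist())      # [2, 3, 1, 0]

# The first k indices identify the k nearest points
print(sortedDistIndicies[:k].tolist())  # [2, 3, 1]
```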
The code is as follows:
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                    # number of known samples
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distances
    sortedDistIndicies = distances.argsort()          # indices, nearest first
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Code annotations:
[1] shape[0] gives the number of rows of the matrix; shape[1] gives the number of columns.
[2] tile(inX, (dataSetSize, 1)) repeats the array inX dataSetSize times along the rows and once along the columns. For example, if inX is [0, 0], the result of tile is:
[0, 0]
[0, 0]
[0, 0]
[0, 0]
... (dataSetSize rows in total)
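This can be checked on a small array. The sketch below assumes NumPy is installed and uses an illustrative 4x2 data set. It also shows that NumPy broadcasting yields the same difference matrix without tile; that is an alternative, not what classify0 does.

```python
import numpy as np

inX = np.array([0, 0])
# Illustrative data set with four samples and two features
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

# Repeat inX once per row of dataSet (4 times down, 1 time across)
tiled = np.tile(inX, (dataSet.shape[0], 1))
print(tiled.shape)  # (4, 2)

# Element-wise difference between the tiled point and every sample
diffMat = tiled - dataSet

# Broadcasting stretches inX across the rows automatically
diffBroadcast = inX - dataSet
print(np.array_equal(diffMat, diffBroadcast))  # True
```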
[3] ** is the power operator; diffMat**2 squares every element of diffMat. For example: array([1, 2])**2 = array([1, 4])
[4] sqDiffMat.sum(axis=1) adds up the elements of each row of the array, and these sums form a new array:
Example:
>>> a = array([[1, 2], [2, 4]])
>>> s = a.sum(axis=1)
>>> s
array([3, 6])
>>> a = array([[1, 2, 3], [2, 3, 4]])
>>> s = a.sum(axis=1)
>>> s
array([6, 9])
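Squaring, summing each row, and taking the square root together give the Euclidean distance for every row. A short worked sketch, assuming NumPy and using made-up difference values chosen so the results are whole numbers:

```python
import numpy as np

# Illustrative differences, one row per known sample
diffMat = np.array([[3.0, 4.0], [6.0, 8.0]])

sqDiffMat = diffMat ** 2              # element-wise square: [[9, 16], [36, 64]]
sqDistances = sqDiffMat.sum(axis=1)   # one sum per row: [25, 100]
distances = sqDistances ** 0.5        # Euclidean distance per row

print(distances.tolist())  # [5.0, 10.0]
```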
However, if the array is one-dimensional, such as array([1, 2]), sum(axis=1) cannot be used because the array has no axis 1; use sum() instead.
[5] classCount = {} creates a new dict. A dict provides the get method: if the key does not exist, get returns None or a value you specify. Here classCount.get(voteIlabel, 0) returns 0 when the dict has no entry for the key voteIlabel.
For example:
>>> d = {'Michael': 95, 'Bob': 75, 'Tracy': 85}
>>> d['Michael']
95
>>> d['Thomas']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Thomas'
There are two ways to avoid the error raised for a key that does not exist. The first is to check whether the key exists by using in:
>>> 'Thomas' in d
False
The second is the get method provided by dict: if the key does not exist, it returns None or a value you specify:
>>> d.get('Thomas')
>>> d.get('Thomas', -1)
-1
Note: when None is returned, the Python interactive interpreter displays no result.
[6] sorted() sorts the items of the classCount dictionary by their second element (that is, the number of occurrences of each category), from largest to smallest.
Test the code to see the result:
knn.py file:
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
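A test run might look like the following. To keep the snippet runnable on its own, it repeats the two functions from knn.py instead of importing the module, and it uses items() so that it also runs on Python 3. The query points [0, 0] and [1.0, 1.2] and the choice k=3 are assumptions made for illustration.

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Most frequent label first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))      # B
print(classify0([1.0, 1.2], group, labels, 3))  # A
```

For the point [0, 0] the three nearest samples are [0, 0], [0, 0.1] and [1.0, 1.0], with labels B, B and A, so the predicted category is B.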
Machine learning notes (II): k-nearest neighbor algorithm