K-Nearest Neighbor algorithm
Overview: the k-nearest-neighbor (kNN) algorithm classifies a sample by measuring the distance between its feature values and those of samples with known labels.
Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity, and it gives no insight into the internal structure of the underlying data.
Algorithm description: start with an accurately labelled data set, called the training sample set; every item in it carries a category label. To classify a new sample, compute the distance between its feature values and those of every training sample, select the k most similar (nearest) samples, and assign the label that occurs most often among those k.
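The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the post's full implementation; the toy points and labels below are invented for the example:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every training sample
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = dists.argsort()[:k]
    # Most common label among those k neighbours
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy data: two small clusters
train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.9, 1.2], [5.0, 5.1], [5.2, 4.9]])
train_y = ['A', 'A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, np.array([1.1, 0.9]), k=3))  # -> 'A'
```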
Detailed classification algorithm
# -*- coding: utf-8 -*-
from numpy import *
import operator

# Simple kNN implementation.
# dataSet: the training set; each row holds the feature values of one training sample
# labels: the class label of each row in dataSet
# inX: the feature vector to classify
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # Compute the distance between inX and every training sample:
    # tile first expands inX into a matrix the same size as the training set
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2             # square the differences
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort the distances ascending,
    # e.g. array([9, 1, 3, 0]) -> array([3, 1, 2, 0])
    sortedIndices = distances.argsort()
    classCount = {}
    # Count the labels of the k nearest neighbours
    for i in range(k):
        label = labels[sortedIndices[i]]
        classCount[label] = classCount.get(label, 0) + 1
    # Return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
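The two NumPy helpers doing the heavy lifting in classify0 are tile, which repeats inX so it can be subtracted from every training row at once, and argsort, which returns the indices that would sort the distances. A small demonstration with invented values:

```python
from numpy import tile, array

inX = array([0.0, 0.0])
dataSet = array([[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]])

expanded = tile(inX, (3, 1))           # stack inX into 3 identical rows
diff = expanded - dataSet
sqDistances = (diff ** 2).sum(axis=1)  # squared distance to each row
print(sqDistances)                     # [2. 4. 9.]
print(sqDistances.argsort())           # [0 1 2] -> row 0 is the nearest
```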
About feature data
Different features fluctuate over very different intervals. For example, feature A may range over [1000, 10000] while feature B ranges over [0, 10]. If this raw feature data is fed directly into the kNN distance computation, the feature with the larger range completely dominates the one with the smaller range. We therefore normalize the feature data so that all features fall into the same range.
The formula: (value - min) / (max - min), which maps every feature into the [0, 1] interval.
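For instance, with the [1000, 10000] feature mentioned above, a raw value of 5500 maps to (5500 - 1000) / (10000 - 1000) = 0.5 (the sample values here are chosen for illustration):

```python
def min_max(value, lo, hi):
    """Map value from the interval [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

print(min_max(5500, 1000, 10000))  # 0.5
print(min_max(7, 0, 10))           # 0.7
```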
from numpy import *

# Normalize features that span different ranges into the common interval [0, 1]
def normData(dataSet):
    maxValue = dataSet.max(0)   # per-feature maximum
    minValue = dataSet.min(0)   # per-feature minimum
    ranges = maxValue - minValue
    m = dataSet.shape[0]
    # Subtract the minimum, then divide by the range
    normDataSet = dataSet - tile(minValue, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minValue
About visualizing feature data
How can we tell whether the feature data we have collected is suitable for kNN classification?
When inspecting data, it is often helpful to visualize the distribution of the features and labels; for this we use the Python plotting library matplotlib.
Features and classification data: TestSet.txt
3.542485   1.977398   -1
3.018896   2.556416   -1
7.551510   -1.580030  1
2.114999   -0.004466  -1
8.127113   1.274372   1
7.108772   -0.986906  1
8.610639   2.046708   1
2.326297   0.265213   -1
3.634009   1.730537   -1
0.341367   -0.894998  -1
3.125951   0.293251   -1
2.123252   -0.783563  -1
0.887835   -2.797792  -1
7.139979   -2.329896  1
1.696414   -1.212496  -1
8.117032   0.623493   1
8.497162   -0.266649  1
4.658191   3.507396   -1
8.197181   1.545132   1
1.208047   0.213100   -1
1.928486   -0.321870  -1
2.175808   -0.014527  -1
7.886608   0.461755   1
3.223038   -0.552392  -1
3.628502   2.190585   -1
7.407860   -0.121961  1
7.286357   0.251077   1
Visual script:
from numpy import *
import matplotlib
import matplotlib.pyplot as plt

# Read the data file
fr = open('TestSet.txt')
lines = fr.readlines()
dataSet = zeros((len(lines), 2))
labels = []
index = 0
for line in lines:
    items = line.strip().split('\t')
    dataSet[index, :] = items[0:2]
    labels.append(float(items[-1]))
    index += 1

fig = plt.figure()
ax = fig.add_subplot(111)
# colora distinguishes the labels by colour: multiplying a constant by the +1/-1
# labels yields two distinct values for the colour map. cmap selects the colour
# map, s is the point size, alpha the transparency.
colora = tile(30, len(lines))
ax.scatter(dataSet[:, 0], dataSet[:, 1], c=colora * array(labels), cmap='autumn', s=30, alpha=0.3)
plt.show()
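An alternative to the single scatter call is to split the points by label and plot each class separately, which gives a legend for free. A sketch with a few rows inlined instead of reading TestSet.txt (the points are taken from the data set above; the output filename is arbitrary):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

data = [(3.542485, 1.977398, -1), (7.551510, -1.580030, 1),
        (2.114999, -0.004466, -1), (8.127113, 1.274372, 1)]

# Partition the points by their class label
pos = [(x, y) for x, y, lbl in data if lbl == 1]
neg = [(x, y) for x, y, lbl in data if lbl == -1]

fig, ax = plt.subplots()
ax.scatter(*zip(*pos), c='red', label='+1', alpha=0.5)
ax.scatter(*zip(*neg), c='blue', label='-1', alpha=0.5)
ax.legend()
fig.savefig('testset.png')
```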