Machine Learning 01: The kNN (k-Nearest Neighbor) Algorithm

Last Update:2018-01-15 Source: Internet

Author: User

K-Nearest Neighbor algorithm

Overview: The k-nearest neighbor (kNN) algorithm classifies a sample by measuring the distance between its feature values and those of samples whose class is already known.
Advantages: High accuracy, insensitive to outliers, and no assumptions about the input data.
Disadvantages: High computational complexity and high space complexity; it also reveals nothing about the underlying structure of the data.
Algorithm description: We start from a data set of samples whose classes are known, called the training sample set. Every sample in it carries its own class label. To classify a new sample, compute the distance between its feature values and those of every training sample, select the k most similar (nearest) training samples, and take the label that occurs most often among those k as the label of the new sample.

Detailed classification algorithm

    # -*- coding: utf-8 -*-
    from numpy import *
    import operator

    # A simple kNN implementation.
    # dataSet: training data; each row holds the feature values of one training sample
    # labels:  class label of each training sample, in the same order as dataSet
    # inX:     feature values of the sample to classify
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]  # number of training samples
        # Difference between the input and every training sample.
        # tile repeats inX so that it has the same shape as the training set.
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # Square, sum over each row, take the square root: Euclidean distance
        sqDiffMat = diffMat ** 2
        sumMat = sqDiffMat.sum(axis=1)
        distances = sumMat ** 0.5
        # argsort returns the indices that sort the distances in ascending order,
        # for example array([9, 1, 3, 0]).argsort() gives array([3, 1, 2, 0])
        sortedIndices = distances.argsort()
        classCount = {}
        # Count the labels of the k nearest training samples
        for i in range(k):
            label = labels[sortedIndices[i]]
            classCount[label] = classCount.get(label, 0) + 1
        # Return the label with the highest count
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
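The same steps (distance, sort, majority vote) can also be sketched with only the Python standard library, which makes it easy to try on a few points. This is an illustrative sketch, not the article's implementation: the function name knn_classify and the toy data are assumptions, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k training samples nearest to query."""
    # Euclidean distance from the query to every training sample
    distances = [math.dist(query, s) for s in samples]
    # Indices of the training samples, nearest first, truncated to k
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of the k nearest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labeled 'A' and 'B'
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

print(knn_classify((0.1, 0.2), samples, labels, k=3))  # B
print(knn_classify((0.9, 1.2), samples, labels, k=3))  # A
```

With k=3 each query has two neighbors from its own cluster and one from the other, so the vote is decided 2 to 1.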

About feature data

Different features fluctuate over different value ranges. For example, feature A may take values in [1000, 10000] while feature B takes values in [0, 10]. If such feature data is fed directly into the kNN algorithm, the feature with the larger range contributes far more to the distance than the feature with the smaller range. We therefore need to normalize the feature data so that all feature values fall in the same range.
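A small numeric sketch of the problem, using made-up points in the two ranges mentioned above (all values here are hypothetical). One point differs from the query only slightly in feature A; the other differs across almost the whole range of feature B, yet the raw Euclidean distance ranks it as the closer one.

```python
import math

# Hypothetical samples: (feature A in [1000, 10000], feature B in [0, 10])
query = (5000.0, 1.0)
close_in_b = (5600.0, 1.0)   # same B value, A differs by 600 (a small part of A's range)
far_in_b = (5100.0, 10.0)    # A differs by 100, B differs by 9 (most of B's range)

d_close_in_b = math.dist(query, close_in_b)
d_far_in_b = math.dist(query, far_in_b)

# Share of the squared distance to far_in_b that comes from feature B
b_share = (far_in_b[1] - query[1]) ** 2 / d_far_in_b ** 2

print(d_close_in_b)  # 600.0
print(d_far_in_b)    # about 100.4
print(b_share)       # under 1 percent: feature B barely matters
```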

Formula: (value - min) / (max - min), which maps every feature value into the interval [0, 1].

    from numpy import *

    # Normalize features with different value ranges to the common interval [0, 1]
    def normdata(dataSet):
        # Maximum of each feature (column)
        maxValue = dataSet.max(0)
        # Minimum of each feature (column)
        minValue = dataSet.min(0)
        ranges = maxValue - minValue
        m = dataSet.shape[0]
        # Subtract the minimum of each feature ...
        normalDataSet = dataSet - tile(minValue, (m, 1))
        # ... and divide by its range (max - min)
        normalDataSet = normalDataSet / tile(ranges, (m, 1))
        return normalDataSet, ranges, minValue
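One reason to return the ranges and minimums along with the normalized data is that a new sample has to be scaled with the same values before it is compared with the training set. The sketch below shows this with plain lists; the names min_max_normalize and normalize_query and the sample rows are assumptions made for illustration, and it assumes every feature takes at least two distinct values (otherwise the range is zero and the division fails).

```python
def min_max_normalize(rows):
    """Scale every column of rows to [0, 1]; also return per-column ranges and minimums."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    normalized = [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)]
                  for row in rows]
    return normalized, ranges, mins

def normalize_query(query, ranges, mins):
    """Scale a new sample with the ranges and minimums of the training data."""
    return [(v - lo) / r for v, lo, r in zip(query, mins, ranges)]

# Hypothetical training rows: feature A in [1000, 10000], feature B in [0, 10]
rows = [[1000.0, 0.0], [5500.0, 5.0], [10000.0, 10.0]]
normalized, ranges, mins = min_max_normalize(rows)

print(normalized)                                     # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(normalize_query([3250.0, 2.5], ranges, mins))   # [0.25, 0.25]
```

A query outside the training range maps outside [0, 1]; whether to clip such values is a design choice the article does not address.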

About visualizing feature data

How can we tell whether the feature data we have collected is suitable for classification with kNN?
When exploring data, we often need to visualize how the feature values and labels are distributed. For this we can use matplotlib, a Python plotting library.
Feature and label data (TestSet.txt); each row holds two feature values followed by the label:
3.542485 1.977398 -1
3.018896 2.556416 -1
7.551510 -1.580030 1
2.114999 -0.004466 -1
8.127113 1.274372 1
7.108772 -0.986906 1
8.610639 2.046708 1
2.326297 0.265213 -1
3.634009 1.730537 -1
0.341367 -0.894998 -1
3.125951 0.293251 -1
2.123252 -0.783563 -1
0.887835 -2.797792 -1
7.139979 -2.329896 1
1.696414 -1.212496 -1
8.117032 0.623493 1
8.497162 -0.266649 1
4.658191 3.507396 -1
8.197181 1.545132 1
1.208047 0.213100 -1
1.928486 -0.321870 -1
2.175808 -0.014527 -1
7.886608 0.461755 1
3.223038 -0.552392 -1
3.628502 2.190585 -1
7.407860 -0.121961 1
7.286357 0.251077 1

Visualization script:

    from numpy import *
    import matplotlib
    import matplotlib.pyplot as plt

    # Read the file
    fr = open('TestSet.txt')
    lines = fr.readlines()
    fr.close()
    dataSet = zeros((len(lines), 2))
    labels = []
    index = 0
    for line in lines:
        # split() with no argument accepts tabs as well as spaces between columns
        items = line.strip().split()
        dataSet[index, :] = [float(v) for v in items[0:2]]
        labels.append(int(items[-1]))
        index += 1

    # Plot with matplotlib
    fx = plt.figure()
    ax = fx.add_subplot(111)
    # colorA scales the labels so that different labels get different colors.
    # The scale factor and the point size are missing in the source; 15 is an assumed value.
    colorA = tile(15, len(lines))
    # cmap is the color map, s is the size of each point, alpha is the transparency of each point
    ax.scatter(dataSet[:, 0], dataSet[:, 1], c=colorA * array(labels), cmap='autumn', s=15, alpha=0.3)
    plt.show()
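As a rough numeric complement to the plot, the parsing step can be tried on its own. The sketch below parses a few rows taken from TestSet.txt (written with whitespace between the columns) and compares the range of the first feature for each label. The helper name parse_rows is an assumption, and non-overlapping ranges are only a hint that the classes are well separated, not a substitute for looking at the full scatter plot.

```python
from collections import defaultdict

# A few rows from TestSet.txt; columns are feature 1, feature 2, label
raw = """3.542485 1.977398 -1
3.018896 2.556416 -1
7.551510 -1.580030 1
2.114999 -0.004466 -1
8.127113 1.274372 1
7.108772 -0.986906 1"""

def parse_rows(text):
    """Split each line on whitespace into two float features and an integer label."""
    points, labels = [], []
    for line in text.splitlines():
        items = line.split()
        if not items:
            continue  # skip blank lines
        points.append((float(items[0]), float(items[1])))
        labels.append(int(items[-1]))
    return points, labels

points, labels = parse_rows(raw)

# Collect the first feature of every point, grouped by label
by_label = defaultdict(list)
for (x, _), label in zip(points, labels):
    by_label[label].append(x)

for label in sorted(by_label):
    xs = by_label[label]
    print(label, min(xs), max(xs))
```

For these six rows the first feature of label -1 stays below 4 and that of label 1 stays above 7, which matches the two separate groups one would expect to see in the plot.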


