Machine Learning 01: KNN (k-nearest neighbor) algorithm

Source: Internet
Author: User

K-Nearest Neighbor algorithm

Overview: The k-nearest neighbor (kNN) algorithm classifies data by measuring the distance between the feature values of different samples.
Advantages: High accuracy, insensitive to outliers, and no assumptions about the input data.
Disadvantages: High computational complexity and high space complexity. It also gives no insight into the underlying structure of the data.
Algorithm description: We start with a set of samples whose classes are known, called the training sample set. Every sample in it carries its own class label. To classify a new sample, compute the distance between its feature values and those of every training sample, select the k most similar (nearest) training samples, and assign the label that occurs most often among those k.

Detailed classification algorithm
    # -*- coding: utf-8 -*-
    from numpy import *
    import operator

    # A simple kNN implementation.
    # dataSet: the training data; each row holds the feature values of one training sample
    # labels: the class label of each training sample, in the same order as dataSet
    # inX: the feature vector to be classified
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]  # number of training samples
        # Compute the difference between the input vector and every training sample.
        # tile() first expands inX into a matrix with as many rows as the training set.
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # Square the differences, sum them per row, and take the square root
        # (Euclidean distance)
        sqDiffMat = diffMat ** 2
        sumMat = sqDiffMat.sum(axis=1)
        distances = sumMat ** 0.5
        # Get the indices that sort the distances in ascending order,
        # e.g. array([9, 1, 3, 0]).argsort() gives array([3, 1, 2, 0])
        sortedIndices = distances.argsort()
        classCount = {}
        # Count the labels of the k nearest training samples
        for i in range(k):
            label = labels[sortedIndices[i]]
            classCount[label] = classCount.get(label, 0) + 1
        # Return the label with the most votes
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
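To see the voting step on concrete numbers, here is a minimal sketch in plain Python (standard library only, Python 3.8+ for math.dist). The function name knn_classify and the four-point training set are made up for illustration and are not part of the original code.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k training samples nearest to query."""
    # Euclidean distance from the query to every training sample
    distances = [math.dist(query, s) for s in samples]
    # Indices of the k nearest training samples, nearest first
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those k samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: two points per class
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]

print(knn_classify((0.0, 0.2), samples, labels, k=3))  # two of the three nearest are "B"
print(knn_classify((0.9, 1.2), samples, labels, k=3))  # two of the three nearest are "A"
```

With k=3 the vote can never tie between two classes, which is one reason an odd k is a common choice for two-class problems.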
About feature data

Different features vary over different ranges. For example, feature A may take values in [1000, 10000] while feature B takes values in [0, 10]. If such data is fed directly into the kNN algorithm, differences in the wide-range feature contribute far more to the distance than differences in the narrow-range feature. We therefore normalize the feature data so that all feature values fall in the same range.
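The effect is easy to check with numbers. The sketch below uses the two ranges from the paragraph above. The three sample points are hypothetical and chosen only to make the imbalance visible.

```python
import math

# Feature ranges taken from the text: A in [1000, 10000], B in [0, 10]
range_a = 10000.0 - 1000.0
range_b = 10.0 - 0.0

# Hypothetical points, written as (feature A, feature B)
query = (5000.0, 1.0)
differs_in_a = (5600.0, 1.0)  # A differs by 600, B is identical
differs_in_b = (5000.0, 9.0)  # A is identical, B differs by 8

# Raw Euclidean distances: the difference in A dominates
raw_a = math.dist(query, differs_in_a)
raw_b = math.dist(query, differs_in_b)

# The same differences expressed as a fraction of each feature's range
frac_a = 600.0 / range_a
frac_b = 8.0 / range_b

print(raw_a, raw_b)              # 600.0 8.0
print(round(frac_a, 3), frac_b)  # 0.067 0.8
```

On the raw values the second point looks 75 times closer, although it differs by 80% of feature B's range while the first point differs by less than 7% of feature A's range.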


Detailed algorithm: (value - min) / (max - min), which maps every feature value into the [0, 1] range

    from numpy import *

    # Normalize feature values from different ranges into a common [0, 1] range
    def normdata(dataSet):
        # Maximum of each feature (column)
        maxValue = dataSet.max(0)
        # Minimum of each feature (column)
        minValue = dataSet.min(0)
        ranges = maxValue - minValue
        m = dataSet.shape[0]
        # Subtract the minimum so that every feature starts at 0
        normalDataSet = dataSet - tile(minValue, (m, 1))
        # Divide by the range (max - min) so that every feature ends at 1
        normalDataSet = normalDataSet / tile(ranges, (m, 1))
        return normalDataSet, ranges, minValue
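The function returns the ranges and minimums along with the normalized data because a new sample has to be scaled with the same values before it is compared with the training set. A minimal plain-Python sketch of that fit-then-apply pattern follows. The helper names fit_min_max and apply_min_max and the sample values are hypothetical, and mapping a constant feature to 0.0 is an assumption the original code does not make.

```python
def fit_min_max(rows):
    """Return the per-feature minimums and ranges of the training rows."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    return mins, ranges

def apply_min_max(row, mins, ranges):
    """Scale one row with previously computed minimums and ranges."""
    # Assumption: a constant feature (range 0) is mapped to 0.0
    # to avoid dividing by zero
    return [(v - lo) / r if r else 0.0 for v, lo, r in zip(row, mins, ranges)]

# Hypothetical training rows: (feature A, feature B)
train = [[1000.0, 0.0], [5500.0, 5.0], [10000.0, 10.0]]
mins, ranges = fit_min_max(train)

normalized = [apply_min_max(row, mins, ranges) for row in train]
new_point = apply_min_max([3250.0, 2.0], mins, ranges)

print(normalized)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(new_point)   # [0.25, 0.2]
```

A new sample that lies outside the training minimum or maximum will land outside [0, 1]. That is usually acceptable for kNN, since only relative distances matter.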
About visualizing feature data

How can we tell whether the feature data we have collected is suitable for classification with kNN?
When exploring data, we often need to visualize how the feature values and labels are distributed. For this we can use matplotlib, a Python plotting library.
Features and classification data: TestSet.txt
3.542485 1.977398 -1
3.018896 2.556416 -1
7.551510 -1.580030 1
2.114999 -0.004466 -1
8.127113 1.274372 1
7.108772 -0.986906 1
8.610639 2.046708 1
2.326297 0.265213 -1
3.634009 1.730537 -1
0.341367 -0.894998 -1
3.125951 0.293251 -1
2.123252 -0.783563 -1
0.887835 -2.797792 -1
7.139979 -2.329896 1
1.696414 -1.212496 -1
8.117032 0.623493 1
8.497162 -0.266649 1
4.658191 3.507396 -1
8.197181 1.545132 1
1.208047 0.213100 -1
1.928486 -0.321870 -1
2.175808 -0.014527 -1
7.886608 0.461755 1
3.223038 -0.552392 -1
3.628502 2.190585 -1
7.407860 -0.121961 1
7.286357 0.251077 1

Visualization script:

    from numpy import *
    import matplotlib.pyplot as plt

    # Read the file
    with open('TestSet.txt') as fr:
        lines = fr.readlines()
    dataSet = zeros((len(lines), 2))  # two feature columns per sample
    labels = []
    index = 0
    for line in lines:
        items = line.strip().split()  # split on whitespace (tabs or spaces)
        dataSet[index, :] = [float(v) for v in items[0:2]]
        labels.append(float(items[-1]))  # numeric label, -1 or 1
        index += 1

    # Plot
    fx = plt.figure()
    ax = fx.add_subplot(111)
    # colora scales the labels so that each class maps to a different color.
    # The scale factor and the point size were lost in the source text;
    # 15 and 50 are assumed values.
    colora = tile(15, len(lines))
    # cmap is the color map, s is the size of each point,
    # alpha is the transparency of each point
    ax.scatter(dataSet[:, 0], dataSet[:, 1], c=colora * array(labels), cmap='autumn', s=50, alpha=0.3)
    plt.show()
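When no plotting environment is available, a rough numeric check can complement the scatter plot. The sketch below is an illustration rather than part of the original script: it groups a few rows of TestSet.txt by label and prints the span of the first feature for each class. The rows are copied from the listing above with whitespace between columns, and the variable names are hypothetical.

```python
from collections import defaultdict

# A few rows from TestSet.txt (columns: feature 1, feature 2, label)
raw = """\
3.542485 1.977398 -1
7.551510 -1.580030 1
2.114999 -0.004466 -1
8.127113 1.274372 1
0.341367 -0.894998 -1
7.108772 -0.986906 1
"""

# Group the (feature 1, feature 2) pairs by their integer label
by_label = defaultdict(list)
for line in raw.splitlines():
    x, y, label = line.split()
    by_label[int(label)].append((float(x), float(y)))

# Per class: number of points and the span of the first feature
for label, points in sorted(by_label.items()):
    xs = [p[0] for p in points]
    print(label, len(points), min(xs), max(xs))
```

For these rows the first feature of class -1 stays below 4 and that of class 1 stays above 7, so the two classes do not overlap along that axis. This is the kind of separation the plot is meant to reveal.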
