Machine Learning in Action (Chapter 2: k-Nearest Neighbors algorithm)
Today I studied the second chapter. Here is a brief summary to deepen my understanding and to describe the algorithm in my own words.
Distance Calculation
The distance between instances is computed as the Euclidean (L2) distance in vector space.
More generally, any Lp distance can be used, for example the L1 (Manhattan) distance.
Put simply: in a large sample set, each instance has 3 or more attributes. One of them is necessarily the class attribute, and the rest are numeric feature attributes (even a nominal attribute can be converted to numeric by some means). Each instance is thus a vector of feature values, and the sample set is a collection of such vectors.
For example:

| Height | Weight | Age | Gender |
|--------|--------|-----|--------|
| 170    | 140    | 22  | Male   |
| 160    | 100    | 21  | Female |
"Gender" can be viewed as a categorical attribute, and then others look at feature attributes, forming an instance vector for [170,140,22] and [160,100,21]
First, the algorithm steps:
1. Compute the distance between the current point and every point in the dataset of known categories;
2. Sort the points in order of increasing distance;
3. Select the k points closest to the current point;
4. Count the frequency of each category among those k points. (k is the number of nearest neighbors to use, and the choice of k is quite sensitive: a smaller k means a more complex model that overfits easily, while a larger k means a simpler model with a larger approximation error. In practice a relatively small k is usually chosen, and cross-validation is used to pick the best value.)
5. Return the most frequent category among the k points as the predicted class of the current point.
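Step 4 mentions choosing k by cross-validation. A minimal leave-one-out sketch on a made-up toy dataset (the data, `knn_predict`, and `loo_error` are all illustrative, not from the book):

```python
import numpy as np

def knn_predict(x, X, y, k):
    # L2 distance from x to every training point, then majority vote among the k nearest
    d = np.sqrt(((X - x) ** 2).sum(axis=1))
    nearest = y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def loo_error(X, y, k):
    # leave-one-out cross-validation: classify each point using all the others
    errs = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        if knn_predict(X[i], X[mask], y[mask], k) != y[i]:
            errs += 1
    return errs / len(X)

# toy 2-D data: two well-separated clusters
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.9, 1.2],
              [0.0, 0.0], [0.1, 0.1], [0.0, 0.2]])
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

# pick the k with the lowest leave-one-out error
best_k = min([1, 3, 5], key=lambda k: loo_error(X, y, k))
```

On this toy data a small k already classifies every held-out point correctly; on real data the error curve over k is what guides the choice.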
Second, a worked example (Python). First, implement a kNN classifier function following the steps above:
import operator
from numpy import tile

'''
inX:     input vector to classify
dataSet: training sample set
labels:  label vector
k:       number of nearest neighbors to use
'''
def classify0(inX, dataSet, labels, k):
    # distance calculation
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices sorted by increasing distance
    # select the k points with the smallest distance
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort categories by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
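A quick usage check of classify0 on the book's toy dataset from createDataSet (the function body is repeated here, condensed, so the snippet runs on its own):

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # same logic as above: L2 distances, then majority vote among the k nearest
    distances = np.sqrt(((np.tile(inX, (dataSet.shape[0], 1)) - dataSet) ** 2).sum(axis=1))
    classCount = {}
    for i in distances.argsort()[:k]:
        classCount[labels[i]] = classCount.get(labels[i], 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# the book's toy dataset: two points per class
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))  # a point sitting on the B cluster -> 'B'
```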
Next, a concrete example: handwritten digit recognition.
Data description:
Each sample is a 32x32 image stored as a txt text file. We need to convert each sample into vector form: each 32x32 image becomes a 1x1024 vector, and all training samples are stacked into a matrix, one row per instance.
# convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec
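To see img2vector in action without the book's digits dataset, a throwaway 32x32 text file can be generated (the fake file and its path are purely illustrative; the real files live under digits/trainingDigits). The function is repeated so the snippet is self-contained:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    # same as above: flatten a 32x32 text image into a 1x1024 row vector
    returnVec = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec

# build a fake 32x32 "image": a single diagonal of 1s
lines = []
for i in range(32):
    row = ['0'] * 32
    row[i] = '1'
    lines.append(''.join(row))
path = os.path.join(tempfile.mkdtemp(), 'fake_digit.txt')
with open(path, 'w') as f:
    f.write('\n'.join(lines) + '\n')

vec = img2vector(path)
print(vec.shape, int(vec.sum()))  # (1, 1024) 32
```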
Using the classify0 function written above, we load the formatted training data, classify the test set, and compute the error rate.
import os

def handwritingClassTest():
    hwLabels = []
    # load the training set: list all file names in the directory
    trainingFileList = os.listdir('digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]         # e.g. '7_173.txt'
        fileStr = fileNameStr.split('.')[0]       # '7_173'
        classNumStr = int(fileStr.split('_')[0])  # 7
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/' + fileNameStr)
    # load the test set
    testFileList = os.listdir('digits/testDigits')
    errorCount = 0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]       # '7_173'
        classNumStr = int(fileStr.split('_')[0])  # 7
        vectorUnderTest = img2vector('digits/testDigits/' + fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
Summary:
Advantages:
1. High accuracy.
2. Insensitive to outliers.
3. Makes no assumptions about the distribution of the data.
Disadvantages:
1. kNN has no explicit training phase the way other algorithms do; all the work is deferred to prediction time.
2. kNN can have large errors on class-imbalanced training sets.
3. The computational cost is high: every test sample must be compared against every training sample to compute distances.
4. It tells us nothing about what a typical instance of each class looks like, and gives no information about the underlying structure of the data.