Experimental Purposes
I recently decided to start learning machine learning systematically. I bought a few books and gathered plenty of material to practice on; this series is a record of my learning process, starting from the most basic algorithm, KNN.
Experiment Introduction
Language: Python
GitHub Address: LUUUYI/KNN
Experiment Steps
1) Principle Introduction
The K-Nearest Neighbor (KNN) algorithm is a basic classification and regression method. The idea is simple: given a training dataset and a new input instance, find the K instances in the training set that are closest to the new instance; the class to which the majority of those K instances belong is the class assigned to the input instance.
For example, let's look at the following figure:
If k=3, the 3 nearest neighbors of the green dot are 2 red triangles and 1 blue square. By majority vote, the green dot is classified into the red triangle class.
If k=5, the 5 nearest neighbors of the green dot are 2 red triangles and 3 blue squares, so by the same majority vote the green dot is classified into the blue square class. This is the most basic "minority obeys the majority" idea, but it also shows that different values of k give different results for the green dot. So in practical applications, choosing an appropriate value of K, and deciding which distance metric to measure "nearest" by, are the core of the K nearest neighbor algorithm.
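To make the majority vote concrete, here is a tiny sketch (hypothetical neighbor labels, not taken from a real dataset) that counts the labels of the k nearest neighbors with collections.Counter:

from collections import Counter

# labels of the k nearest neighbors of the green dot (hypothetical)
neighbors_k3 = ['red', 'red', 'blue']
neighbors_k5 = ['red', 'red', 'blue', 'blue', 'blue']

print(Counter(neighbors_k3).most_common(1)[0][0])  # 'red'  -> classified as red triangle
print(Counter(neighbors_k5).most_common(1)[0][0])  # 'blue' -> classified as blue square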
2) Simple implementation
Look at the following two functions:
import operator
from numpy import array, tile

def createTmpData():
    # generate a small temporary dataset: 4 points with 2 features each
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(in_data, datas, labels, k):
    datas_height = datas.shape[0]
    # Euclidean distance from in_data to every sample in the training set
    diff_array = tile(in_data, [datas_height, 1]) - datas
    diff_array_power2 = diff_array ** 2
    distance_array = diff_array_power2.sum(axis=1)
    distance_array = distance_array ** 0.5
    # indices of the samples sorted by ascending distance
    sorted_index = distance_array.argsort()
    # vote among the k nearest neighbors
    labels_dict = {}
    for i in range(k):
        label = labels[sorted_index[i]]
        labels_dict[label] = labels_dict.get(label, 0) + 1
    sorted_datas = sorted(labels_dict.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_datas[0][0]
The first function simply generates a temporary test dataset. The second function is the core of a streamlined KNN algorithm: its inputs are the data to be classified, the sample dataset with its labels, and the all-important k value. The distance used here is the Euclidean distance, i.e. the square root of the sum of the squared differences in each dimension. To keep the algorithm easy to understand and implement, the samples here have only two features, but in real applications an object may have far more than two.
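A quick sanity check of the two functions above (a sketch, using the cleaned-up names createTmpData and classify0):

group, labels = createTmpData()
# a point near (0, 0) should be labelled 'B', a point near (1, 1) should be 'A'
print(classify0(array([0.1, 0.1]), group, labels, 3))  # expected: 'B'
print(classify0(array([0.9, 1.0]), group, labels, 3))  # expected: 'A'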
3) Experimental Application
After the simple implementation of the algorithm, we turn to a dating dataset. Each record consists of three features (annual flight distance, percentage of time spent on entertainment per day, and amount of ice cream consumed per year) plus the person's label (worth dating, so-so, not worth dating):
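The post later calls a loadDataFromFile helper without showing its body; a minimal sketch, assuming the file is tab-separated with three numeric feature columns followed by an integer label (as in datingTestSet2.txt), could look like this:

from numpy import zeros

def loadDataFromFile(filename):
    # parse a tab-separated file: three feature columns, then an integer label
    with open(filename) as f:
        lines = f.readlines()
    datas = zeros((len(lines), 3))
    labels = []
    for i, line in enumerate(lines):
        parts = line.strip().split('\t')
        datas[i, :] = [float(x) for x in parts[0:3]]
        labels.append(int(parts[-1]))
    return datas, labels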
Before feeding the data to the algorithm, some preprocessing is needed. Looking at the data above, the annual flight distance feature is orders of magnitude larger than the other two features, so its effect on the distance would drown out the other two, which does not fit the idea of the KNN algorithm. Therefore, after reading the data, we normalize each feature to [0, 1] with (value - min) / (max - min):
def autoNormal(datas):
    height = datas.shape[0]
    # column-wise max and min of each feature
    max_array = datas.max(0)
    min_array = datas.min(0)
    range_array = max_array - min_array
    # scale every feature to [0, 1]: (value - min) / (max - min)
    normaled_data = (datas - tile(min_array, (height, 1))) / tile(range_array, (height, 1))
    return normaled_data, range_array, min_array
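A quick check (a sketch, continuing from the functions above) that the normalization really maps every feature into [0, 1]:

datas, labels = loadDataFromFile('datingTestSet2.txt')
normaled, ranges, mins = autoNormal(datas)
print(normaled.min(0))  # expected: [0. 0. 0.]
print(normaled.max(0))  # expected: [1. 1. 1.]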
After normalization, thanks to Python's rich libraries, we can visualize the data with Matplotlib (two of the three features are plotted):
import matplotlib.pyplot as plt

def drawImage(datas, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # scatter of features 1 and 2; the label value drives point size and color
    ax.scatter(datas[:, 1], datas[:, 2], 5.0 * array(labels), 15.0 * array(labels))
    plt.show()
The three colors represent dislike, so-so, and like; the horizontal and vertical axes are two of the three features.
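Using the normalized data and labels from the check above, the plot is produced with a single call (a sketch; drawImage expects numeric labels such as 1/2/3 so they can be mapped to point size and color):

drawImage(normaled, labels)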
Combining this with the core algorithm implemented above, we use 90% of the data to build the model and the remaining 10% to test it, then measure how well the model classifies, i.e. its error rate. The evaluation method chosen here is 0-1 loss, so the test error is simply the fraction of misclassified samples.
def datingClassTest():
    datas, labels = loadDataFromFile('datingTestSet2.txt')
    height = datas.shape[0]
    # hold out the first 10% of the samples as the test set
    ratio = 0.1
    test_nums = int(ratio * height)
    normaled_datas, range_array, min_array = autoNormal(datas)
    test_datas = normaled_datas[0:test_nums, :]
    error_count = 0.0
    for i in range(test_nums):
        result_label = classify0(test_datas[i], normaled_datas[test_nums:, :], labels[test_nums:], 3)
        print("the class result is: %d, the real answer is: %d" % (result_label, labels[i]))
        if result_label != labels[i]:
            error_count += 1.0
    print("the total error rate is: %f" % (error_count / float(test_nums)))
After running it, you can look at the output:
An error rate of 5% means the model classifies correctly more than 90% of the time, which is acceptable. By repeatedly adjusting the parameter k you can observe how the error rate changes and finally choose the most appropriate value.
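Such a parameter sweep could look like the following sketch (sweepK is a hypothetical helper that reuses the functions above):

def sweepK(filename='datingTestSet2.txt', ratio=0.1, k_values=(1, 3, 5, 7, 9)):
    datas, labels = loadDataFromFile(filename)
    normaled, ranges, mins = autoNormal(datas)
    test_nums = int(ratio * normaled.shape[0])
    # evaluate the hold-out error rate for each candidate k
    for k in k_values:
        errors = 0.0
        for i in range(test_nums):
            result = classify0(normaled[i], normaled[test_nums:, :], labels[test_nums:], k)
            if result != labels[i]:
                errors += 1.0
        print('k = %d, error rate = %f' % (k, errors / test_nums))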
Finally, a function that lets you enter the features by hand and get a prediction:
def datingClassPerson():
    result_labels = ['not at all', 'a little like', 'like']
    # read the three features from the console
    percents_of_play = float(input('enter your percents_of_play: '))
    fly_distance = float(input('enter your fly_distance: '))
    ice = float(input('enter your ice: '))
    datas, labels = loadDataFromFile('datingTestSet2.txt')
    normaled_datas, range_array, min_array = autoNormal(datas)
    # normalize the new point with the same min/range as the training data
    test_data = array([fly_distance, percents_of_play, ice])
    classed_label = classify0((test_data - min_array) / range_array, normaled_datas, labels, 3)
    print("the result is: %s" % result_labels[classed_label - 1])
The results are:
4) Matters needing attention
How should we choose the value of k in K nearest neighbor?
Take a look at this picture below:
There are two classes in the picture above: one is the black dots, the other is the blue rectangles, and the point we want to classify is the red pentagon.
If we choose k=1, the nearest sample to the red point is a black one, so by the KNN rule the point is classified directly into the black class. In other words, when k is very small the model easily learns noise and is easily misled into the class of a noisy sample. When k takes a middle value, most of the samples within range are blue, and the point is classified as blue. But increasing k does not keep making the model more reliable; consider the extreme case where k equals the size of the whole dataset, i.e. k=N:
In that case the result for any input is simply whichever class has the most samples in the training set, so such a classification is clearly not reliable.
Therefore, in the KNN algorithm the value of k is usually chosen by cross validation: the sample data is split into training and validation parts and reused repeatedly to pick the best k.
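A minimal sketch of choosing k by cross validation, assuming scikit-learn is available (chooseKByCV is a hypothetical helper; the post's own classify0 could also be wrapped in a manual split, but sklearn keeps it short):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def chooseKByCV(datas, labels, k_values=(1, 3, 5, 7, 9, 11)):
    # 5-fold cross validation: each k is scored on 5 train/validation splits
    best_k, best_score = None, 0.0
    for k in k_values:
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), datas, labels, cv=5)
        mean_score = scores.mean()
        print('k = %d, mean accuracy = %f' % (k, mean_score))
        if mean_score > best_score:
            best_k, best_score = k, mean_score
    return best_k

For example, chooseKByCV(normaled, labels) on the normalized dating data would report the mean accuracy for each k and return the best one.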