The k-Nearest Neighbor Algorithm for Machine Learning

Source: Internet
Author: User

This is my first article on Blog Park, but it will not be the last. Although the name "machine learning" sounds intimidating, we know that every seemingly professional term exists mainly to scare off beginners. So for each of those terms, we need to understand what it is really talking about; perhaps this is the hacker spirit I have been pursuing.

The k-nearest neighbor algorithm is a relatively simple machine learning algorithm; not trivially simple, but simpler than most other machine learning algorithms. First, the essence of the algorithm: compute the distance between the test sample and every sample in the training set, sort those distances in ascending order, take the category labels of the k training samples closest to the test sample, and among those k labels choose the one that occurs most frequently. That most frequent label is the category the k-nearest neighbor algorithm predicts for the test sample, based on the given training set.

The description above raises several questions:

    1. How to calculate the distance between the test sample and each sample in the training set;
    2. What is required of the training sample set;
    3. Is it the training sample set itself that gets sorted in ascending order of distance, or something else;
    4. Every test sample must be compared against every sample in the training set; is that not a large amount of computation;
    5. How the value of k is determined.

Now let's discuss each of these questions in turn:

  1. Let's take the questions above one at a time; perhaps that is the best way to learn. First, question 1: how is the distance between samples calculated? Before thinking about the calculation, we have to be clear about what a sample actually is, structurally; otherwise we would be calculating the distance between things we know nothing about. A sample here means preprocessed data: concretely, each sample is processed into an n-dimensional vector, where the value of each dimension represents one attribute of the sample. In that case, the geometric method for calculating the distance between two vectors comes to mind: the Euclidean distance. For two vectors v1 = <a1, a2> and v2 = <b1, b2>, the distance is d(v1, v2) = sqrt((a1 - b1)^2 + (a2 - b2)^2). The key point is that the sample data must be formatted as numeric vectors. The category label attribute does not need to be placed in the vector; the class labels of the whole training set can be stored in a separate list, in the same order as the samples in the training set. (A short sketch of the distance calculation appears after this list.)
  2. As the discussion above makes clear, the samples in the training set must all be processed into numeric vector format; that is the prerequisite for calculating distances later. Beyond that, the essence of the algorithm shows that the quality of the training sample set largely determines the quality of the results, because the prediction for a test sample is made purely from its distances to the training samples. If there is a problem with the training set, the predictions cannot be correct. So before the algorithm runs, the quality of the training set must be ensured, where quality includes the size of the set, how closely it reflects real data, and several other aspects.
  3. From the previous discussion, it would be unwise to sort the training sample set itself when the number of samples is large; that would waste a lot of computational resources. Instead we can sort by index: leave the training samples where they are, sort only the distances, and use the indices of the training samples rather than the samples themselves to record the ordering. Python, through NumPy's argsort, supports this very well (see the sketch after this list).
  4. Needless to say, the amount of computation grows linearly with the size of the training sample set, and the prediction for every test sample requires such a pass. And as the training set grows, so does the storage space it requires.
  5. The choice of the value k is subtle, since different values of k can give the classifier different error rates, so the practical approach is to test several values and select the k that performs best (a rough sketch of this appears after the full implementation below).
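
Before looking at the full implementation, here is a minimal sketch of points 1 and 3 in NumPy (the variable names are my own, not from the original post): it computes the Euclidean distance from one test vector to every training vector at once, then uses argsort to rank the training samples by distance without moving them.

    import numpy as np

    v1 = np.array([1.0, 1.0])            # test vector
    samples = np.array([[1.0, 1.1],      # training vectors, one per row
                        [0.0, 0.1]])

    # Euclidean distance: square root of the summed squared
    # per-dimension differences, broadcast over all rows at once
    distances = np.sqrt(((samples - v1) ** 2).sum(axis=1))

    # argsort returns the indices that would sort the distances in
    # ascending order, so the training samples themselves never move
    nearestFirst = distances.argsort()

    print(distances)     # approximately [0.1, 1.345]
    print(nearestFirst)  # [0 1]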

Let's take a look at an example implementation in Python.

Suppose we create a data set like this:

    from numpy import array

    def createDataSet():
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

The group variable is a NumPy array with 4 rows and 2 columns, representing 4 samples, that is, 4 two-dimensional numeric vectors. The labels list stores the category label corresponding to each sample in group. This gives us a training sample set and a list of the corresponding category labels.

Now suppose we have a test vector a = [1.0, 1.0] and need to predict a category for a based on the training set: does a belong to category A or to category B? The following is an implementation of the k-nearest neighbor algorithm:

    import operator
    from numpy import tile

    def classifyKNN(inX, trainingDataSet, trainingLabels, k):
        dataSetSize = trainingDataSet.shape[0]
        # Difference between the test vector and every training vector
        diffMat = tile(inX, (dataSetSize, 1)) - trainingDataSet
        # Euclidean distance to every training sample
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # Returns the indices of the training samples sorted by ascending distance
        sortedDistIndices = distances.argsort()
        # Count the labels among the k nearest neighbors
        classCount = {}
        for i in range(k):
            votedLabel = trainingLabels[sortedDistIndices[i]]
            classCount[votedLabel] = classCount.get(votedLabel, 0) + 1
        # The most frequent label among the k neighbors is the prediction
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

Then, in the Python shell, enter:

    group, labels = createDataSet()
    classifyKNN([1.0, 1.0], group, labels, 3)

The output is:

    'A'
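
Point 5 above said that a good k is found by testing. As a rough illustration (my own addition, not part of the original post), a leave-one-out loop can compare the error counts of candidate values of k, reusing the createDataSet and classifyKNN functions defined above:

    import numpy as np

    def chooseK(group, labels, candidates):
        # For each candidate k, classify every sample against the
        # remaining samples and count the misclassifications
        bestK, bestErrors = None, None
        for k in candidates:
            errors = 0
            for i in range(len(labels)):
                restGroup = np.delete(group, i, axis=0)    # hold out sample i
                restLabels = labels[:i] + labels[i + 1:]
                if classifyKNN(group[i], restGroup, restLabels, k) != labels[i]:
                    errors += 1
            if bestErrors is None or errors < bestErrors:
                bestK, bestErrors = k, errors
        return bestK

    group, labels = createDataSet()
    print(chooseK(group, labels, [1, 3]))    # prints the k with the fewest errors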

Everything seems so natural and simple, but reality is rarely that simple. We still need to consider how to translate actual data into vector representations in real-world problems. Furthermore, the Euclidean distance formula shows that if one attribute's range of values is much larger than the ranges of the others, that wide-ranged attribute will dominate the distance. So if we want every attribute to be treated as equally important, the attributes with different value ranges need further processing. Also, some sample attributes are nominal rather than numeric, so what should be done about those?
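
To give a flavor of that rescaling step before the next article, here is a minimal min-max scaling sketch (the function name normalize and its return values are my own assumptions, not from the original post). It maps every attribute into the range [0, 1] so that no single wide-ranged attribute dominates the Euclidean distance:

    import numpy as np

    def normalize(dataSet):
        # Column-wise minimum and maximum over all training samples
        minVals = dataSet.min(axis=0)
        maxVals = dataSet.max(axis=0)
        ranges = maxVals - minVals    # assumes no attribute is constant
        # Scale each attribute to [0, 1]; ranges and minVals are returned
        # so test vectors can later be transformed the same way
        return (dataSet - minVals) / ranges, ranges, minVals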

Don't be impatient; the next article will discuss how to apply the k-nearest neighbor algorithm in practice.
