K-Nearest Neighbor Method (k-nearest neighbor, k-NN)


The k-nearest neighbor method (k-NN) is a basic classification method.

The k-nearest neighbor method assumes a training set is given in which the category of each sample is already determined. To classify a new sample, it looks at the k training samples nearest to that sample and predicts its category from theirs, for example by majority vote.

Therefore, the k-nearest neighbor method has no explicit learning process. The choice of k, the distance metric, and the classification decision rule are its three basic elements.

The k-nearest neighbor method was proposed by Cover and Hart in 1968.

Given a training set T = {(x1, y1), (x2, y2), ..., (xN, yN)} with N samples of dimension n, xi ∈ R^n is the feature vector of the i-th sample in the dataset and yi is its mark (category).

Now, to predict the label y of a test sample x with the k-nearest neighbor method, follow these steps:

    1. Based on the selected distance metric, find the k training samples nearest to x; denote this set of neighbors Nk(x).
    2. Decide the label y from the marks in Nk(x) according to the classification decision rule (e.g. majority vote): y = arg max_cj Σ_{xi ∈ Nk(x)} I(yi = cj)

Here I is the indicator function, i.e. I(yi = cj) takes the value 1 when yi = cj and 0 otherwise; c1, ..., cK are all the label kinds appearing in the training set, and cj is the j-th category label.

When k = 1, the k-nearest neighbor method is called the nearest neighbor method: the category of the test sample is simply the category of the single training sample closest to it.
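The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `knn_predict` and the toy data are made up for the example, the distance metric is Euclidean, and the decision rule is a plain majority vote.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples.
    Step 1: rank every training point by distance to x (a linear scan).
    Step 2: vote among the labels of the k closest points."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    neighbors = [yi for _, yi in dists[:k]]  # labels of the k nearest points
    return Counter(neighbors).most_common(1)[0][0]

# Toy 2-D data: class "a" clustered near the origin, class "b" near (5, 5).
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # -> a
print(knn_predict(X, y, (5.5, 5.5), k=1))  # k = 1: plain nearest neighbor -> b
```

With k = 1 the vote degenerates to copying the label of the single closest sample, matching the nearest neighbor special case described above.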

Let's now discuss an implementation of the k-nearest neighbor method: the kd-tree.

The simplest implementation of the k-nearest neighbor method is a linear scan, which computes the distance between the test sample and every training instance in the training set. When the training set is large this is very time-consuming and becomes infeasible.

To improve the efficiency of nearest-neighbor search, consider storing the training data in a special structure that reduces the number of distances to be calculated. There are many ways to do this; here we introduce the kd-tree.

A kd-tree is a tree-shaped data structure that stores instance points in a k-dimensional space for quick retrieval. It is a binary tree that represents a division (partition) of the k-dimensional space. Constructing a kd-tree amounts to repeatedly dividing the space with hyperplanes perpendicular to the coordinate axes, forming a series of hyper-rectangular regions. Each node of the tree corresponds to one k-dimensional hyper-rectangular region.

For a training set with N samples of dimension k, the steps to construct the kd-tree are as follows:

    1. Construct a root node corresponding to the hyper-rectangular region containing all the samples.
    2. Select the 1st dimension as the splitting axis, take the median of all samples on that dimension as the split point, and divide the root node's hyper-rectangular region into two sub-regions. The split is implemented by a hyperplane that passes through the split point and is perpendicular to the chosen axis. This creates left and right child nodes at depth 1: every sample in the left child has a value less than the split point on the 1st dimension, and every sample in the right child has a value greater than it.
    3. Repeat step 2: for a node at depth j, select dimension l = (j mod k) + 1 as the splitting axis and split exactly as in step 2, producing left and right child nodes at depth j + 1.
    4. Stop when the two sub-regions contain no more instances; the result is the kd-tree's partition of the space.
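The construction steps above can be sketched as a short recursion. This is an illustrative sketch under simplifying assumptions: the `Node` class and `build_kdtree` function are names invented for the example, the splitting axis is `depth mod k` (the 0-indexed form of the rule in step 3), and the median sample itself is stored at each node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    point: tuple             # the median sample stored at this node
    axis: int                # dimension used to split at this depth
    left: Optional["Node"]   # samples below the split point on this axis
    right: Optional["Node"]  # samples above the split point on this axis

def build_kdtree(points, depth=0):
    """Recursively split the point set on axis = depth mod k at the median."""
    if not points:
        return None  # step 4: an empty sub-region stops the recursion
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median sample becomes this node's split point
    return Node(
        point=points[mid],
        axis=axis,
        left=build_kdtree(points[:mid], depth + 1),
        right=build_kdtree(points[mid + 1:], depth + 1),
    )

# A small made-up 2-D dataset for illustration.
pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(tree.point, tree.axis)  # root splits on dimension 0 at the x-median
```

Sorting at every level is the simplest way to find the median; a production implementation would use a linear-time selection instead.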

As an example from the book, constructing the kd-tree for a small 2-D dataset divides its feature space as shown here:

The constructed kd-tree is as follows:

Now we use the kd-tree for nearest-neighbor search, which is illustrated below:

Here S is the test sample. First descend the tree to find the leaf node whose region contains S; let D be its instance point. Draw a circle centered at S passing through D: the true nearest neighbor must lie inside this circle. Then backtrack to the parent nodes in turn, checking whether each node's other region intersects the circle; if it does, compare the distances from the points in that region to S with the current nearest point and update it. When the root is reached, the current nearest point is the nearest neighbor.
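The search just described can be sketched as a recursive descent with backtracking. This is a self-contained illustration under the same assumptions as before: `Node`, `build_kdtree`, and `nearest` are names invented for the example, and the "circle" test becomes a comparison between the distance to the splitting plane and the current best distance.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    point: tuple
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points, depth=0):
    """Standard median-split construction (see the steps above)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Descend toward the leaf region containing target, then backtrack,
    pruning any subtree whose splitting plane lies outside the circle
    centered at target through the current best point."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)        # side containing the target first
    if abs(diff) < math.dist(best, target):   # does the circle cross the plane?
        best = nearest(far, target, best)     # then the far side may be closer
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (3, 4.5)))  # -> (2, 3)
```

The pruning test `abs(diff) < dist(best, target)` is exactly the intersection check from the text: the far sub-region is visited only when the splitting hyperplane cuts through the current circle.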

