The k-nearest neighbor algorithm (one of the top ten data mining algorithms)


The k-nearest neighbor algorithm is the most basic of the instance-based learning methods, so we first introduce the related concepts of instance-based learning.

I. Instance-based learning

1. Given a set of training examples, many learning methods construct an explicit, general description of the target function. Instance-based learning methods, in contrast, simply store the training examples.

Generalization from these instances is deferred until a new instance must be classified. Each time the learner encounters a new query instance, it examines the relationship of the new instance to the previously stored instances and assigns a target function value accordingly.

2. Instance-based methods can construct a different approximation to the target function for each distinct query instance to be classified. In fact, many of these techniques construct only a local approximation of the target function that applies in the neighborhood of the new query instance, and never construct an approximation designed to perform well over the entire instance space. This has significant advantages when the target function is complex but can still be described by a collection of less complex local approximations.

3. Drawbacks of instance-based methods:

(1) The cost of classifying a new instance can be high. This is because nearly all of the computation takes place at classification time rather than when the training examples are first encountered. Therefore, how to index the training examples efficiently so as to reduce the computation required per query is an important practical problem.

(2) When similar training examples are retrieved from memory, all attributes of the instances are typically considered. If the target concept depends on only a few of the many available attributes, the instances that are truly most "similar" may end up far apart in the full attribute space.

II. The k-nearest neighbor algorithm

The most basic instance-based learning method is the k-nearest neighbor algorithm. The algorithm assumes that all instances correspond to points in the n-dimensional Euclidean space ℝ^n. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, an arbitrary instance x is represented by the feature vector

⟨a_1(x), a_2(x), ..., a_n(x)⟩

where a_r(x) denotes the value of the r-th attribute of instance x. The distance between two instances x_i and x_j is then defined as d(x_i, x_j), where:

d(x_i, x_j) = √( Σ_{r=1}^{n} ( a_r(x_i) − a_r(x_j) )² )

Notes:

1. In nearest-neighbor learning, the target function may be either discrete-valued or real-valued.

2. We first consider learning discrete-valued target functions of the form f : ℝ^n → V, where V is the finite set {v_1, ..., v_s}. The k-nearest neighbor algorithm for approximating a discrete-valued target function is given in the table below.

3. As shown in the table below, the value f̂(x_q) returned by this algorithm is its estimate of f(x_q); it is simply the most common value of f among the k training examples nearest to x_q.

4. If we choose k = 1, the 1-nearest neighbor algorithm assigns f(x_i) to f̂(x_q), where x_i is the training instance nearest to x_q. For larger values of k, the algorithm returns the most common value of f among the k nearest training examples.

K-nearest neighbor algorithm for approximating a discrete-valued function f : ℝ^n → V

Training algorithm:

For each training example ⟨x, f(x)⟩, add the example to the list training_examples.

Classification algorithm:

Given a query instance x_q to be classified,

let x_1, ..., x_k denote the k instances in training_examples that are nearest to x_q.

Return
f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(x_i))

where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.
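To make the table above concrete, here is a minimal Python sketch of the discrete-valued k-nearest neighbor classifier. The helper names (euclidean_distance, knn_classify) and the toy data are illustrative choices, not part of the original algorithm description.

```python
import math
from collections import Counter

def euclidean_distance(xi, xj):
    """Standard Euclidean distance d(xi, xj) computed over all attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training_examples, xq, k):
    """Return the most common f-value among the k examples nearest to xq.

    training_examples is a list of (x, f_x) pairs, where x is an attribute
    vector and f_x is its discrete label.
    """
    # Select the k instances in training_examples nearest to xq.
    neighbors = sorted(training_examples,
                       key=lambda ex: euclidean_distance(ex[0], xq))[:k]
    # Majority vote: argmax over v of sum_i delta(v, f(x_i)).
    labels = [f_x for _, f_x in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Illustrative toy data: '+' cluster near (1, 1), '-' cluster near (4, 4).
training_examples = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'),
                     ((4.0, 4.0), '-'), ((4.2, 3.9), '-'), ((3.8, 4.1), '-')]
print(knn_classify(training_examples, xq=(1.1, 1.0), k=3))  # prints '+'
```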

The figure below illustrates the k-nearest neighbor algorithm for a simple case in which the instances are points in a two-dimensional space and the target function is Boolean-valued. Positive and negative training examples are shown as "+" and "-" respectively, and a query point x_q is also drawn. Note that in this figure the 1-nearest neighbor algorithm classifies x_q as a positive example, whereas the 5-nearest neighbor algorithm classifies x_q as a negative example.

Illustration: the left panel shows a set of positive and negative training examples together with a query instance x_q to be classified. The 1-nearest neighbor algorithm classifies x_q as positive, whereas the 5-nearest neighbor algorithm classifies x_q as negative.

The right panel shows the decision surface induced by the 1-nearest neighbor algorithm for a typical set of training examples. The convex polygon surrounding each training example delimits the region of instance space closest to that example (that is, every instance in that region is assigned, by the 1-nearest neighbor algorithm, the classification of that training example).

With a simple modification, the k-nearest neighbor algorithm can also be used to approximate a continuous-valued target function. To do this, we have the algorithm compute the mean value of the k nearest training examples rather than their most common value. More precisely, to approximate a real-valued target function f : ℝ^n → ℝ, we simply replace the final line of the algorithm above with

f̂(x_q) ← ( Σ_{i=1}^{k} f(x_i) ) / k
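As a sketch of that one-line change, the hypothetical knn_regress function below reuses the euclidean_distance helper from the earlier sketch and simply averages the f-values of the k nearest neighbors.

```python
def knn_regress(training_examples, xq, k):
    """Mean of the f-values of the k nearest examples (real-valued target)."""
    neighbors = sorted(training_examples,
                       key=lambda ex: euclidean_distance(ex[0], xq))[:k]
    # f_hat(xq) <- (1/k) * sum of f(x_i) over the k nearest neighbors.
    return sum(f_x for _, f_x in neighbors) / k
```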

III. The distance-weighted nearest neighbor algorithm

One obvious refinement of the k-nearest neighbor algorithm is to weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors.

For example, in the algorithm of the table above, which approximates a discrete-valued target function, we can weight each neighbor's "vote" by the inverse square of its distance from x_q.

This is achieved by replacing the final line of the algorithm in the table above with

f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1}^{k} w_i · δ(v, f(x_i))

where

w_i ≡ 1 / d(x_q, x_i)²

To handle the case where the query point x_q exactly matches a training example x_i, so that the denominator d(x_q, x_i)² is zero, we set f̂(x_q) equal to f(x_i). If several training examples match x_q exactly, we use the majority classification among them.
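A minimal sketch of this weighted voting rule, reusing euclidean_distance from the first sketch; the exact-match case is handled as described above, and the function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def weighted_knn_classify(training_examples, xq, k):
    """Distance-weighted vote: each neighbor votes with weight w_i = 1 / d(xq, xi)^2."""
    neighbors = sorted(training_examples,
                       key=lambda ex: euclidean_distance(ex[0], xq))[:k]
    votes = defaultdict(float)
    exact_matches = []
    for xi, f_xi in neighbors:
        d = euclidean_distance(xi, xq)
        if d == 0.0:                       # xq coincides with this training example
            exact_matches.append(f_xi)
        else:
            votes[f_xi] += 1.0 / d ** 2    # w_i = 1 / d(xq, xi)^2
    if exact_matches:                      # use the majority label of exact matches
        return Counter(exact_matches).most_common(1)[0][0]
    return max(votes, key=votes.get)
```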

We can distance-weight the approximation of a real-valued target function in a similar way, by replacing the corresponding line of the algorithm above with

f̂(x_q) ← ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i )

where w_i is defined as in the previous formula.

Note that the denominator in this formula is a constant that normalizes the contributions of the different weights (for example, it guarantees that f̂(x_q) = c if f(x_i) = c for all training examples x_i).
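A corresponding sketch for the real-valued case, again reusing the earlier euclidean_distance helper; the sum of the weights serves as the normalizing denominator noted above.

```python
def weighted_knn_regress(training_examples, xq, k):
    """Distance-weighted mean: f_hat(xq) = sum(w_i * f(x_i)) / sum(w_i)."""
    neighbors = sorted(training_examples,
                       key=lambda ex: euclidean_distance(ex[0], xq))[:k]
    numerator, denominator = 0.0, 0.0
    for xi, f_xi in neighbors:
        d = euclidean_distance(xi, xq)
        if d == 0.0:                 # exact match: return f(x_i) directly
            return f_xi
        w = 1.0 / d ** 2             # w_i = 1 / d(xq, xi)^2
        numerator += w * f_xi
        denominator += w
    return numerator / denominator   # the denominator normalizes the weights
```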

Note that all of the variants of the k-nearest neighbor algorithm described above consider only the k nearest neighbors when classifying a query point. Once distance weighting is used, there is in fact no harm in allowing all training examples to influence the classification of x_q, because very distant examples have very little effect on f̂(x_q). The only disadvantage of considering all examples is that classification runs more slowly. If all training examples are considered when classifying a new query instance, we call the method a global method; if only the nearest training examples are considered, we call it a local method.
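A sketch of the global variant under the same assumptions as the earlier sketches: every stored example casts a distance-weighted vote, but very distant examples contribute almost nothing.

```python
from collections import defaultdict

def global_weighted_classify(training_examples, xq):
    """Global variant: every stored training example casts a distance-weighted vote."""
    votes = defaultdict(float)
    for xi, f_xi in training_examples:
        d = euclidean_distance(xi, xq)
        if d == 0.0:
            return f_xi                 # an exact match decides the class outright
        votes[f_xi] += 1.0 / d ** 2     # distant examples contribute almost nothing
    return max(votes, key=votes.get)
```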

IV. Remarks on the k-nearest neighbor algorithm

The distance-weighted k-nearest neighbor algorithm is a highly effective inductive inference method. It is robust to noise in the training data and works well when given a sufficiently large training set. Note that taking the weighted average of the k nearest neighbors smooths out the effect of isolated noisy training examples.

1. Problem one: The distance between neighbors will be dominated by a large number of irrelevant attributes.

A practical issue in applying the k-nearest neighbor algorithm is that the distance between instances is computed using all attributes of the instances (that is, all axes of the Euclidean space containing the instances). This differs from methods that select only a subset of the instance attributes, such as decision tree learning systems.

For example, consider a problem in which each instance is described by 20 attributes, but only 2 of them are relevant to its classification. In this case, instances that agree on the two relevant attributes may nevertheless be far apart in the 20-dimensional instance space. As a result, a similarity metric that uses all 20 attributes can mislead the k-nearest neighbor classification: the distance between neighbors is dominated by the large number of irrelevant attributes. This difficulty, arising from the presence of many irrelevant attributes, is sometimes called the curse of dimensionality. Nearest neighbor methods are particularly sensitive to this problem.

2. Solution: weight each attribute differently when computing the distance between two instances.

This corresponds to rescaling the axes of the Euclidean space, shortening the axes that correspond to less relevant attributes and lengthening the axes that correspond to more relevant attributes. The amount by which each axis should be stretched can be determined automatically by cross-validation.
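One way to realize this axis rescaling is to fold a per-attribute stretch factor into the distance computation. The sketch below is illustrative only; the particular weights shown are hypothetical and would in practice be chosen by cross-validation.

```python
import math

def weighted_attribute_distance(xi, xj, attribute_weights):
    """Euclidean distance with one stretch factor z_r per axis (attribute)."""
    return math.sqrt(sum(z * (a - b) ** 2
                         for z, a, b in zip(attribute_weights, xi, xj)))

# Hypothetical stretch factors: keep the two relevant axes, shrink an irrelevant one.
attribute_weights = [1.0, 1.0, 0.01]
print(weighted_attribute_distance((1.0, 2.0, 50.0), (1.1, 2.1, 10.0), attribute_weights))
```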

3. Problem two: another practical issue in applying the k-nearest neighbor algorithm is how to build an efficient index over the stored training examples. Because the algorithm defers all processing until a new query is received, answering each new query may require substantial computation.

4. Solution: many methods have been developed for indexing the stored training examples so that the nearest neighbors can be found more efficiently, at the cost of some additional storage. One such indexing method is the kd-tree (Bentley 1975; Friedman et al. 1977), which stores instances at the leaf nodes of a tree, with nearby instances stored at the same or nearby nodes. The internal nodes of the tree route a new query x_q to the relevant leaf node by testing selected attributes of x_q.
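As an illustration of such an index, the sketch below uses SciPy's KDTree (assuming SciPy is available; any kd-tree implementation would serve): the tree is built once over the stored instances, and each query then descends the tree rather than scanning every stored example.

```python
import numpy as np
from scipy.spatial import KDTree   # assumes SciPy is installed

# Build the kd-tree once over the stored training instances.
points = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
labels = ['+', '+', '-', '-', '-']
tree = KDTree(points)

# Each query is routed down the tree instead of scanning every stored example.
distances, indices = tree.query([1.1, 1.0], k=3)
print([labels[i] for i in indices])   # labels of the 3 nearest neighbors
```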



