KNN (nearest neighbor algorithm)

Source: Internet
Author: User

KNN is one of the simplest machine learning algorithms.

In pattern recognition, the k-nearest neighbor algorithm (k-NN for short) is a non-parametric method used for classification and regression. [1] In both cases, the input consists of the k closest training samples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.

In k-NN regression, the output is the property value of the object, computed as the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

For both classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, although no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. It should not be confused with k-means, another popular machine learning technique.

Algorithm

An example of k-NN classification: the test sample (green circle) should be classified either into the class of blue squares or the class of red triangles. If k = 3 (solid circle), it is assigned to the second class (red triangles), because there are two triangles and only one square inside the inner circle. If k = 5 (dashed circle), it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

The training samples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is assigned the label that occurs most frequently among the k training samples nearest to that query point.
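As an illustration, here is a minimal sketch of the classification phase in Python (the function and variable names are only illustrative, not taken from any particular library):

    import numpy as np
    from collections import Counter

    def knn_classify(query, train_X, train_y, k=3):
        """Classify `query` by a majority vote among its k nearest training samples."""
        # Euclidean distances from the query point to every stored training vector
        distances = np.linalg.norm(train_X - query, axis=1)
        # Indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # Majority vote over their class labels
        return Counter(train_y[nearest]).most_common(1)[0][0]

    # Example: two 2-D classes
    train_X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    train_y = np.array(["blue", "blue", "red", "red"])
    print(knn_classify(np.array([0.8, 0.9]), train_X, train_y, k=3))  # expected: "red"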

A commonly used distance metric for continuous variables is the Euclidean distance. For discrete variables, such as in text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of microarray gene expression data, for example, k-NN has also been used with correlation coefficients such as Pearson's and Spearman's. Often, the classification accuracy of k-NN can be significantly improved if the distance metric is learned with specialized algorithms such as large margin nearest neighbor or neighborhood components analysis.

A basic drawback of the "majority vote" classification occurs when the class distribution is skewed: examples of a more frequent class tend to dominate the prediction for a new example, simply because they are common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors: the class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in the data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM.
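A minimal sketch of the distance-weighted vote described above (Euclidean distance and 1/d weights are assumed; names are illustrative):

    import numpy as np

    def weighted_knn_classify(query, train_X, train_y, k=5, eps=1e-12):
        """Weighted k-NN vote: each neighbor adds 1/d to the score of its class."""
        distances = np.linalg.norm(train_X - query, axis=1)
        nearest = np.argsort(distances)[:k]
        scores = {}
        for i in nearest:
            # Inverse-distance weight; eps guards against division by zero
            scores[train_y[i]] = scores.get(train_y[i], 0.0) + 1.0 / (distances[i] + eps)
        return max(scores, key=scores.get)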

Parameter selection

The best choice of k depends on the data: in general, larger values of k reduce the effect of noise on the classification, but make the boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case in which the class is predicted to be the class of the closest training sample (that is, when k = 1) is called the nearest neighbor algorithm.
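One common heuristic is to choose k by cross-validation. A sketch using scikit-learn (the data set and the range of candidate values are only for illustration):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Evaluate odd values of k by 5-fold cross-validated accuracy
    scores = {}
    for k in range(1, 22, 2):
        scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])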

Properties

k-NN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel. [8] [9]

The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but it is computationally intensive for large training sets. Using an appropriate nearest neighbor search algorithm makes k-NN computationally tractable even for large data sets. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.
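For example, a k-d tree (one of many such search structures) avoids comparing the query against every stored example. A sketch using SciPy (the data here are random and only illustrative):

    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    train_X = rng.random((10000, 3))             # 10,000 stored examples in 3-D

    tree = KDTree(train_X)                       # build the search structure once
    query = np.array([0.5, 0.5, 0.5])
    distances, indices = tree.query(query, k=5)  # 5 nearest neighbors without a full scan
    print(indices, distances)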

k-NN has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). k-NN is guaranteed to approach the Bayes error rate for some value of k (where k increases as a function of the number of data points). Various improvements to k-NN are possible by using proximity graphs.

Metric Learning

k-NN classification performance can often be significantly improved through (supervised) metric learning. Popular algorithms are neighborhood components analysis and large margin nearest neighbor. Supervised metric learning algorithms use the label information to learn a new metric or pseudo-metric.
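A sketch of metric learning followed by k-NN classification, assuming scikit-learn's implementation of neighborhood components analysis is available (the data set and parameters are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Learn a linear transformation from the labels, then run k-NN in the new space
    model = Pipeline([
        ("nca", NeighborhoodComponentsAnalysis(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))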

Feature Extraction

When the input data to an algorithm is too large to be processed and is suspected to be redundant (for example, the same measurement in both feet and meters), the input is transformed into a reduced set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. If the features extracted are carefully chosen, it is expected that the feature set will capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Feature extraction is performed on the raw data prior to applying the k-NN algorithm to the transformed data in feature space.

An example computer vision pipeline for face recognition using k-NN, including feature extraction and dimension reduction pre-processing steps (usually implemented with OpenCV):

    1. Haar face detection
    2. Mean-shift tracking analysis
    3. PCA or Fisher LDA projection into feature space, followed by k-NN classification

Dimension reduction

For high-dimensional data (for example, with more than 10 dimensions), dimension reduction is usually performed prior to applying k-NN in order to avoid the effects of the curse of dimensionality.

The curse of dimensionality in the k-NN context basically means that Euclidean distance is unhelpful in high dimensions, because all vectors are almost equidistant from the search query vector (imagine multiple points lying more or less on a circle with the query point at the center; the distance from the query to every data point in the search space is almost the same).

Feature extraction and dimension reduction can be combined in one step by using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) as a pre-processing step, followed by k-NN on the feature vectors in the reduced-dimension space. In machine learning this process is also called low-dimensional embedding.
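A sketch of this kind of pre-processing, reducing the data with PCA before k-NN classification (scikit-learn; the data set and the number of components are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)            # 64-dimensional inputs
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = Pipeline([
        ("pca", PCA(n_components=15)),             # project to a 15-dimensional space
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))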

For very high-dimensional data sets (for example, when performing a similarity search on live video streams, DNA data, or high-dimensional time series), running a fast approximate k-NN search using locality sensitive hashing, "random projections", "sketches", or other high-dimensional similarity search techniques from the VLDB toolbox might be the only feasible option.
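A minimal from-scratch sketch of approximate search via random projections (sign hashing); this is only one of several LSH families, and the class name and parameters are illustrative:

    import numpy as np

    class RandomProjectionLSH:
        """Hash vectors by the signs of a few random projections; nearby vectors
        tend to fall into the same bucket, so only that bucket is searched."""

        def __init__(self, dim, n_bits=12, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_bits, dim))
            self.buckets = {}

        def _hash(self, v):
            return tuple((self.planes @ v > 0).astype(int))

        def add(self, vectors):
            self.vectors = np.asarray(vectors, dtype=float)
            for i, v in enumerate(self.vectors):
                self.buckets.setdefault(self._hash(v), []).append(i)

        def query(self, q, k=5):
            # Only examine candidates that share the query's hash bucket
            candidates = self.buckets.get(self._hash(q), [])
            if not candidates:
                return []
            dists = np.linalg.norm(self.vectors[candidates] - q, axis=1)
            return [candidates[i] for i in np.argsort(dists)[:k]]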

Decision boundary

Nearest neighbor rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.

Data reduction

Data reduction is one of the most important problems when working with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows:

1. Select the class outliers, that is, training data that are classified incorrectly by k-NN (for a given k).

2. Separate the rest of the data into two sets: (i) the prototypes, which are used for the classification decisions, and (ii) the absorbed points, which can be correctly classified by k-NN using the prototypes and can therefore be removed from the training set.

Selection of class outliers

A training example surrounded by examples of other classes is called a class outlier. Causes of class outliers include:

Random error

There are few training examples for this class (an isolated example, not a cluster)

Missing important features (the classes are separated in other dimensions, which we do not know about)

Too many training examples of other classes (unbalanced classes), so that the class with little data is overwhelmed by classes with much data

Class outliers produce noise for k-NN. They can be detected and separated for future analysis. Given two natural numbers k > r > 0, a training example is called a (k,r)NN class outlier if its k nearest neighbors include more than r examples of other classes.
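A sketch of (k,r)NN class-outlier detection (brute-force distances; names are illustrative):

    import numpy as np

    def class_outliers(X, y, k=3, r=2):
        """Return indices of (k,r)NN class outliers: examples whose k nearest
        neighbors include more than r examples of other classes."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        outliers = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                          # exclude the point itself
            nearest = np.argsort(d)[:k]
            if np.sum(y[nearest] != y[i]) > r:
                outliers.append(i)
        return outliers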

CNN for Data reduction

Condensed nearest neighbor (CNN, the Hart algorithm) is an algorithm designed to reduce the data set for k-NN classification. It selects a set of prototypes U from the training data such that 1NN with U can classify the examples almost as accurately as 1NN does with the whole data set.

Given a training set X, CNN works iteratively:

    1. Scan the elements of X, looking for an element x whose nearest prototype from U has a different label than x.
    2. Remove x from X and add it to U.
    3. Repeat the scan until no more prototypes are added to U.

Use U instead of X for classification. The examples that are not prototypes are called "absorbed" points; a sketch of the procedure is shown below.
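A minimal sketch of the scan described above, using 1NN with Euclidean distance (unoptimized; names are illustrative):

    import numpy as np

    def condensed_nearest_neighbor(X, y):
        """Hart's CNN: build a prototype set U such that 1NN with U labels the
        training data in (almost) the same way as 1NN with all of X."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        U = [0]                                    # seed U with an arbitrary example
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in U:
                    continue
                d = np.linalg.norm(X[U] - X[i], axis=1)
                nearest = U[int(np.argmin(d))]     # nearest prototype currently in U
                if y[nearest] != y[i]:             # misclassified: promote to prototype
                    U.append(i)
                    changed = True
        return U                                   # prototype indices; the rest are "absorbed"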

It is efficient to scan the training examples in order of decreasing border ratio. The border ratio of a training example x is defined as

a(x) = ||x'−y|| / ||x−y||

where ||x−y|| is the distance to the closest example y having a different label than x, and ||x'−y|| is the distance from y to its closest example x' with the same label as x.
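A sketch that computes the border ratio a(x) for every training example, following the definition above (brute force; names are illustrative):

    import numpy as np

    def border_ratios(X, labels):
        """a(x) = ||x' - y|| / ||x - y||: y is the closest example with a label
        different from x, and x' is the closest example to y with x's label."""
        X, labels = np.asarray(X, dtype=float), np.asarray(labels)
        ratios = np.zeros(len(X))
        for i in range(len(X)):
            other = np.where(labels != labels[i])[0]         # points external to x
            y = other[np.argmin(np.linalg.norm(X[other] - X[i], axis=1))]
            same = np.where(labels == labels[i])[0]
            xp = same[np.argmin(np.linalg.norm(X[same] - X[y], axis=1))]
            ratios[i] = np.linalg.norm(X[xp] - X[y]) / np.linalg.norm(X[i] - X[y])
        return ratios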

Calculation of the border ratio

The border ratio lies in the interval [0,1] because ||x'−y|| never exceeds ||x−y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point with a different label than x is called external to x. The calculation of the border ratio is illustrated in the figure: the data points are labeled by color, the initial point is x and its label is red, the external points are blue and green, the closest external point to x is y, and the closest red point to y is x'. The border ratio a(x) = ||x'−y|| / ||x−y|| is the attribute of the initial point x.

Below is an illustration of CNN in a series of figures. There are three classes (red, green, and blue). Fig. 1: initially there are 60 points in each class. Fig. 2 shows the 1NN classification map: each pixel is classified by 1NN using all the data. Fig. 3 shows the 5NN classification map; the white areas correspond to unclassified regions where the 5NN vote is tied (for example, if there are two green, two red, and one blue point among the 5 nearest neighbors). Fig. 4 shows the reduced data set: the crosses are the class outliers selected by the (3,2)NN rule (all three nearest neighbors of these instances belong to other classes), the squares are the prototypes, and the empty circles are the absorbed points. In this example, the number of prototypes varies from 15% to 20% across the classes; the lower-left corner lists the numbers of class outliers, prototypes, and absorbed points for all three classes. Fig. 5 shows that the 1NN classification map based on the prototypes is very similar to that based on the initial data set.

Figures: the data set and its 1NN classification map; the 5NN classification map; the reduced data set; the 1NN classification map based on the prototypes.

K-Nearest Neighbor Regression

k-NN regression is used to estimate continuous variables. One such algorithm uses a weighted average of the k nearest neighbors, weighted by the inverse of their distance to the query. The algorithm works as follows (a sketch follows the steps):

1. Compute the Euclidean or Mahalanobis distance from the query example to the labeled examples.

2. Order the labeled examples by increasing distance.

3. Find a heuristically optimal number k of nearest neighbors based on RMSE. This is done using cross-validation.

4. Compute an inverse-distance weighted average over the k nearest multivariate neighbors.
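A sketch of the inverse-distance weighted regression described in the steps above (Euclidean distance is assumed, and k is taken as given, e.g. chosen by cross-validation; names are illustrative):

    import numpy as np

    def knn_regress(query, train_X, train_y, k=5, eps=1e-12):
        """Predict a continuous value as the inverse-distance weighted average
        of the target values of the k nearest neighbors."""
        distances = np.linalg.norm(train_X - query, axis=1)
        nearest = np.argsort(distances)[:k]
        weights = 1.0 / (distances[nearest] + eps)    # step 4: 1/d weights
        return np.sum(weights * train_y[nearest]) / np.sum(weights)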
