3.1 k-nearest neighbor algorithm
Given a training data set, for a new input instance, find the k instances nearest to it in the training data set; the majority of these k instances belong to some class, and the input instance is assigned to that class.
Algorithm 3.1
Input: training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$,
where $x_i \in \mathcal{X} \subseteq \mathbf{R}^n$ is the feature vector of an instance and $y_i \in \mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$ is the class of an instance, $i = 1, 2, \ldots, N$; an instance feature vector $x$;
Output: the class $y$ to which instance $x$ belongs.
(1) According to the given distance metric, find the $k$ points nearest to $x$ in the training set $T$; the neighborhood of $x$ covering these $k$ points is denoted $N_k(x)$;
(2) In $N_k(x)$, determine the class $y$ of $x$ according to the classification decision rule (e.g. majority voting):
$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, K$$
where $I$ is the indicator function: $I = 1$ when $y_i = c_j$ and $I = 0$ otherwise.
The special case of the k-nearest neighbor algorithm with k = 1 is called the nearest neighbor algorithm: for an input instance point x, the nearest neighbor algorithm assigns x the class of the training instance nearest to x.
The k-nearest neighbor algorithm has no explicit learning process.
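To make Algorithm 3.1 concrete, here is a minimal brute-force sketch in Python. It assumes Euclidean distance as the metric (the distance metric discussion is omitted below), and the function name knn_classify and the toy data are illustrative only, not part of the original text.

```python
from collections import Counter

import numpy as np


def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training instances (Algorithm 3.1 sketch)."""
    # Euclidean (L2) distance from x to every training instance.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors, i.e. the neighborhood N_k(x).
    nearest = np.argsort(dists)[:k]
    # Majority voting over the neighbors' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]


# Toy 2-D training set, chosen only for illustration.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([1.1, 1.0]), X_train, y_train, k=3))  # -> 0
```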
3.2 k-nearest neighbor model
3.2.1 Model
In the k-nearest neighbor algorithm, once the training data set, distance metric, k value, and classification decision rule are determined, the class of any new input instance is uniquely determined.
In the feature space, for each training instance point $x_i$, all points closer to $x_i$ than to any other training instance point form a region called the cell of $x_i$. The cells of all training instance points constitute a partition of the feature space. The nearest neighbor model takes the class $y_i$ of instance $x_i$ as the class label of all points in its cell. In this way, the class of the instance points in each cell is determined.
3.2.2 Distance metric
Omitted.
3.2.3 Selection of the k value
If k is too small, the model becomes more complex and is prone to overfitting (sensitive to noise among the nearby points); if k is too large, the model becomes too simple. In practice, cross-validation is usually used to choose the optimal value of k.
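As a sketch of choosing k by cross-validation (assuming scikit-learn is available; the data set and the range of k values are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 16):
    # 5-fold cross-validated accuracy for this value of k.
    scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

best_k = max(scores, key=scores.get)  # k with the highest cross-validated accuracy
print(best_k, round(scores[best_k], 3))
```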
3.2.4 Classification decision rule
A. Majority voting rules
If the loss function for classification is the 0-1 loss function, and the classification function is
$$f : \mathbf{R}^n \to \{c_1, c_2, \ldots, c_K\},$$
then the probability of misclassification is
$$P(Y \neq f(X)) = 1 - P(Y = f(X)).$$
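Spelling out the step from the misclassification probability to the majority voting rule: for a given instance $x$ whose neighborhood $N_k(x)$ is assigned class $c_j$, the misclassification rate over the neighborhood is
$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j).$$
Minimizing the misclassification rate is therefore equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, so the majority voting rule is equivalent to empirical risk minimization.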
3.3 Implementation of the k-nearest neighbor algorithm: the kd-tree
3.3.1 Constructing the kd-tree
Algorithm 3.2 (constructing a balanced kd-tree)
Input: k-dimensional space data set $T = \{x_1, x_2, \ldots, x_N\}$,
where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(k)})^{\mathrm{T}}$, $i = 1, 2, \ldots, N$;
Output: kd-tree.
(1) Begin: construct the root node, which corresponds to the hyperrectangular region of k-dimensional space containing T. Choose $x^{(1)}$ as the coordinate axis and take the median of the $x^{(1)}$ coordinates of all instances in T as the splitting point, dividing the root node's hyperrectangular region into two subregions. The split is implemented by a hyperplane through the splitting point and perpendicular to the $x^{(1)}$ axis.
The root node generates left and right child nodes of depth 1: the left child node corresponds to the subregion whose $x^{(1)}$ coordinates are smaller than the splitting point, and the right child node to the subregion whose $x^{(1)}$ coordinates are larger than the splitting point.
(2) Repeat: for a node of depth j, choose $x^{(l)}$ as the splitting axis, where $l = j \,(\mathrm{mod}\, k) + 1$; take the median of the $x^{(l)}$ coordinates of all instances in the node's region as the splitting point, dividing the node's hyperrectangular region into two subregions. The split is implemented by a hyperplane through the splitting point and perpendicular to the $x^{(l)}$ axis.
The node generates left and right child nodes of depth j + 1: the left child node corresponds to the subregion whose $x^{(l)}$ coordinates are smaller than the splitting point, and the right child node to the subregion whose $x^{(l)}$ coordinates are larger than the splitting point.
Instance points falling on the splitting hyperplane are saved at the node.
(3) Stop when the two subregions contain no more instances, thus forming the partition of the kd-tree.
Example: the kd-tree partition of a two-dimensional data set.
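A minimal Python sketch of Algorithm 3.2 follows. The class and function names (KDNode, build_kd_tree) and the toy 2-D data set are illustrative assumptions, not taken from the text.

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point  # instance saved at this node (it lies on the splitting hyperplane)
        self.axis = axis    # splitting axis used at this depth
        self.left = left    # child for the subregion with smaller coordinates on the axis
        self.right = right  # child for the subregion with larger coordinates on the axis


def build_kd_tree(points, depth=0):
    """Recursively build a balanced kd-tree from a list of k-dimensional points."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2              # median instance becomes the splitting point
    return KDNode(
        point=points[median],
        axis=axis,
        left=build_kd_tree(points[:median], depth + 1),
        right=build_kd_tree(points[median + 1:], depth + 1),
    )


# Toy 2-D data set, chosen only for illustration.
data = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(data)
print(tree.point)  # root splitting point -> (7, 2)
```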
3.3.2 Searching the kd-tree
Given a target point, search for its nearest neighbor. First find the leaf node containing the target point; then, starting from that leaf node, fall back to its parent node level by level, continually looking for the node nearest to the target point, and terminate when it is certain that no closer node can exist.
Algorithm 3.3 (nearest neighbor search with a kd-tree)
Input: a constructed kd-tree; a target point x;
Output: the nearest neighbor of x.
(1) In the kd-tree, find the leaf node containing the target point x: starting from the root node, move down the kd-tree recursively. If the coordinate of the target point x in the current splitting dimension is smaller than that of the splitting point, move to the left child node; otherwise move to the right child node, until a leaf node is reached.
(2) Take this leaf node as the "current nearest point".
(3) Fall back upward recursively, performing the following at each node:
A. If the instance point saved at the node is closer to the target point than the current nearest point, that instance point becomes the "current nearest point".
B. The current nearest point must lie in the region corresponding to one child node of this node. Check whether the region corresponding to the node's other child node contains a closer point. Specifically, check whether that region intersects the hypersphere centered at the target point with radius equal to the distance between the target point and the current nearest point. If they intersect, a closer point may exist in the other child node's region; move to that child node and continue the nearest neighbor search recursively. If they do not intersect, fall back upward.
(4) When the fallback reaches the root node, the search ends. The last "current nearest point" is the nearest neighbor of x.
E.g.
The search process is as follows: first, find in the kd-tree the leaf node D containing point S, and take D as the approximate nearest neighbor. The true nearest neighbor must lie inside the circle centered at point S and passing through point D. Then fall back to the parent node B of node D and search the region of B's other child node F for a nearer point. The region of node F does not intersect the circle, so it cannot contain the nearest neighbor. Continue falling back to the higher parent node A and search the region of A's other child node C; the region of node C intersects the circle, and that region contains instance point E inside the circle. Point E is closer to point S than D, so it becomes the new nearest neighbor.
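A minimal Python sketch of Algorithm 3.3 follows, reusing the KDNode tree built in the construction sketch above; the function names and the query point are illustrative assumptions, and Euclidean distance is assumed.

```python
import math


def dist(a, b):
    return math.dist(a, b)  # Euclidean distance


def kd_nearest(node, target, best=None):
    """Return the point in the kd-tree nearest to `target` (Algorithm 3.3 sketch)."""
    if node is None:
        return best
    # Step (3) A: update the current nearest point if this node's instance is closer.
    if best is None or dist(target, node.point) < dist(target, best):
        best = node.point
    axis = node.axis
    # Step (1): descend first into the child whose region contains the target point.
    if target[axis] < node.point[axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = kd_nearest(near, target, best)
    # Step (3) B: the other child's region can contain a closer point only if it
    # intersects the hypersphere centered at the target with radius dist(target, best).
    if abs(target[axis] - node.point[axis]) < dist(target, best):
        best = kd_nearest(far, target, best)
    return best


# Usage with the `tree` built in the previous sketch.
print(kd_nearest(tree, (3, 4.5)))  # -> (2, 3)
```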
Supplement:
Cross-validation: the basic idea is to reuse the data: split the given data into training and test sets, and on that basis repeatedly carry out training, testing, and model selection.
1. Simple cross-validation
First, randomly divide the data into two parts, one used as the training set and the other as the test set; then train models under various conditions (for example, different numbers of parameters) on the training set, evaluate the test error of each model on the test set, and select the model with the smallest test error.
2. S-fold cross-validation
This is the most widely used method. It proceeds as follows: first, randomly partition the data into S disjoint subsets of equal size; then train the model on the data of S − 1 subsets and test it on the remaining subset; repeat this process over the S possible choices, and finally select the model with the smallest average test error over the S evaluations.
3. Leave-one-out cross-validation
The special case of S-fold cross-validation with S = N (where N is the number of samples) is called leave-one-out cross-validation and is often used when data are scarce.
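The S-fold procedure can be written down directly. The sketch below (assuming scikit-learn and numpy; the data set and n_neighbors=3 are arbitrary illustrations) builds the S random subsets by hand, trains on S − 1 of them, and averages the test error:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
S = 5
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X)), S)  # S random subsets of roughly equal size

errors = []
for i in range(S):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(S) if j != i])  # the other S - 1 subsets
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))            # test error on the held-out subset

print(np.mean(errors))  # average test error over the S evaluations
# Leave-one-out cross-validation is the special case S = len(X).
```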