Machine learning: k-nearest neighbor (KNN) algorithm

Source: Internet
Author: User

I. Basic principle

There is a collection of sample data (also called the training sample set), and every sample in the set carries a label, so the category of each sample is known. When new data without a label is entered, each feature of the new data is compared with the corresponding feature of the samples in the set, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered (k is usually an integer no greater than 20), and the category that occurs most often among those k samples is taken as the category of the new data.

II. Algorithm flow

1) Calculate the distance between every point in the labeled data set and the current point;
2) Sort the points in ascending order of distance;
3) Select the k points closest to the current point;
4) Count how often each category occurs among these k points;
5) Return the most frequent category among these k points as the predicted classification of the current point.

III. Characteristics of the algorithm

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity, because every prediction computes the distance to every stored training sample.
Applicable data types: numeric and nominal.

IV. Python code implementation

1. Create a data set

from numpy import array, tile
import operator

def create_data_set():
    # Four two-dimensional sample points and their category labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

2. Implement the KNN algorithm

##############################
# Function: classify an input vector by the k-nearest-neighbor rule
# Input variables: in_x, data_set, labels, k
#   the vector to classify, the sample data, the sample labels, the number of nearest neighbors
# Output variable: sorted_class_count[0][0], the most frequent label among the k nearest samples
##############################

def classify0(in_x, data_set, labels, k):
    data_set_size = data_set.shape[0]  # number of rows (samples) in the array

    # tile(in_x, (data_set_size, 1)) repeats in_x data_set_size times along the rows,
    # giving a matrix with the same shape as data_set
    # each row of data corresponds to the coordinates of one sample point
    # summing the squared differences along each row gives data_set_size values
    # taking the square root yields the Euclidean distances
    diff_mat = tile(in_x, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat ** 2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances ** 0.5

    # argsort returns the indices that sort the array from small to large
    sorted_dist_indicies = distances.argsort()

    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]

        # get behaves like an if...else statement:
        # it returns 0 if vote_label is not in the dictionary, otherwise the count stored for vote_label
        class_count[vote_label] = class_count.get(vote_label, 0) + 1

    # items() returns the key-value pairs of the dictionary as (key, value) tuples
    # operator.itemgetter(0) would pick field 0 of each tuple, that is, the key
    # operator.itemgetter(1) picks field 1 of each tuple, that is, the value (the vote count)
    # operator.itemgetter builds a function that extracts the given field from an object
    # reverse=True sorts in descending order
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)

    return sorted_class_count[0][0]
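
The five steps of the algorithm flow can also be traced without NumPy. The sketch below is only an illustration, not the code used in this article: the function name knn_classify and the use of collections.Counter for the vote are assumptions made here, and the data are the four sample points from create_data_set.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Hypothetical pure-Python counterpart of classify0 (illustration only)."""
    # 1) Euclidean distance from every labeled sample to the query point
    distances = [
        math.sqrt(sum((s - q) ** 2 for s, q in zip(sample, query)))
        for sample in samples
    ]
    # 2) sample indices sorted by increasing distance
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # 3) indices of the k nearest samples
    nearest = order[:k]
    # 4) frequency of each category among those k samples
    votes = Counter(labels[i] for i in nearest)
    # 5) the most frequent category is the prediction
    return votes.most_common(1)[0][0]


samples = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], samples, labels, 3))      # B
print(knn_classify([1.0, 1.2], samples, labels, 3))  # A
```

Each numbered comment corresponds to the step with the same number in the algorithm flow, which may make it easier to match the NumPy calls in classify0 to the description.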

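3. Code test

def main():
    group, labels = create_data_set()
    sorted_class_labels = classify0([0, 0], group, labels, 3)
    print('sorted_class_labels =', sorted_class_labels)

if __name__ == '__main__':
    main()

For the query point [0, 0] with k = 3, the three nearest samples carry the labels B, B and A, so the predicted category is B.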
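
The value of k decides which neighbors take part in the vote, so the same point can receive different labels for different k. As a rough illustration on the same four sample points, the sketch below repeats the prediction for k = 1 to 4. The helper name predict and the query point [0.75, 0.3] are hypothetical choices made here, and ties are resolved in favor of the label met first among the nearest neighbors, which is how collections.Counter orders equal counts in Python 3.7 and later, not something this article specifies.

```python
import math
from collections import Counter

# Same four sample points and labels as create_data_set()
samples = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']


def predict(query, k):
    """Hypothetical helper: majority label among the k nearest samples."""
    # Pair every Euclidean distance with its label, then sort by distance
    by_distance = sorted(
        (math.hypot(sample[0] - query[0], sample[1] - query[1]), label)
        for sample, label in zip(samples, labels)
    )
    # Count the labels of the k nearest samples, nearest first
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


query = [0.75, 0.3]  # arbitrary point chosen for illustration
results = {k: predict(query, k) for k in range(1, 5)}
print(results)  # {1: 'A', 2: 'A', 3: 'B', 4: 'A'}
```

For this point the neighbors, from nearest to farthest, are labeled A, B, B, A. With k = 2 and k = 4 the vote is tied, so the result depends on the tie-breaking rule; with two categories, an odd k avoids such ties.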
