Machine Learning Notes (II): k-Nearest Neighbor Algorithm

Source: Internet
Author: User

Overview of the k-nearest neighbor algorithm

The k-nearest neighbor algorithm classifies data by measuring the distances between the feature values of different samples.

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

Disadvantages: high computational complexity, high space complexity

Applicable data types: numeric and nominal

How it works: We start with a collection of sample data (also called the training sample set) in which every sample carries a label, so we know which category each sample belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the samples in the set, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered, which is where the k in the k-nearest neighbor algorithm comes from; k is usually a small integer, typically no more than 20. Finally, the category that occurs most often among those k most similar samples is taken as the category of the new data.

Analysis of the k-nearest neighbor algorithm code:

For each point in the dataset whose category is unknown, do the following:

(1) Calculate the distance between every point in the dataset of known categories and the current point;

(2) Sort the points in ascending order of distance;

(3) Select the k points closest to the current point;

(4) Determine how often each category occurs among these k points;

(5) Return the category that occurs most often among these k points as the predicted category of the current point.

The code is as follows:

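# Assumes "from numpy import *" and "import operator", as in the knn.py file later in this note.
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                    # number of training samples
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference between inX and every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()          # sample indices, nearest first
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Python 3: dict.items() replaces the dict.iteritems() of Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]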
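The argsort call in the function above is not covered by the annotations that follow, so here is a minimal sketch of what it returns. The distances and labels are made-up values for illustration only, and the sketch uses the np alias instead of the wildcard import used in this note.

```python
import numpy as np

# Hypothetical distances from a query point to four training samples.
distances = np.array([1.49, 1.41, 0.0, 0.1])

# argsort returns the indices that would sort the array in ascending order,
# so the first entries point at the nearest samples.
order = distances.argsort()
print(order)

# Hypothetical labels, aligned with the distances above.
labels = ['A', 'A', 'B', 'B']
k = 3
nearest_labels = [labels[i] for i in order[:k]]
print(nearest_labels)
```

Because argsort returns positions rather than sorted values, the same positions can be used to look up the matching labels, which is how classify0 uses them.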

Code annotations:

[1] shape[0] gives the number of rows of the matrix, and shape[1] gives the number of columns.

[2] tile(inX, (dataSetSize, 1)) repeats the array inX dataSetSize times along the rows and once along the columns. For example, if inX is [0, 0], the result of tile is:

[0, 0]
[0, 0]
[0, 0]
[0, 0]
... (dataSetSize rows in total)
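As a small illustration of the tile step, the sketch below builds the repeated array for a four-sample dataset and subtracts the dataset from it. It also shows, as an aside that is not part of the original code, that NumPy broadcasting gives the same differences without calling tile. The variable names tiled, diff_tiled and diff_broadcast are chosen here for illustration.

```python
import numpy as np

inX = np.array([0, 0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

# Explicit repetition: one copy of inX for each row of dataSet.
tiled = np.tile(inX, (dataSet.shape[0], 1))
print(tiled.shape)

diff_tiled = tiled - dataSet

# Broadcasting stretches the 1-D inX across the rows of dataSet automatically.
diff_broadcast = inX - dataSet
print(np.array_equal(diff_tiled, diff_broadcast))
```

Both forms give the same result; tile makes the repetition explicit, which may be easier to follow when reading the algorithm for the first time.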

[3] ** is the power operator; diffMat**2 squares every element of diffMat. For example, array([1, 2])**2 = array([1, 4]).

[4] sqDiffMat.sum(axis=1) computes the sum of the elements in each row, and these sums form a new array:

Example:

>>> a = array([[1, 2], [2, 4]])
>>> s = a.sum(axis=1)
>>> s
array([3, 6])
>>> a = array([[1, 2, 3], [2, 3, 4]])
>>> s = a.sum(axis=1)
>>> s
array([6, 9])
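The axis argument names a dimension of the array, so its meaning depends on how many dimensions the array has. The following sketch contrasts a two-dimensional array with a one-dimensional one. The names row_sums, col_sums, v and v_2d are illustrative, and the exception is caught broadly because the exact exception class differs between NumPy versions.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4]])
row_sums = a.sum(axis=1)   # one sum per row
col_sums = a.sum(axis=0)   # one sum per column
print(row_sums, col_sums)

v = np.array([1, 2])       # one-dimensional: shape (2,), so there is no axis 1
total = v.sum()
try:
    v.sum(axis=1)
    axis1_ok = True
except Exception:          # exact exception class depends on the NumPy version
    axis1_ok = False
print(total, axis1_ok)

# Reshaping to a single-row matrix makes axis=1 valid again.
v_2d = v.reshape(1, -1)
print(v_2d.sum(axis=1))
```

In classify0 the dataset is always two-dimensional (one row per sample), so sum(axis=1) is safe there.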

Note that a one-dimensional array such as array([1, 2]) has no axis 1, so sum(axis=1) raises an error; use sum() instead.

[5] classCount = {} creates a new dict. A dict provides the get method: if the key does not exist, get returns None or a value that you specify. Here, classCount.get(voteIlabel, 0) returns 0 when classCount has no entry for the key voteIlabel.

For example:

>>> d = {'Michael': 95, 'Bob': 75, 'Tracy': 85}
>>> d['Michael']
95
>>> d['Thomas']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Thomas'

To avoid the error raised by a missing key, there are two options. The first is to check whether the key exists with in:

>>> 'Thomas' in d
False

The second is the get method provided by dict; if the key does not exist, it returns None or a value that you specify (a None result is not displayed in the interactive interpreter):

>>> d.get('Thomas')
>>> d.get('Thomas', -1)
-1
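Applied to voting, the same get pattern lets a tally start from an empty dict without checking for each key first. This is a minimal sketch; the votes list is a made-up stand-in for the labels of the k nearest samples.

```python
# Hypothetical labels of the k nearest samples.
votes = ['B', 'B', 'A']

class_count = {}
for label in votes:
    # get returns 0 the first time a label is seen, so the count starts at 1.
    class_count[label] = class_count.get(label, 0) + 1

print(class_count)
```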

"6"sorted () by classcount dictionary 2 Elements (that is, the number of occurrences of a category) from large to small

Test the code and check the result:

The knn.py file:

from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                    # number of training samples
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference between inX and every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()          # sample indices, nearest first
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Python 3: dict.items() replaces the dict.iteritems() of Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
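A possible way to exercise these two functions is sketched below. To keep the example self-contained, it restates them in a Python 3 form with the np alias rather than importing knn.py. The query points [0, 0] and [1.0, 1.2] and the choice k = 3 are assumptions made for this illustration.

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    """Restated from knn.py, using the np alias and Python 3's dict.items()."""
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()

# A point at the origin: two of its three nearest samples are labeled 'B'.
near_origin = classify0([0, 0], group, labels, 3)
# A point near (1, 1): two of its three nearest samples are labeled 'A'.
near_top = classify0([1.0, 1.2], group, labels, 3)

print(near_origin)  # B
print(near_top)     # A
```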
