"Machine Learning in Action" study notes: the k-nearest neighbors algorithm

Source: Internet
Author: User

First, the idea of the k-nearest neighbors algorithm: We start with a collection of sample data, called the training sample set, in which every sample has a label; that is, we know which class each sample in the set belongs to (a sample is a set of values and can be an n-dimensional vector). When new, unlabeled data comes in, each of its features (each element of the vector) is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). Because the sample set can be very large, we look only at the k most similar samples and assign the new data the label that occurs most often among those k.
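The voting step described above can be sketched in a few lines. This is only an illustration, not code from the book: the function name majority_label and the sample labels are made up for the example, and the labels are assumed to be already ordered from most to least similar.

```python
from collections import Counter

def majority_label(sorted_labels, k):
    """Return the most common label among the first k entries."""
    top_k = sorted_labels[:k]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical labels of training samples, ordered from most to least similar.
neighbors = ['B', 'B', 'A', 'A']
print(majority_label(neighbors, 3))  # prints B (two votes for B, one for A)
```

With k = 3 only the first three labels vote, so the fourth sample has no influence on the result.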

The general flow of the k-nearest neighbors algorithm:

(1) Collect data: this can be local data or data crawled from web pages.

(2) Prepare the data: convert it into a structured format that is easy to work with.

(3) Analyze the data: any method can be used.

(4) Train the algorithm: this step does not apply to the k-nearest neighbors algorithm.

(5) Test the algorithm: calculate the error rate. Formula: error rate = number of misclassified test samples / total number of test samples.

(6) Use the algorithm: input sample data, output structured results, and determine which class the new data belongs to.
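The error rate from step (5) can be computed as in the sketch below. The function name error_rate and the two label lists are hypothetical and only serve to show the arithmetic.

```python
def error_rate(predictions, true_labels):
    """Fraction of test samples whose predicted label differs from the true one."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    if not true_labels:
        raise ValueError("need at least one test sample")
    errors = sum(1 for p, t in zip(predictions, true_labels) if p != t)
    return errors / len(true_labels)

# Hypothetical test run: one wrong prediction out of four.
predicted = ['A', 'B', 'B', 'A']
actual = ['A', 'B', 'A', 'A']
print(error_rate(predicted, actual))  # prints 0.25
```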

Second, an example of using the k-nearest neighbors algorithm

I am using the Spyder development environment with Python 3.5 and the NumPy library that ships with Spyder. Create a new file, knn.py, and complete this chapter's experiments in that file.

Write a data generation function in knn.py:

from numpy import *
import operator

def createDataSet():
    # Four 2-D training samples and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

In Spyder, enter:

>>> import knn
>>> group, labels = knn.createDataSet()
>>> group
array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])
>>> labels
['A', 'A', 'B', 'B']

This output indicates that the function works correctly.

Third, the k-nearest neighbors classification function

  

def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training samples, sorted from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Verify by entering the following in Spyder:

>>> knn.classify([0, 0], group, labels, 3)

The output should be 'B'.
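To see why the answer is 'B', the distances can be worked out by hand. The sketch below repeats the distance step on the same four samples; it uses NumPy broadcasting instead of tile, which is my own substitution and not how the function above is written. The variable names query, order and nearest_labels are chosen just for this example.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0.0, 0.0])

# Broadcasting subtracts the query from every row, so tile() is not needed here.
distances = np.sqrt(((group - query) ** 2).sum(axis=1))

# Indices of the samples from nearest to farthest, then the labels of the 3 nearest.
order = distances.argsort()
nearest_labels = [labels[i] for i in order[:3]]

print(distances.round(3))
print(nearest_labels)
```

The distances come out to roughly 1.487, 1.414, 0 and 0.1, so the three nearest samples carry the labels 'B', 'B' and 'A', and 'B' wins the vote two to one.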

Fourth, an example: improving matches on a dating site

Helen has been collecting dating data for some time and keeps it in the text file Datingdata.txt. Each sample takes up one line, 1000 lines in total (she may have dated 1000 people, which is a bit scary ^_^). Each sample mainly includes the following 3 features:

1. Number of frequent flyer miles earned per year

2. Percentage of time spent playing video games

3. Liters of ice cream consumed per week

The data is stored in a text file with the values separated by spaces. Before the data is fed to the classifier, it must be converted into a format the classifier can handle. In knn.py, a function named file2matrix does this processing.

def file2matrix(filename):
    # Read all lines; the with block closes the file afterwards
    with open(filename, 'r') as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # One row per sample, one column per feature
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # The first three fields are features, the last field is the class label
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

returnMat, classLabelVector = file2matrix('Datingdata.txt')

When I ran this program, I kept getting the error: could not convert string to float: '12 34 56'. The values in my file are separated by spaces, so splitting on '\t' most likely leaves the whole line as a single string. My workaround was to replace the spaces between the values in the text file with ',' and to change

listFromLine = line.split('\t')
to
listFromLine = line.split(',')

This solves the problem, but it is not the best approach and needs to be improved.
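One possible improvement, which I have not tested on the real file, is to call split() with no argument. Python then splits on any run of whitespace, so the same code handles spaces and tabs and the data file does not need to be edited. The helper parse_line and the sample rows below are hypothetical; the sketch assumes each row holds three numeric features followed by an integer label, as file2matrix expects.

```python
def parse_line(line):
    """Split one record on any run of whitespace (spaces or tabs)."""
    fields = line.split()  # no argument: splits on arbitrary whitespace
    features = [float(x) for x in fields[0:3]]
    label = int(fields[-1])
    return features, label

# Hypothetical sample rows; the real file layout is an assumption here.
print(parse_line("12 34 56 1"))
print(parse_line("1.5\t2\t3\t2"))
```

Inside file2matrix this would amount to replacing line.split('\t') with line.split().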