Machine Learning: KNN Algorithm Case Studies

Last Update:2015-04-17 Source: Internet

Author: User


I. Improving match quality on a dating site

Target variable to predict, one of three classes: people you did not like (didntLike), people you liked in small doses (smallDoses), and people you liked in large doses (largeDoses).

Sample features: frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week.

The dating data is stored in the text file datingTestSet.txt, one sample per row, 1000 rows in total. Each row holds the three feature values followed by the class label, separated by tabs.

Implementation steps

1. Parsing the data in a text file

###################################
# Purpose: parse the data in a text file into matrix form
# Input:   filename            path of the data file (string)
# Output:  return_mat          matrix built from the file
#          class_label_vector  class label vector
###################################
from numpy import zeros

def file2matrix(filename):
    with open(filename, 'r') as fr:  # open the file read-only
        # readlines() reads all the data at once so the file can be released quickly
        array_of_lines = fr.readlines()
    number_of_lines = len(array_of_lines)  # number of lines in the file
    return_mat = zeros((number_of_lines, 3))  # create the return matrix

    class_label_vector = []
    index = 0

    label_dict = {'largeDoses': 3, 'smallDoses': 2, 'didntLike': 1}

    # readlines() returned a list of rows; process them one by one
    for line in array_of_lines:

        # strip the whitespace around the string, then split it on the tab delimiter
        list_form_line = line.strip().split('\t')

        return_mat[index, :] = list_form_line[0:3]  # the first three columns are the features

        # a negative index selects the last column of list_form_line,
        # which names one of the three classes; store its numeric code
        class_label_vector.append(label_dict[list_form_line[-1]])

        index += 1
    return return_mat, class_label_vector

2. Normalizing feature values

When features have very different value ranges, the feature with the largest range dominates the distance calculation. To avoid this imbalance, numeric values are usually normalized. The function below rescales every feature to the interval from 0 to 1 using new_value = (old_value - min) / (max - min).

###################################
# Purpose: normalize feature values
# Input:   data_set       sample data
# Output:  norm_data_set  normalized samples
#          ranges         value range (max - min) of each feature
#          min_values     minimum value of each feature
###################################
from numpy import tile

def auto_norm(data_set):
    min_values = data_set.min(0)  # argument 0 takes the minimum of each column rather than of each row
    max_values = data_set.max(0)

    ranges = max_values - min_values

    m = data_set.shape[0]  # number of rows in the array
    diff_data_set = data_set - tile(min_values, (m, 1))
    norm_data_set = diff_data_set / tile(ranges, (m, 1))  # element-wise division

    return norm_data_set, ranges, min_values
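The same rescaling can also be written with NumPy broadcasting instead of tile. The following is only a minimal sketch on a made-up 3 x 3 array; the helper name min_max_normalize and the sample values are assumptions for illustration and do not come from datingTestSet.txt.

```python
import numpy as np

def min_max_normalize(data_set):
    """Rescale each column to [0, 1]; hypothetical helper, same outputs as auto_norm."""
    data_set = np.asarray(data_set, dtype=float)
    min_values = data_set.min(axis=0)            # per-column minimum
    ranges = data_set.max(axis=0) - min_values   # per-column max - min
    # Broadcasting stretches the 1-D min_values and ranges across every row,
    # so no explicit tile() call is needed.
    norm_data_set = (data_set - min_values) / ranges
    return norm_data_set, ranges, min_values

# Illustrative values only: miles per year, % game time, liters of ice cream
sample = np.array([[40000.0, 8.0, 1.0],
                   [10000.0, 2.0, 0.5],
                   [25000.0, 5.0, 0.0]])
norm, ranges, min_values = min_max_normalize(sample)
print(norm)
```

Note that a feature whose values are all identical has a range of 0, so both versions would divide by zero for that column; real data may need a guard for that case.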

3. Test the classifier effect with dating site data

An important task in machine learning is evaluating the accuracy of an algorithm. Typically 90% of the existing data is used as training samples and the remaining 10% is used to test the classifier. The 10% should be selected at random. The test function below simply takes the first 10% of the rows, which matches a random selection only if the rows in the file are in no particular order. For a classifier, the error rate is the performance measure: the number of misclassified test samples divided by the total number of test samples. A perfect classifier has an error rate of 0, and a classifier with an error rate of 1.0 never returns a correct result.
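The random 10% hold-out and the error rate described above can be sketched as follows. This is an illustration under assumptions, not the article's own code: the helper names random_holdout_indices and error_rate are hypothetical, and the fixed seed is only there to make the split repeatable.

```python
import numpy as np

def random_holdout_indices(m, ho_ratio=0.10, seed=0):
    """Return (test_idx, train_idx) for m samples, chosen at random."""
    rng = np.random.default_rng(seed)   # seeded generator, an assumption for repeatability
    perm = rng.permutation(m)           # shuffled row indices 0..m-1
    num_test = int(m * ho_ratio)
    return perm[:num_test], perm[num_test:]

def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted != actual))

test_idx, train_idx = random_holdout_indices(1000)
print(len(test_idx), len(train_idx))

# One wrong answer out of four gives an error rate of 0.25
print(error_rate([1, 2, 3, 3], [1, 2, 3, 1]))
```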

###################################
# Purpose: test the classifier
###################################
def dating_class_test():
    ho_ratio = 0.10  # fraction of the data held out for testing

    dating_data_mat, dating_labels = file2matrix('datingTestSet.txt')
    norm_mat, ranges, min_values = auto_norm(dating_data_mat)

    m = norm_mat.shape[0]
    num_test_vectors = int(m * ho_ratio)  # the first 10% of the rows are the test set

    error_count = 0.0
    for i in range(num_test_vectors):
        # classify each test row against the remaining 90% of the rows
        classifier_result = classify0(norm_mat[i, :], norm_mat[num_test_vectors:m, :],
                                      dating_labels[num_test_vectors:m], 3)
        print("The classifier came back with: %d, the real answer is: %d"
              % (classifier_result, dating_labels[i]))

        if classifier_result != dating_labels[i]:
            error_count += 1.0
    print("The total error rate is: %f" % (error_count / num_test_vectors))
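The test function calls classify0, which this article never defines. A minimal sketch of a k-nearest-neighbors classifier with the same call signature might look like the following. It assumes Euclidean distance and a simple majority vote among the k nearest samples; both are assumptions, since the article does not say which distance measure or tie-breaking rule the original used.

```python
from collections import Counter
import numpy as np

def classify0(in_x, data_set, labels, k):
    """Sketch of kNN: return the most common label among the k nearest rows."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts the input vector from every training row
    diff = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))   # Euclidean distance to each row
    nearest = distances.argsort()[:k]              # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)    # count the labels of those rows
    return votes.most_common(1)[0][0]

# Tiny made-up data set: two points near (1, 1) and two near (0, 0)
group = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
group_labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, group_labels, 3))   # B
```

With k = 3 the query point (0, 0) has two 'B' neighbors and one 'A' neighbor, so the vote returns 'B'.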

4. Judging from the error rate, the classifier performs well. Next, use the classifier to predict how much a given person will be liked.

###################################
# Purpose: read one person's data and predict how much that person will be liked
###################################
from numpy import array

def classify_person():
    result_list = ['not at all', 'in small doses', 'in large doses']

    # frequent flyer miles earned per year
    ff_miles = float(input("Frequent flier miles earned per year: "))
    # percentage of time spent playing video games
    percent_tats = float(input("Percentage of time spent playing video games: "))
    # liters of ice cream consumed per week
    ice_cream = float(input("Liters of ice cream consumed per week: "))

    dating_data_mat, dating_labels = file2matrix('datingTestSet.txt')
    norm_mat, ranges, min_values = auto_norm(dating_data_mat)

    in_arr = array([ff_miles, percent_tats, ice_cream])
    # normalize the input with the training minimums and ranges before classifying
    classifier_result = classify0((in_arr - min_values) / ranges, norm_mat, dating_labels, 3)

    # class codes are 1 to 3, list indices are 0 to 2
    print("You will probably like this person:", result_list[classifier_result - 1])

II. Handwriting recognition system

Each handwritten digit, 0 through 9, is stored as a text file of 32 rows and 32 columns of 0s and 1s, that is, as a binary image.

Target variable to predict: a digit from 0 to 9

Sample features: none beyond the raw pixels; the 1024 pixel values of each image are used directly.

The handwriting data comes in two subdirectories. The directory trainingdigits contains about 2000 examples, roughly 200 samples for each digit from 0 to 9, and the directory testdigits contains about 900 test samples. The data in trainingdigits is used to train the classifier, and the data in testdigits is used to test it.

Implementation steps:

1. Convert the image file data into a vector: flatten the 32*32 binary image matrix into a 1*1024 vector so that the classifier can process the digit image.

###################################
# Purpose: convert a 32*32 binary image into a 1*1024 vector
# Input:   filename       path of the image file (string)
# Output:  return_vector  the converted vector
###################################
from numpy import zeros

def img2vector(filename):
    return_vector = zeros((1, 1024))
    with open(filename, 'r') as fr:  # open the file read-only
        for i in range(32):
            line_str = fr.readline()  # readline() reads one row at a time
            for j in range(32):
                # row i, column j of the image goes to position 32*i + j
                return_vector[0, 32 * i + j] = int(line_str[j])

    return return_vector
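To make the row-major mapping 32*i + j concrete, here is a small self-contained sketch that flattens a synthetic image held in memory instead of a file. The helper flatten_text_image and the test image are made up for illustration; only the indexing rule comes from img2vector.

```python
import numpy as np

def flatten_text_image(lines, size=32):
    """Flatten size x size rows of '0'/'1' characters into a 1 x size*size vector."""
    vec = np.zeros((1, size * size))
    for i in range(size):
        for j in range(size):
            # pixel (row i, column j) lands at position size*i + j
            vec[0, size * i + j] = int(lines[i][j])
    return vec

# Synthetic image: every pixel is 0 except the one at row 2, column 5
lines = ['0' * 32 for _ in range(32)]
lines[2] = '0' * 5 + '1' + '0' * 26
vec = flatten_text_image(lines)

print(vec.shape)              # (1, 1024)
print(int(vec[0].argmax()))   # 69, because 32*2 + 5 = 69
```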

2. Using the KNN algorithm to recognize handwritten digits

###################################
# Purpose: test the handwriting recognition system
###################################
from os import listdir
from numpy import zeros

def hand_writing_class_test():
    hw_labels = []
    training_file_list = listdir('trainingdigits')  # list of file names in the directory, for example '0_0.txt'
    m = len(training_file_list)  # number of training files
    training_mat = zeros((m, 1024))

    for i in range(m):
        file_name_str = training_file_list[i]
        file_str = file_name_str.split('.')[0]  # '0_12.txt' -> '0_12'
        class_num_str = int(file_str.split('_')[0])  # '0_12' -> 0, the digit the image shows
        hw_labels.append(class_num_str)  # add the digit to the label list

        training_mat[i, :] = img2vector('trainingdigits/%s' % file_name_str)  # image -> vector

    test_file_list = listdir('testdigits')
    error_count = 0.0
    m_test = len(test_file_list)

    for i in range(m_test):
        file_name_str = test_file_list[i]
        file_str = file_name_str.split('.')[0]
        class_num_str = int(file_str.split('_')[0])

        vector_under_test = img2vector('testdigits/%s' % file_name_str)

        classifier_result = classify0(vector_under_test, training_mat, hw_labels, 3)
        print("The classifier came back with: %d, the real answer is: %d"
              % (classifier_result, class_num_str))

        if classifier_result != class_num_str:
            error_count += 1.0

    print("The total number of errors is: %d" % error_count)
    print("The total error rate is: %f" % (error_count / float(m_test)))
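The label of each image is encoded in its file name: the part before the underscore is the digit, and the part after it is a sample index. A minimal sketch of that parsing step on its own, with label_from_file_name as a hypothetical helper name:

```python
def label_from_file_name(file_name):
    """Extract the digit label from a name such as '0_12.txt'."""
    stem = file_name.split('.')[0]    # '0_12.txt' -> '0_12'
    return int(stem.split('_')[0])    # '0_12' -> 0

print(label_from_file_name('0_12.txt'))   # 0
print(label_from_file_name('9_45.txt'))   # 9
```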

def main():

    dating_data_mat, dating_labels = file2matrix('datingTestSet.txt')
    print('dating_data_mat =', dating_data_mat)
    print('dating_labels =', dating_labels)

    norm_mat, ranges, min_values = auto_norm(dating_data_mat)
    print('norm_mat =', norm_mat)
    print('ranges =', ranges)
    print('min_values =', min_values)

    dating_class_test()

    classify_person()

    test_vector = img2vector('testdigits/0_13.txt')
    print('test_vector =', test_vector[0, 0:31])
    print('test_vector =', test_vector[0, 32:63])

    hand_writing_class_test()

if __name__ == '__main__':
    main()


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
