Preface
This paper shows how to use the KNN and SVM algorithms from the scikit-learn library for handwritten digit recognition.
Data Description:
The data has 785 columns: the first column is the label, and the remaining 784 columns store the pixel values (0~255) of a 28*28 = 784 grayscale image.
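As a concrete illustration of this layout, here is a minimal sketch (assuming the same train.csv file used by the programs later in this paper) that reads one sample and restores its 28*28 image:

import pandas as pd

# Column 0 is the label, columns 1..784 are pixel values (0~255)
data = pd.read_csv('train.csv')
first_label = data.values[0, 0]                    # label of the first sample
first_image = data.values[0, 1:].reshape(28, 28)   # 784 pixels -> 28x28 grayscale image
print(first_label, first_image.shape)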
Installing the scikit-learn library
I went through many installation tutorials without success. In the end, the official website's installation documentation was all that was needed: just follow its steps and scikit-learn installs successfully.
Functions:
Principal Component Analysis (PCA):
PCA is a technique for analyzing and simplifying data sets. It is often used to reduce the dimensionality of a dataset while preserving the features that contribute most to its variance. The procedure is to find the eigenvalues and eigenvectors of the covariance matrix, keep the lower-order principal components, and discard the higher-order ones; the low-order components tend to retain the most important aspects of the data.
Cf. SVD (singular value decomposition):
In practice, SVD is usually used instead, because computing PCA directly through the covariance matrix is more expensive.
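To make the relationship concrete, here is a minimal NumPy sketch (my own illustration, not scikit-learn's internal implementation) showing that the principal directions obtained from the covariance matrix's eigenvectors match those obtained from an SVD of the centered data, up to sign:

import numpy as np

X = np.random.rand(100, 5)          # toy data: 100 samples, 5 features
Xc = X - X.mean(axis=0)             # center the data

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # sort components by decreasing variance
components_eig = eigvecs[:, order].T

# Route 2: SVD of the centered data (what is typically used in practice)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components_svd = Vt

# The leading principal directions agree up to sign
print(np.allclose(np.abs(components_eig[0]), np.abs(components_svd[0])))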
from sklearn.decomposition import PCA
# import PCA from sklearn
pca = PCA(n_components=0.8, whiten=True)
# set the PCA parameters
# n_components:
#   set to an integer greater than zero: that many principal components are kept
#   set to a fraction: enough components are kept so that their cumulative share of the total variance exceeds n
# whiten:
#   True applies whitening, which rescales the transformed data so the component variances are consistent
pca.fit_transform(data)
# fit the model to data and return the reduced representation
pca.transform(data)
# apply the already-fitted reduction to data
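As a quick check of these parameters (a toy example on random data, just to show the behavior), a fractional n_components keeps exactly as many components as needed to cover the requested share of variance:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)               # toy data: 200 samples, 50 features
pca = PCA(n_components=0.8, whiten=True)  # keep components covering >= 80% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_)                   # how many components were kept
print(pca.explained_variance_ratio_.sum()) # cumulative variance ratio, >= 0.8
print(X_reduced.shape)                     # (200, n_components_)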
from sklearn.neighbors import KNeighborsClassifier
# import the KNN algorithm from the scikit-learn library
neighbors = KNeighborsClassifier(n_neighbors=k)
# k (n_neighbors) is the number of neighbors; the fitted classifier also offers
# kneighbors([X, n_neighbors, return_distance]) to find the K nearest neighbors of a point
neighbors.fit(training_data, target_values)
# train on the training inputs and outputs
pre = neighbors.predict(test_samples)
# predict on the test inputs; returns the predicted labels
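To see these calls end to end, here is a small self-contained sketch on scikit-learn's built-in digits dataset (a toy example of mine, not this paper's train.csv):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                    # 8x8 digit images, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

neighbors = KNeighborsClassifier(n_neighbors=4)
neighbors.fit(X_train, y_train)           # train on inputs and labels
pre = neighbors.predict(X_test)           # predicted labels for the test inputs
dist, idx = neighbors.kneighbors(X_test[:1])  # 4 nearest training neighbors of one point
print((pre == y_test).mean())             # accuracy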
KNN Complete program and annotations
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
import time

if __name__ == "__main__":
    train_num = 20000   # first 20000 rows for training
    test_num = 30000    # rows 20000..30000 for testing
    data = pd.read_csv('train.csv')
    train_data = data.values[0:train_num, 1:]   # pixel columns
    train_label = data.values[0:train_num, 0]   # label column
    test_data = data.values[train_num:test_num, 1:]
    test_label = data.values[train_num:test_num, 0]
    t = time.time()
    pca = PCA(n_components=0.8)                 # keep 80% of the variance
    train_x = pca.fit_transform(train_data)
    test_x = pca.transform(test_data)
    neighbors = KNeighborsClassifier(n_neighbors=4)
    neighbors.fit(train_x, train_label)
    pre = neighbors.predict(test_x)
    acc = float((pre == test_label).sum()) / len(test_x)
    print(u'accuracy: %f, spend time: %.2fs' % (acc, time.time() - t))
Result: accuracy: 0.946000, elapsed time: 7.98s
SVM Method:
Support Vector Machine (SVM) is a supervised learning method that is widely used in statistical classification and regression analysis.
A support vector machine constructs one or more hyperplanes in a high-dimensional space to separate the data points; these hyperplanes form the classification boundary. Intuitively, a good classification boundary is as far as possible from the nearest training data. In support vector machines, the distance between the classification boundary and the nearest training data point is called the margin, and the goal of the SVM is to find the maximum-margin hyperplane as the classification boundary.
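A minimal sketch of the margin idea (toy 2D data of my own, linear kernel): for a linear SVM the margin width is 2/||w||, which can be read off the fitted coefficients:

import numpy as np
from sklearn import svm

# Two linearly separable point clouds in 2D
X = np.array([[0, 0], [1, 1], [0, 1], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1000)  # large C: nearly hard-margin
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)         # distance between the two margin boundaries
print(margin)
print(clf.support_vectors_)              # the training points closest to the boundary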
from sklearn import svm
# import svm from the sklearn library
The SVC function:
svc = svm.SVC(C=1.0, kernel='rbf', degree=3)
# C is the penalty parameter
# kernel selects the kernel method; common kernels are: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
svc.fit(X, y, sample_weight=None)
# train on the training inputs and outputs
svc.predict(X)
# predict on the test inputs; returns the predicted labels
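For a quick feel of the kernel options listed above, here is a small sketch of mine comparing them on scikit-learn's built-in digits dataset ('precomputed' is omitted since it requires a precomputed kernel matrix):

from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    svc = svm.SVC(kernel=kernel, C=1.0)
    svc.fit(X_train, y_train)
    print(kernel, (svc.predict(X_test) == y_test).mean())  # per-kernel accuracy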
SVM Complete program and annotations
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import svm
import time

if __name__ == "__main__":
    train_num = 5000    # first 5000 rows for training
    test_num = 7000     # rows 5000..7000 (2000 samples) for testing
    data = pd.read_csv('train.csv')
    train_data = data.values[0:train_num, 1:]
    train_label = data.values[0:train_num, 0]
    test_data = data.values[train_num:test_num, 1:]
    test_label = data.values[train_num:test_num, 0]
    t = time.time()
    # SVM method
    pca = PCA(n_components=0.8, whiten=True)
    train_x = pca.fit_transform(train_data)
    test_x = pca.transform(test_data)
    svc = svm.SVC(kernel='rbf', C=10)
    svc.fit(train_x, train_label)
    pre = svc.predict(test_x)
    acc = float((pre == test_label).sum()) / len(test_x)
    print(u'accuracy: %f, elapsed time: %.2fs' % (acc, time.time() - t))
Result: accuracy: 0.953000, elapsed time: 13.95s
Comparison:
Training on 5,000 samples and testing on 2,000, SVM is more accurate than KNN but takes longer to run.
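The comparison can be reproduced with a compact harness like the sketch below (assuming the same train.csv and the same 5000/2000 split as above; exact numbers will vary by machine):

import time
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

data = pd.read_csv('train.csv')
train_num, test_num = 5000, 7000
train_data, train_label = data.values[:train_num, 1:], data.values[:train_num, 0]
test_data, test_label = data.values[train_num:test_num, 1:], data.values[train_num:test_num, 0]

# Shared PCA preprocessing so both classifiers see the same features
pca = PCA(n_components=0.8, whiten=True)
train_x = pca.fit_transform(train_data)
test_x = pca.transform(test_data)

for name, clf in [('KNN', KNeighborsClassifier(n_neighbors=4)),
                  ('SVM', svm.SVC(kernel='rbf', C=10))]:
    t = time.time()
    clf.fit(train_x, train_label)
    acc = (clf.predict(test_x) == test_label).mean()
    print('%s accuracy: %f, elapsed time: %.2fs' % (name, acc, time.time() - t))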