Python support vector machine classification of the MNIST dataset

Source: Internet
Author: User
Tags svm

A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a decision boundary that lies farther from the nearest training data points is better, as a larger margin reduces the generalization error of the classifier.

We call sklearn.svm's SVC to classify the MNIST dataset and report the overall classification accuracy. Two preprocessing methods are compared (binarizing the feature values to 0 or 1, and scaling the feature values into the [0, 1] interval), and two kernel functions are tried (the Gaussian/RBF kernel and the polynomial kernel). In the experiment, the training set and the test set are each divided into 10 parts, and the overall classification accuracy is averaged over the 10 splits, which makes the result more stable and objective. The penalty factor C can be varied to observe its effect and to draw a comparison curve; C = 100 gives good results.

# Task: compare the results of different kernels and draw the corresponding curves to visualize them.
import struct
import time
import numpy as np
from sklearn.svm import SVC  # C-support vector classification

def read_image(file_name):
    # First read the whole file in binary mode
    file_handle = open(file_name, "rb")   # open the file in binary mode
    file_content = file_handle.read()     # read everything into a buffer
    offset = 0
    head = struct.unpack_from('>IIII', file_content, offset)  # take the first 4 integers, returned as a tuple
    offset += struct.calcsize('>IIII')
    img_num = head[1]  # number of images
    rows = head[2]     # number of rows (height)
    cols = head[3]     # number of columns (width)
    # np.empty() allocates without initializing the elements; it is the fastest way to create the array
    images = np.empty((img_num, 784))
    image_size = rows * cols                  # size of a single image
    fmt = '>' + str(image_size) + 'B'         # struct format string for a single image

    for i in range(img_num):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        # images[i] = np.array(struct.unpack_from(fmt, file_content, offset)).reshape((rows, cols))
        offset += struct.calcsize(fmt)
    return images

# Read the labels
def read_label(file_name):
    file_handle = open(file_name, "rb")  # open the file in binary mode
    file_content = file_handle.read()    # read everything into a buffer

    head = struct.unpack_from('>II', file_content, 0)  # take the first 2 integers, returned as a tuple
    offset = struct.calcsize('>II')

    label_num = head[1]  # number of labels
    # print(label_num)
    bits_string = '>' + str(label_num) + 'B'  # struct format string, e.g. '>60000B'
    label = struct.unpack_from(bits_string, file_content, offset)  # take the data, returned as a tuple
    return np.array(label)
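The IDX header parsing above can be checked on synthetic bytes without the real MNIST files; this is a minimal sketch (2051 is the standard IDX magic number for image files):

```python
import struct
import numpy as np

# Build a fake IDX image file in memory: magic 2051, 2 images of 2x2 pixels
pixels = bytes([0, 255, 128, 64,   10, 20, 30, 40])
fake_file = struct.pack('>IIII', 2051, 2, 2, 2) + pixels

magic, img_num, rows, cols = struct.unpack_from('>IIII', fake_file, 0)
offset = struct.calcsize('>IIII')
fmt = '>' + str(rows * cols) + 'B'  # format string for one image

images = np.empty((img_num, rows * cols))
for i in range(img_num):
    images[i] = np.array(struct.unpack_from(fmt, fake_file, offset))
    offset += struct.calcsize(fmt)

print(images[0])  # first image: [0. 255. 128. 64.]
```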

def normalize(data):  # binarize the pixel values into a 0/1 distribution
    m = data.shape[0]
    n = np.array(data).shape[1]
    for i in range(m):
        for j in range(n):
            if data[i, j] != 0:
                data[i, j] = 1
            else:
                data[i, j] = 0
    return data

# An alternative normalization: scale the feature values into the [0, 1] interval
def normalize_new(data):
    m = data.shape[0]
    n = np.array(data).shape[1]
    for i in range(m):
        for j in range(n):
            data[i, j] = float(data[i, j]) / 255
    return data
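Both element-wise loops above can be replaced with vectorized NumPy operations, which are far faster on a 60000 x 784 array; a minimal equivalent sketch (the `_vec` names are illustrative, not from the original code):

```python
import numpy as np

def normalize_vec(data):
    # binarization: any nonzero pixel becomes 1, zero stays 0
    return (np.asarray(data) != 0).astype(float)

def normalize_new_vec(data):
    # scale pixel values from [0, 255] into [0, 1]
    return np.asarray(data, dtype=float) / 255.0

x = np.array([[0, 128, 255]])
print(normalize_vec(x))      # [[0. 1. 1.]]
print(normalize_new_vec(x))  # [[0. 0.50196078 1.]]
```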

def loaddataset():
    train_x_filename = "train-images-idx3-ubyte"
    train_y_filename = "train-labels-idx1-ubyte"
    test_x_filename = "t10k-images-idx3-ubyte"
    test_y_filename = "t10k-labels-idx1-ubyte"
    train_x = read_image(train_x_filename)  # 60000 x 784 matrix
    train_y = read_label(train_y_filename)  # 60000-element vector
    test_x = read_image(test_x_filename)    # 10000 x 784 matrix
    test_y = read_label(test_y_filename)    # 10000-element vector

    # Uncomment one of the two preprocessing methods to compare their final results
    # train_x = normalize(train_x)
    # test_x = normalize(test_x)

    # train_x = normalize_new(train_x)
    # test_x = normalize_new(test_x)

    return train_x, test_x, train_y, test_y

if __name__ == '__main__':
    class_num = 10
    score_train = 0.0
    score = 0.0
    temp = 0.0
    temp_train = 0.0
    print("Start reading data...")
    time1 = time.time()
    train_x, test_x, train_y, test_y = loaddataset()
    time2 = time.time()
    print("Read data cost", time2 - time1, "second")

    print("Start training data...")
    # clf = SVC(C=1.0, kernel='poly')  # polynomial kernel
    clf = SVC(C=0.01, kernel='rbf')    # Gaussian (RBF) kernel

    # Each class appears in roughly equal numbers within every block of 6000
    # training samples, so the data are split directly into 10 consecutive batches
    for i in range(class_num):
        clf.fit(train_x[i*6000:(i+1)*6000, :], train_y[i*6000:(i+1)*6000])
        temp = clf.score(test_x[i*1000:(i+1)*1000, :], test_y[i*1000:(i+1)*1000])
        # print(temp)
        temp_train = clf.score(train_x[i*6000:(i+1)*6000, :], train_y[i*6000:(i+1)*6000])
        print(temp_train)
        score += temp / class_num
        score_train += temp_train / class_num

    time3 = time.time()
    print("Test score: {:.6f}".format(score))
    print("Train score: {:.6f}".format(score_train))
    print("Train data cost", time3 - time2, "second")
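The manual 10-batch split above can also be expressed with scikit-learn's built-in cross-validation. A minimal sketch on the small bundled `digits` dataset (standing in for MNIST here only to keep the example fast and self-contained):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()
x = digits.data / 16.0  # scale pixel values (0-16 in digits) into [0, 1]
y = digits.target

clf = SVC(C=100, kernel='rbf')
scores = cross_val_score(clf, x, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: {:.4f}".format(scores.mean()))
```

`cross_val_score` handles the splitting, refitting, and scoring in one call, so the averaging loop is no longer needed.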

Experimental results: the accuracies for the different kernel functions and values of C after binarization (normalize) were collected and analyzed. The results are shown in the following table:

Parameter                          Binarized (normalize)
{"C": 1, "kernel": "poly"}         {"accuracy": 0.4312, "train time": 558.61}
{"C": 1, "kernel": "rbf"}          {"accuracy": 0.9212, "train time": 163.15}
{"C": 10, "kernel": "poly"}        {"accuracy": 0.8802, "train time": 277.78}
{"C": 10, "kernel": "rbf"}         {"accuracy": 0.9354, "train time": 96.07}
{"C": 100, "kernel": "poly"}       {"accuracy": 0.9427, "train time": 146.43}
{"C": 100, "kernel": "rbf"}        {"accuracy": 0.9324, "train time": 163.99}
{"C": 1000, "kernel": "poly"}      {"accuracy": 0.9519, "train time": 132.59}
{"C": 1000, "kernel": "rbf"}       {"accuracy": 0.9325, "train time": 97.54}
{"C": 10000, "kernel": "poly"}     {"accuracy": 0.9518, "train time": 115.35}
{"C": 10000, "kernel": "rbf"}      {"accuracy": 0.9325, "train time": 115.77}
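To "draw a graph to compare", the tabulated accuracies can be plotted directly against C; a minimal matplotlib sketch (the accuracy values are copied from the table; two of its C entries were garbled in this copy, so 100 and 1000 are assumed for them based on the 1/10/.../10000 progression):

```python
import matplotlib
matplotlib.use('Agg')  # render to a file without a display
import matplotlib.pyplot as plt

c_values = [1, 10, 100, 1000, 10000]
acc_poly = [0.4312, 0.8802, 0.9427, 0.9519, 0.9518]
acc_rbf = [0.9212, 0.9354, 0.9324, 0.9325, 0.9325]

plt.figure()
plt.plot(c_values, acc_poly, marker='o', label='poly kernel')
plt.plot(c_values, acc_rbf, marker='s', label='rbf kernel')
plt.xscale('log')  # C spans four orders of magnitude
plt.xlabel('C (penalty factor)')
plt.ylabel('test accuracy')
plt.legend()
plt.savefig('svm_c_comparison.png')
```

The plot shows the RBF kernel is nearly insensitive to C in this range, while the polynomial kernel needs a large C to become competitive.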

As a further optimization, PCA (principal component analysis) can be applied to reduce the dimensionality of the features before training, which improves both accuracy and speed.
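The PCA code itself did not survive in this copy of the post; a minimal sketch of how PCA could be combined with the SVM using scikit-learn, again on the bundled `digits` dataset (the number of components, 30, is an assumed setting, not from the original):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Reduce dimensionality before training; 30 components is an assumed setting
pca = PCA(n_components=30)
x_train_p = pca.fit_transform(x_train)
x_test_p = pca.transform(x_test)  # apply the same projection to the test set

clf = SVC(C=100, kernel='rbf')
clf.fit(x_train_p, y_train)
print("accuracy: {:.4f}".format(clf.score(x_test_p, y_test)))
```

Note that the PCA projection is fitted on the training set only and then applied unchanged to the test set, to avoid leaking test information into the features.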


