Python support vector machine classification of the MNIST dataset

Source: Internet
Author: User
Tags svm

A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a decision boundary that lies farther from the nearest training data points is better, as a larger margin reduces the generalization error of the classifier.

We call sklearn.svm's SVC to classify the MNIST dataset and report the overall classification accuracy. Two preprocessing methods are compared (binarizing the feature values to 0 or 1, and scaling the feature values into the [0, 1] interval), and two kernel functions are tried (the Gaussian/RBF kernel and the polynomial kernel). In the experiment, the training set and the test set are each divided into 10 parts, and the overall classification accuracy is averaged over the 10 splits, which makes the result more stable and objective. The penalty factor C can be varied to observe its effect and to draw a comparison curve; C = 100 gives good results.

# Task: compare the results of different kernels and draw the corresponding curves to visualize them.
import struct
import time
import numpy as np
from sklearn.svm import SVC  # C-support vector classification

def read_image(file_name):
    # First read the whole file in binary mode
    file_handle = open(file_name, "rb")   # open the file in binary mode
    file_content = file_handle.read()     # read everything into a buffer
    offset = 0
    head = struct.unpack_from('>IIII', file_content, offset)  # take the first 4 integers, returned as a tuple
    offset += struct.calcsize('>IIII')
    img_num = head[1]  # number of images
    rows = head[2]     # number of rows (height)
    cols = head[3]     # number of columns (width)
    # np.empty() allocates without initializing the elements; it is the fastest way to create the array
    images = np.empty((img_num, 784))
    image_size = rows * cols                  # size of a single image
    fmt = '>' + str(image_size) + 'B'         # struct format string for a single image

    for i in range(img_num):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        # images[i] = np.array(struct.unpack_from(fmt, file_content, offset)).reshape((rows, cols))
        offset += struct.calcsize(fmt)
    return images

# Read the labels
def read_label(file_name):
    file_handle = open(file_name, "rb")  # open the file in binary mode
    file_content = file_handle.read()    # read everything into a buffer

    head = struct.unpack_from('>II', file_content, 0)  # take the first 2 integers, returned as a tuple
    offset = struct.calcsize('>II')

    label_num = head[1]  # number of labels
    # print(label_num)
    bits_string = '>' + str(label_num) + 'B'  # struct format string, e.g. '>60000B'
    label = struct.unpack_from(bits_string, file_content, offset)  # take the data, returned as a tuple
    return np.array(label)
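The IDX header parsing above can be checked on synthetic bytes without the real MNIST files; this is a minimal sketch (2051 is the standard IDX magic number for image files):

```python
import struct
import numpy as np

# Build a fake IDX image file in memory: magic 2051, 2 images of 2x2 pixels
pixels = bytes([0, 255, 128, 64,   10, 20, 30, 40])
fake_file = struct.pack('>IIII', 2051, 2, 2, 2) + pixels

magic, img_num, rows, cols = struct.unpack_from('>IIII', fake_file, 0)
offset = struct.calcsize('>IIII')
fmt = '>' + str(rows * cols) + 'B'  # format string for one image

images = np.empty((img_num, rows * cols))
for i in range(img_num):
    images[i] = np.array(struct.unpack_from(fmt, fake_file, offset))
    offset += struct.calcsize(fmt)

print(images[0])  # first image: [0. 255. 128. 64.]
```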

def normalize(data):  # binarize the pixel values into a 0/1 distribution
    m = data.shape[0]
    n = np.array(data).shape[1]
    for i in range(m):
        for j in range(n):
            if data[i, j] != 0:
                data[i, j] = 1
            else:
                data[i, j] = 0
    return data

# An alternative normalization: scale the feature values into the [0, 1] interval
def normalize_new(data):
    m = data.shape[0]
    n = np.array(data).shape[1]
    for i in range(m):
        for j in range(n):
            data[i, j] = float(data[i, j]) / 255
    return data
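Both element-wise loops above can be replaced with vectorized NumPy operations, which are far faster on a 60000 x 784 array; a minimal equivalent sketch (the `_vec` names are illustrative, not from the original code):

```python
import numpy as np

def normalize_vec(data):
    # binarization: any nonzero pixel becomes 1, zero stays 0
    return (np.asarray(data) != 0).astype(float)

def normalize_new_vec(data):
    # scale pixel values from [0, 255] into [0, 1]
    return np.asarray(data, dtype=float) / 255.0

x = np.array([[0, 128, 255]])
print(normalize_vec(x))      # [[0. 1. 1.]]
print(normalize_new_vec(x))  # [[0. 0.50196078 1.]]
```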

def loaddataset():
    train_x_filename = "train-images-idx3-ubyte"
    train_y_filename = "train-labels-idx1-ubyte"
    test_x_filename = "t10k-images-idx3-ubyte"
    test_y_filename = "t10k-labels-idx1-ubyte"
    train_x = read_image(train_x_filename)  # 60000 x 784 matrix
    train_y = read_label(train_y_filename)  # 60000-element vector
    test_x = read_image(test_x_filename)    # 10000 x 784 matrix
    test_y = read_label(test_y_filename)    # 10000-element vector

    # Uncomment one of the two preprocessing methods to compare their final results
    # train_x = normalize(train_x)
    # test_x = normalize(test_x)

    # train_x = normalize_new(train_x)
    # test_x = normalize_new(test_x)

    return train_x, test_x, train_y, test_y

if __name__ == '__main__':
    class_num = 10
    score_train = 0.0
    score = 0.0
    temp = 0.0
    temp_train = 0.0
    print("Start reading data...")
    time1 = time.time()
    train_x, test_x, train_y, test_y = loaddataset()
    time2 = time.time()
    print("Read data cost", time2 - time1, "second")

    print("Start training data...")
    # clf = SVC(C=1.0, kernel='poly')  # polynomial kernel
    clf = SVC(C=0.01, kernel='rbf')    # Gaussian (RBF) kernel

    # Each class appears in roughly equal numbers within every block of 6000
    # training samples, so the data are split directly into 10 consecutive batches
    for i in range(class_num):
        clf.fit(train_x[i*6000:(i+1)*6000, :], train_y[i*6000:(i+1)*6000])
        temp = clf.score(test_x[i*1000:(i+1)*1000, :], test_y[i*1000:(i+1)*1000])
        # print(temp)
        temp_train = clf.score(train_x[i*6000:(i+1)*6000, :], train_y[i*6000:(i+1)*6000])
        print(temp_train)
        score += temp / class_num
        score_train += temp_train / class_num

    time3 = time.time()
    print("Test score: {:.6f}".format(score))
    print("Train score: {:.6f}".format(score_train))
    print("Train data cost", time3 - time2, "second")
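The manual 10-batch split above can also be expressed with scikit-learn's built-in cross-validation. A minimal sketch on the small bundled `digits` dataset (standing in for MNIST here only to keep the example fast and self-contained):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()
x = digits.data / 16.0  # scale pixel values (0-16 in digits) into [0, 1]
y = digits.target

clf = SVC(C=100, kernel='rbf')
scores = cross_val_score(clf, x, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: {:.4f}".format(scores.mean()))
```

`cross_val_score` handles the splitting, refitting, and scoring in one call, so the averaging loop is no longer needed.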

Experimental results: the accuracies for the different kernel functions and values of C after binarization (normalize) were collected and analyzed. The results are shown in the following table:

Parameter                          Binarized (normalize)
{"C": 1, "kernel": "poly"}         {"accuracy": 0.4312, "train time": 558.61}
{"C": 1, "kernel": "rbf"}          {"accuracy": 0.9212, "train time": 163.15}
{"C": 10, "kernel": "poly"}        {"accuracy": 0.8802, "train time": 277.78}
{"C": 10, "kernel": "rbf"}         {"accuracy": 0.9354, "train time": 96.07}
{"C": 100, "kernel": "poly"}       {"accuracy": 0.9427, "train time": 146.43}
{"C": 100, "kernel": "rbf"}        {"accuracy": 0.9324, "train time": 163.99}
{"C": 1000, "kernel": "poly"}      {"accuracy": 0.9519, "train time": 132.59}
{"C": 1000, "kernel": "rbf"}       {"accuracy": 0.9325, "train time": 97.54}
{"C": 10000, "kernel": "poly"}     {"accuracy": 0.9518, "train time": 115.35}
{"C": 10000, "kernel": "rbf"}      {"accuracy": 0.9325, "train time": 115.77}
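To "draw a graph to compare", the tabulated accuracies can be plotted directly against C; a minimal matplotlib sketch (the accuracy values are copied from the table; two of its C entries were garbled in this copy, so 100 and 1000 are assumed for them based on the 1/10/.../10000 progression):

```python
import matplotlib
matplotlib.use('Agg')  # render to a file without a display
import matplotlib.pyplot as plt

c_values = [1, 10, 100, 1000, 10000]
acc_poly = [0.4312, 0.8802, 0.9427, 0.9519, 0.9518]
acc_rbf = [0.9212, 0.9354, 0.9324, 0.9325, 0.9325]

plt.figure()
plt.plot(c_values, acc_poly, marker='o', label='poly kernel')
plt.plot(c_values, acc_rbf, marker='s', label='rbf kernel')
plt.xscale('log')  # C spans four orders of magnitude
plt.xlabel('C (penalty factor)')
plt.ylabel('test accuracy')
plt.legend()
plt.savefig('svm_c_comparison.png')
```

The plot shows the RBF kernel is nearly insensitive to C in this range, while the polynomial kernel needs a large C to become competitive.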

As a further optimization, PCA (principal component analysis) can be applied to reduce the dimensionality of the features before training, which improves both accuracy and speed.
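The PCA code itself did not survive in this copy of the post; a minimal sketch of how PCA could be combined with the SVM using scikit-learn, again on the bundled `digits` dataset (the number of components, 30, is an assumed setting, not from the original):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Reduce dimensionality before training; 30 components is an assumed setting
pca = PCA(n_components=30)
x_train_p = pca.fit_transform(x_train)
x_test_p = pca.transform(x_test)  # apply the same projection to the test set

clf = SVC(C=100, kernel='rbf')
clf.fit(x_train_p, y_train)
print("accuracy: {:.4f}".format(clf.score(x_test_p, y_test)))
```

Note that the PCA projection is fitted on the training set only and then applied unchanged to the test set, to avoid leaking test information into the features.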


