I. Introduction to logistic regression
Logistic regression, also called logistic regression analysis, is a generalized linear regression model commonly used in data mining, automatic disease diagnosis, economic forecasting, and other fields. A typical use is to explore the risk factors of a disease and to predict the probability of the disease occurring from those factors. Taking gastric cancer as an example, choose two groups of people, a gastric cancer group and a non-gastric cancer group; the two groups will differ in their physical signs and lifestyles. The dependent variable is therefore whether the person has gastric cancer, taking the values "yes" or "no", and the independent variables can include many factors, such as age, sex, dietary habits, and Helicobacter pylori infection. The independent variables can be either continuous or categorical. Logistic regression analysis then yields a weight for each independent variable, which tells us which factors are risk factors for gastric cancer. At the same time, the model can predict the likelihood that a person develops the disease from those risk factors.
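As a toy illustration of this idea (the weights and features below are invented for the example, not fitted to any real data), the model combines the risk factors linearly and maps the weighted sum to a probability with the sigmoid function:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical fitted weights: intercept, age, H. pylori infection (1 = infected)
w = np.array([-6.0, 0.08, 1.5])
x = np.array([1.0, 55.0, 1.0])  # a 55-year-old infected person; leading 1 is the intercept term
print(sigmoid(np.dot(w, x)))    # predicted probability of disease, about 0.48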
II. The principle and implementation of logistic regression
The algorithm principle of logistic regression is similar to that of linear regression; the differences lie in the prediction function h and the weight-update rule. Here logistic regression is applied to a multi-class problem: since the MNIST dataset contains 10 kinds of handwritten digit images, we train 10 one-vs-rest classifiers and find the best weight vector for each class. Plugging a sample into each classifier's prediction function gives the probability that the sample belongs to that class, and the class whose classifier produces the largest prediction value is taken as the predicted class.
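A minimal sketch of this one-vs-rest decision rule (predict_one is an illustrative helper, not part of the full program below):

import numpy as np
from scipy.special import expit  # the sigmoid function

def predict_one(x, theta):
    # x: feature vector with a leading bias term, length n+1
    # theta: (n+1) x 10 weight matrix, one column per digit classifier
    probs = expit(np.dot(theta.T, x))  # 10 probabilities, one per class
    return int(np.argmax(probs))       # the class with the largest probability wins

# e.g. with all-zero weights every classifier outputs 0.5 and argmax returns class 0
print(predict_one(np.ones(785), np.zeros((785, 10))))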
III. Introduction to the dataset
The MNIST dataset comes from the National Institute of Standards and Technology (NIST) in the United States. The training set consists of handwritten digits from 250 different people, 50% of them high school students and 50% staff of the Census Bureau; the test set contains handwritten digit data in the same proportions. The training set holds 60000 images with their corresponding labels and the test set holds 10000 images with their corresponding labels, each image being 28*28 pixels. Figure 1 shows an example of a handwritten image from the dataset.
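Such a digit can be rendered from its 784 pixel values; in the sketch below, the random array merely stands in for a real row of the image matrix loaded by the code in the next section:

import numpy as np
import matplotlib.pyplot as plt

img = np.random.rand(28, 28)  # placeholder; with real data use images[i].reshape(28, 28)
plt.imshow(img, cmap="gray")
plt.show()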
IV. Code and results for logistic regression
Code:
import time
import struct
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
# read the images
def read_image(file_name):
    # first read the whole file into memory in binary mode
    with open(file_name, "rb") as file_handle:
        file_content = file_handle.read()
    offset = 0
    head = struct.unpack_from('>IIII', file_content, offset)  # first 4 big-endian integers: magic, count, rows, cols
    offset += struct.calcsize('>IIII')
    img_num = head[1]  # number of images
    rows = head[2]     # image height
    cols = head[3]     # image width
    # np.empty allocates without initialising the elements: the fastest way to create the array
    images = np.empty((img_num, 784))
    image_size = rows * cols  # size of a single image
    fmt = '>' + str(image_size) + 'B'  # format string for a single image
    for i in range(img_num):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        # images[i] = np.array(struct.unpack_from(fmt, file_content, offset)).reshape((rows, cols))
        offset += struct.calcsize(fmt)
    return images
# read the labels
def read_label(file_name):
    with open(file_name, "rb") as file_handle:  # open the file in binary mode
        file_content = file_handle.read()       # read into a buffer
    head = struct.unpack_from('>II', file_content, 0)  # first 2 big-endian integers: magic, count
    offset = struct.calcsize('>II')
    label_num = head[1]  # number of labels
    # print(label_num)
    bits_string = '>' + str(label_num) + 'B'  # format string, e.g. '>60000B'
    label = struct.unpack_from(bits_string, file_content, offset)  # unpack the data, returns a tuple
    return np.array(label)
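# Example use of the two readers (assuming the four unpacked MNIST files from
# http://yann.lecun.com/exdb/mnist/ are in the working directory):
#   images = read_image("train-images-idx3-ubyte")  # -> (60000, 784) array
#   labels = read_label("train-labels-idx1-ubyte")  # -> (60000,) array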
def load_dataset():
    train_x_filename = "train-images-idx3-ubyte"
    train_y_filename = "train-labels-idx1-ubyte"
    test_x_filename = "t10k-images-idx3-ubyte"
    test_y_filename = "t10k-labels-idx1-ubyte"
    train_x = read_image(train_x_filename)
    train_y = read_label(train_y_filename)
    test_x = read_image(test_x_filename)
    test_y = read_label(test_y_filename)
    # to speed things up while debugging, shrink the dataset first
    # train_x = train_x[0:1000, :]
    # train_y = train_y[0:1000]
    # test_x = test_x[0:500, :]
    # test_y = test_y[0:500]
    return train_x, test_x, train_y, test_y
def sigmoid(in_x):
    return 1.0 / (1 + np.exp(-in_x))

# in_x here plays the role of test_data: compute the sigmoid of the dot product
# of the regression weights and the feature vector
def classify_vector(in_x, weights):
    prob = sigmoid(sum(in_x * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
# train_model(train_x, train_y, theta, learning_rate, iteration, num_class)
def train_model(train_x, train_y, theta, iteration_num, learning_rate, num_class):
    # theta is an (n+1) x num_class matrix: one column vector of weights per classifier
    m = train_x.shape[0]
    n = train_x.shape[1]
    train_x = np.insert(train_x, 0, values=1, axis=1)  # prepend a bias column of ones
    j_theta = np.zeros((iteration_num, num_class))     # cost history
    for k in range(num_class):
        # print(k)
        real_y = np.zeros((m, 1))
        index = train_y == k  # boolean mask of the samples whose label equals k
        real_y[index] = 1     # one-vs-rest labels: class k against everything else
        for j in range(iteration_num):
            # print(j)
            temp_theta = theta[:, k].reshape((n + 1, 1))
            h_theta = expit(np.dot(train_x, temp_theta)).reshape((m, 1))  # m x 1 column of probabilities
            # cross-entropy cost of classifier k at iteration j
            j_theta[j, k] = (np.dot(np.log(h_theta).T, real_y)
                             + np.dot((1 - real_y).T, np.log(1 - h_theta))).item() / (-m)
            temp_theta = temp_theta + learning_rate * np.dot(train_x.T, (real_y - h_theta))
            theta[:, k] = temp_theta.reshape((n + 1,))
    return theta  # the returned theta is the (n+1) x num_class matrix
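# Note: the update above, theta := theta + learning_rate * X^T (real_y - h_theta),
# is batch gradient ascent on the Bernoulli log-likelihood (equivalently, gradient
# descent on the cross-entropy cost j_theta); each iteration uses all m samples.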
# theta here is the best theta found by training, an (n+1) x num_class matrix
def predict(test_x, test_y, theta, num_class):
    error_count = 0
    test_x = np.insert(test_x, 0, values=1, axis=1)  # prepend the bias column
    m = test_x.shape[0]
    # h_theta is m x num_class, because test_x is m x (n+1) and theta is (n+1) x num_class
    h_theta = expit(np.dot(test_x, theta))
    h_theta_max = h_theta.max(axis=1)                # largest probability in each row, an m-vector
    h_theta_max_position = h_theta.argmax(axis=1)    # label with the largest probability in each row
    for i in range(m):
        if test_y[i] != h_theta_max_position[i]:
            error_count += 1
    error_rate = float(error_count) / m
    print("error_rate", error_rate)
    return error_rate
def multi_predict(test_x, test_y, theta, num_class):
    num_predict = 10
    error_sum = 0
    for k in range(num_predict):
        error_sum += predict(test_x, test_y, theta, num_class)
    print("the average error rate over %d runs is: %f" % (num_predict, error_sum / float(num_predict)))
if __name__ == '__main__':
    print("Start reading data...")
    time1 = time.time()
    train_x, test_x, train_y, test_y = load_dataset()
    time2 = time.time()
    print("read data cost", time2 - time1, "second")
    num_class = 10
    iteration = 1
    learning_rate = 0.001
    n = test_x.shape[1] + 1
    # there are num_class classifiers, each needing an n x 1 column of weights,
    # so theta is an n x num_class matrix (np.random.rand(n, num_class) would
    # also work as a random initialisation)
    theta = np.zeros((n, num_class))
    print("Start training data...")
    theta_new = train_model(train_x, train_y, theta, iteration, learning_rate, num_class)
    time3 = time.time()
    print("train data cost", time3 - time2, "second")
    print("Start predicting data...")
    predict(test_x, test_y, theta_new, num_class)
    time4 = time.time()
    print("predict data cost", time4 - time3, "second")
Results:
Experiment: logistic regression classification of the MNIST dataset
The learning rate used in the experiment was 0.001. The classification error rate as a function of the number of iterations is shown in Table 2.
Table 2 Classification error rate as the number of iterations changes

Number of iterations      | 1    | 10   | 100  | 1000
Classification error rate | 0.90 | 0.35 | 0.15 | 0.18
As can be seen from Table 2, the classification error rate drops sharply as the number of iterations increases, falling from 0.90 after 1 iteration to 0.15 after 100 iterations; at 1000 iterations it rises slightly to 0.18, which may suggest that with a fixed learning rate the weights begin to overfit or oscillate.