Classifying the MNIST Dataset with Logistic Regression in Python

Last updated: 2018-07-28 Source: Internet

Uploaded by: User


1. Introduction to Logistic Regression

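Logistic regression, also called logistic regression analysis, is a generalized linear regression model that is widely used in data mining, automated disease diagnosis, economic forecasting, and similar fields. A typical use is to investigate the risk factors of a disease and to predict the probability of the disease from those factors. Take gastric cancer as an example: two groups of people are selected, one with gastric cancer and one without, and the two groups will differ in physical signs and lifestyle. The dependent variable is whether a person has gastric cancer, with the value "yes" or "no". The independent variables can be many things, such as age, sex, dietary habits, and Helicobacter pylori infection, and they may be either continuous or categorical. Logistic regression analysis then yields a weight for each independent variable, which gives a rough picture of which factors are risk factors for gastric cancer. The same weights can also be used to predict, from a person's risk factors, how likely that person is to develop cancer.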

2. Principle and Implementation of Logistic Regression

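The logistic regression algorithm follows largely the same steps as linear regression; only the prediction function H and the weight update rule differ. Here logistic regression is applied to a multi-class problem. Because the MNIST dataset consists of handwritten digit images in ten classes, ten classifiers are used, one per class (a one-vs-rest scheme). The best weight vector is learned for each class and plugged into the prediction function. The value of the prediction function acts as a probability, so the class whose prediction function value is largest is the predicted class.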
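The one-vs-rest scheme described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the article's training code: the function names `gradient_step` and `predict_class`, the toy matrix `X`, the learning rate of 0.1, and the 500 iterations are all assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def gradient_step(theta, X, y, learning_rate):
    """One batch update for a single binary classifier.

    X is m x n (its first column is all ones for the bias) and y holds 0/1 targets.
    """
    h = sigmoid(X @ theta)                        # predicted probabilities, shape (m,)
    return theta + learning_rate * X.T @ (y - h)  # step along the log-likelihood gradient


def predict_class(Theta, X):
    """For each row of X, pick the classifier with the largest output."""
    return np.argmax(sigmoid(X @ Theta), axis=1)


# Hypothetical toy data: 4 samples, a bias column plus 2 features, 3 classes.
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, -2.0, -2.0],
              [1.0, 2.5, 0.5]])
labels = np.array([0, 1, 2, 0])

Theta = np.zeros((X.shape[1], 3))                 # one weight column per class
for k in range(3):
    y_k = (labels == k).astype(float)             # one-vs-rest targets for class k
    for _ in range(500):
        Theta[:, k] = gradient_step(Theta[:, k], X, y_k, learning_rate=0.1)

predicted = predict_class(Theta, X)
print(predicted)
```

Each column of `Theta` is trained independently against its own 0/1 target vector, and the only place where the classes interact is the final `argmax`.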

3. The Dataset

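The MNIST dataset comes from the National Institute of Standards and Technology (NIST) in the United States. The training set consists of digits handwritten by 250 different people, 50% of them high school students and 50% employees of the Census Bureau. The test set contains handwritten digits in the same proportions. The training set has 60,000 images with their labels, the test set has 10,000 images with their labels, and every image is 28*28 pixels. Figure 1 gives an overview of the handwritten images in the dataset.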
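The MNIST image files use the IDX binary format: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The sketch below decodes such a buffer into a matrix with one flattened 784-pixel row per image. The function name `parse_idx_images` and the in-memory `fake_file` are hypothetical and stand in for a real file so that the example runs on its own.

```python
import struct

import numpy as np


def parse_idx_images(buf):
    """Decode an IDX3 image buffer into an array of shape (count, rows * cols)."""
    magic, count, rows, cols = struct.unpack_from(">IIII", buf, 0)
    if magic != 2051:  # 0x00000803: unsigned-byte data with 3 dimensions
        raise ValueError("not an IDX3 image file")
    header_size = struct.calcsize(">IIII")  # 16 bytes
    pixels = np.frombuffer(buf, dtype=np.uint8,
                           count=count * rows * cols, offset=header_size)
    return pixels.reshape(count, rows * cols)


# Hypothetical stand-in for a real file: two 28 x 28 images,
# the first with every pixel 0 and the second with every pixel 255.
fake_file = (struct.pack(">IIII", 2051, 2, 28, 28)
             + bytes([0] * 784) + bytes([255] * 784))

images = parse_idx_images(fake_file)
print(images.shape)
```

Reading all pixels with a single `np.frombuffer` call avoids a Python-level loop over the images, which matters when the file holds tens of thousands of them.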

4. Code and Results of Logistic Regression

Code:

import struct
import time

import numpy as np
from scipy.special import expit  # numerically stable sigmoid

# Read an IDX image file
def read_image(file_name):
    # Read the whole file into memory in binary mode
    with open(file_name, "rb") as file_handle:
        file_content = file_handle.read()
    offset = 0
    # The header is four big-endian unsigned ints: magic number, image count, rows, cols
    head = struct.unpack_from('>IIII', file_content, offset)
    offset += struct.calcsize('>IIII')
    imgNum = head[1]  # number of images
    rows = head[2]    # image height (number of pixel rows)
    cols = head[3]    # image width (number of pixel columns)

    image_size = rows * cols  # pixels per image (784 for MNIST)
    # np.empty allocates the array without initializing it; every row is overwritten below
    images = np.empty((imgNum, image_size))
    fmt = '>' + str(image_size) + 'B'  # struct format of a single image

    for i in range(imgNum):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        offset += struct.calcsize(fmt)
    return images

# Read an IDX label file
def read_label(file_name):
    with open(file_name, "rb") as file_handle:
        file_content = file_handle.read()

    # The header is two big-endian unsigned ints: magic number, label count
    head = struct.unpack_from('>II', file_content, 0)
    offset = struct.calcsize('>II')

    labelNum = head[1]  # number of labels
    # print(labelNum)
    bitsString = '>' + str(labelNum) + 'B'  # e.g. '>60000B' for the training labels
    label = struct.unpack_from(bitsString, file_content, offset)  # returns a tuple
    return np.array(label)

def loadDataSet():
    # The four uncompressed MNIST files are expected in the working directory
    train_x_filename = "train-images-idx3-ubyte"
    train_y_filename = "train-labels-idx1-ubyte"
    test_x_filename = "t10k-images-idx3-ubyte"
    test_y_filename = "t10k-labels-idx1-ubyte"
    train_x = read_image(train_x_filename)
    train_y = read_label(train_y_filename)
    test_x = read_image(test_x_filename)
    test_y = read_label(test_y_filename)

    # To speed things up while debugging, shrink the dataset first
    # train_x=train_x[0:1000,:]
    # train_y=train_y[0:1000]
    # test_x=test_x[0:500,:]
    # test_y=test_y[0:500]

    return train_x, test_x, train_y, test_y

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

# inX plays the role of one test sample; the sigmoid of the weighted feature sum
# is thresholded at 0.5 to give a binary decision
def classifyVector(inX, weights):
    prob = sigmoid(np.sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
# train_model(train_x, train_y, theta, learning_rate, iteration, numClass)
def train_model(train_x, train_y, theta, learning_rate, iterationNum, numClass):  # theta is an (n+1) x numClass matrix
    m = train_x.shape[0]
    n = train_x.shape[1]
    train_x = np.insert(train_x, 0, values=1, axis=1)  # prepend a bias column of ones
    J_theta = np.zeros((iterationNum, numClass))  # cost per iteration and class (recorded, not returned)

    for k in range(numClass):
        # print(k)
        real_y = np.zeros((m, 1))
        index = train_y == k  # boolean mask of the samples whose label equals k
        real_y[index] = 1     # one-vs-rest target: 1 for class k, 0 for every other class

        for j in range(iterationNum):
            # print(j)
            temp_theta = theta[:, k].reshape((n + 1, 1))
            # h_theta is an m x 1 column vector of predicted probabilities
            h_theta = expit(np.dot(train_x, temp_theta)).reshape((m, 1))
            # Cross-entropy cost; the probabilities are clipped so that log(0) cannot occur
            h_clip = np.clip(h_theta, 1e-12, 1 - 1e-12)
            J_theta[j, k] = -np.sum(real_y * np.log(h_clip) + (1 - real_y) * np.log(1 - h_clip)) / m
            # Batch gradient step on the log-likelihood
            temp_theta = temp_theta + learning_rate * np.dot(train_x.T, (real_y - h_theta))
            theta[:, k] = temp_theta.reshape((n + 1,))

    return theta  # (n+1) x numClass matrix, one weight column per class

def predict(test_x, test_y, theta, numClass):  # theta is the learned (n+1) x numClass weight matrix
    test_x = np.insert(test_x, 0, values=1, axis=1)  # prepend the bias column
    m = test_x.shape[0]

    h_theta = expit(np.dot(test_x, theta))  # m x numClass matrix of per-class outputs
    h_theta_max_postion = h_theta.argmax(axis=1)  # predicted label: column index of each row's maximum
    errorCount = int(np.sum(test_y != h_theta_max_postion))  # number of misclassified samples

    error_rate = float(errorCount) / m
    print("error_rate", error_rate)
    return error_rate

def mulitPredict(test_x, test_y, theta, numClass):
    numPredict = 10
    errorSum = 0
    for k in range(numPredict):
        errorSum += predict(test_x, test_y, theta, numClass)
    print("after %d prediction runs the average error rate is:%f" % (numPredict, errorSum / float(numPredict)))

if __name__ == '__main__':
    print("Start reading data...")
    time1 = time.time()
    train_x, test_x, train_y, test_y = loadDataSet()
    time2 = time.time()
    print("read data cost", time2 - time1, "second")

    numClass = 10
    iteration = 1
    learning_rate = 0.001
    n = test_x.shape[1] + 1  # 784 pixels plus one bias term

    # There are numClass classifiers, so theta holds numClass column vectors of size n x 1.
    # Alternative: random initialization with np.random.rand(n, numClass)
    theta = np.zeros((n, numClass))

    print("Start training data...")
    theta_new = train_model(train_x, train_y, theta, learning_rate, iteration, numClass)
    time3 = time.time()
    print("train data cost", time3 - time2, "second")

    print("Start predicting data...")
    predict(test_x, test_y, theta_new, numClass)
    time4 = time.time()
    print("predict data cost", time4 - time3, "second")

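Results:

Experiment: classifying the MNIST dataset with logistic regression

The learning rate used in this experiment is 0.001. Table 2 shows how the classification error rate changes with the number of iterations.

Table 2. Classification error rate versus number of iterations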

Iterations	1	10	100	1000
Classification error rate	0.90	0.35	0.15	0.18

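As Table 2 shows, the classification error rate first drops sharply as the number of iterations increases (from 0.90 at 1 iteration to 0.15 at 100) and then rises slightly (to 0.18 at 1000).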
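A sweep like the one in Table 2 could be produced by recording the test error at chosen iteration counts during a single training run. The sketch below shows one way to do this on small synthetic data, so it does not reproduce the MNIST numbers. The names `error_rate` and `sweep`, the Gaussian-blob data, and the learning rate of 0.1 are assumptions for illustration. Note also that this sketch averages the gradient over the samples and updates all classifiers in one matrix operation, which differs from the listing above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def error_rate(theta, X, y):
    """Fraction of rows of X whose argmax class differs from the label in y."""
    return float(np.mean(np.argmax(sigmoid(X @ theta), axis=1) != y))


def sweep(X_train, y_train, X_test, y_test, num_class, learning_rate, checkpoints):
    """Train one-vs-rest classifiers and record the test error at each checkpoint."""
    m, n = X_train.shape
    theta = np.zeros((n, num_class))
    # m x num_class matrix of 0/1 targets, one column per classifier
    targets = (y_train[:, None] == np.arange(num_class)).astype(float)
    results = {}
    for it in range(1, max(checkpoints) + 1):
        h = sigmoid(X_train @ theta)                             # all classifiers at once
        theta += learning_rate * X_train.T @ (targets - h) / m   # averaged gradient step
        if it in checkpoints:
            results[it] = error_rate(theta, X_test, y_test)
    return results


# Hypothetical synthetic data: 3 Gaussian blobs in 2-D plus a bias column.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
y_all = rng.integers(0, 3, size=300)
X_all = np.hstack([np.ones((300, 1)), centers[y_all] + rng.normal(size=(300, 2))])

results = sweep(X_all[:200], y_all[:200], X_all[200:], y_all[200:],
                num_class=3, learning_rate=0.1, checkpoints=[1, 10, 100, 1000])
for it, err in results.items():
    print(it, err)
```

Recording the error at checkpoints inside one run avoids retraining from scratch for every iteration count in the table.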

This article was originally written in Chinese and published on aliyun.com.
