[Python] 如何證明兩組樣本有顯著性差異(t-test假設檢驗)

最後更新：2018-12-04 來源：互聯網

上載者：User

創建阿里雲帳戶，並獲得超過 40 款產品的免費試用版；而企業帳戶則可以享有總值 $1200 的免費試用版。立即註冊！

原文地址：http://blog.csdn.net/ariesjzj/article/details/7546930

現有兩組樣本資料，假如它們分別基於兩套不同的方法，或者測於不同的裝置，又或是出自兩個人之手，如何證明它們有或沒有顯著性差別呢？當然可以拿個Excel表把資料畫個圖，然後找一堆人來投票，看覺得差不多還是覺得差得多的人哪方票數高。但終歸這種做法有些主觀，不夠說明力。機率統計給出了一種更為客觀的基於統計的方法，這裡是一個Python的實現：

#!/usr/bin/python# Paired difference hypothesis testing - Student's t-test# Author:       Jin Zhuojun# History:      Create      Tue May  8 16:12:21 CST 2012import stringimport mathimport sysfrom scipy.stats import  timport matplotlib.pyplot as pltimport numpy as np############### Parameters ###############ver = 1verbose = 0alpha = 0.05def usage():    print """    usage: ./program data_file(one sample in one line)    """def main():    ##########    # Sample #    ##########    if (len(sys.argv) < 2):        usage()        sys.exit()    # f = open('./data.txt')    # f = open('./test.txt')    f = open(sys.argv[1])    try:        sample1_text = f.readline()        sample2_text = f.readline()    finally:        f.close()    if (verbose):        print("sample1 text: ", sample1_text)        print("sample2 text: ", sample2_text)    sample1_text_split = sample1_text.split()    sample2_text_split = sample2_text.split()    if (verbose):        print(sample1_text_split)        print(sample2_text_split)        print("len = ", len(sample1_text_split))    assert(len(sample1_text_split) == len(sample2_text_split))    sample_len = len(sample1_text_split)    sample1 = []    sample2 = []    for i in range(sample_len):        sample1.append(string.atof(sample1_text_split[i]))        sample2.append(string.atof(sample2_text_split[i]))    sample_diff = []    for i in range(sample_len):        sample_diff.append(sample1[i] - sample2[i])    if (verbose):        print("sample_diff = ", sample_diff)    ######################    # Hypothesis testing #    ######################    sample = sample_diff    numargs = t.numargs    [ df ] = [sample_len - 1,] * numargs    if (verbose):        print("df(degree of freedom, student's t distribution parameter) = ", df)    sample_mean = np.mean(sample)    sample_std = np.std(sample, dtype=np.float64, ddof=1)    if (verbose):        print("mean = %f, std = %f" % (sample_mean, sample_std))    abs_t = math.fabs( sample_mean / (sample_std / math.sqrt(sample_len)) )    if (verbose):        print("t = ", abs_t)    t_alpha_percentile = t.ppf(1 - alpha / 2, df)    if (verbose):        print("abs_t = ", abs_t)        print("t_alpha_percentile = ", t_alpha_percentile)    if (abs_t >= t_alpha_percentile):        print "REJECT the null hypothesis"    else:        print "ACCEPT the null hypothesis"    ########    # Plot #    ########    rv = t(df)    limit = np.minimum(rv.dist.b, 5)    x = np.linspace(-1 * limit, limit)    h = plt.plot(x, rv.pdf(x))    plt.xlabel('x')    plt.ylabel('t(x)')    plt.title('Difference significance test')    plt.grid(True)    plt.axvline(x = t_alpha_percentile, ymin = 0, ymax = 0.095,             linewidth=2, color='r')    plt.axvline(x = abs_t, ymin = 0, ymax = 0.6,             linewidth=2, color='g')    plt.annotate(r'(1 - $\alpha$ / 2) percentile', xy = (t_alpha_percentile, 0.05),            xytext=(t_alpha_percentile + 0.5, 0.09), arrowprops=dict(facecolor = 'black', shrink = 0.05),)    plt.annotate('t value', xy = (abs_t, 0.26),            xytext=(abs_t + 0.5, 0.30), arrowprops=dict(facecolor = 'black', shrink = 0.05),)    leg = plt.legend(('Student\'s t distribution', r'(1 - $\alpha$ / 2) percentile', 't value'),             'upper left', shadow = True)    frame = leg.get_frame()    frame.set_facecolor('0.80')    for i in leg.get_texts():        i.set_fontsize('small')    for l in leg.get_lines():        l.set_linewidth(1.5)    normalized_sample = [0] * sample_len     for i in range(0, sample_len):        normalized_sample[i] = (sample[i] - sample_mean) / (sample_std / math.sqrt(sample_len))    plt.plot(normalized_sample, [0] * len(normalized_sample), 'ro')    plt.show()if __name__ == "__main__":    main()

註：因為matplotlib顯示時座標比例老是不對，所以畫圖的時候把座標寫死了。

把上面的指令碼存為main.py，再如資料檔案為data.txt（每行表示一組樣本）:

0.26 0.30 0.40 0.49 0.43 0.84 1.18 0.25 0.67
0.18 0.33 0.57 0.32 0.58 0.29 0.98 0.17 0.79

運行：

$ ./main.py data.txt

得到

這裡假設null hypothesis為兩組資料無顯著性差異，alternative hypothesis為有顯著性差異。，t value落在了拒絕域外，因此我們拒絕alternative hypothesis，接受null hypothesis，即兩組資料無顯著性差異。

本文章原先以中文撰寫並發佈於 aliyun.com，亦設英文版本，僅作資訊用途。本網站不對文章的準確性，完整性或可靠性或其任何翻譯作出任何明示或暗示的陳述或保證。如對該文章有任何疑慮或投訴，請傳送電郵至 info-contact@alibabacloud.com 並提供相關疑慮或投訴的詳細說明。職員會於 5 個工作天內與您聯絡，一經驗證之後，即會刪除該侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More