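I. Random sampling in SAS:
1. Practical data processing often calls for drawing a sample. Two situations are most common:
(1) Simple random sampling without replacement
(2) Stratified sampling: a. proportional stratified sampling; b. disproportionate stratified sampling.
2. In SAS, PROC SURVEYSELECT can carry out all of these sampling schemes.
Its general form is: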
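PROC SURVEYSELECT data=<source data set> method=<srs | urs | sys> out=<data set that receives the sample> n=<sample size> (or samprate=<sampling rate>) seed=<n>;
strata <stratification variables>;
id <variables to carry over from the source data set into the sample>;
run;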
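Notes: METHOD= specifies the random sampling method. SRS is simple random sampling without replacement (Simple Random Sampling); URS is simple random sampling with replacement (Unrestricted Random Sampling); SYS is systematic sampling (Systematic Sampling). SEED= specifies the random seed, a non-negative integer. With 0, a different sample is drawn on every run; with an integer greater than 0, rerunning with the same value reproduces the same sample. ID names the variables to copy from the source data set into the sample data set; if it is omitted, all variables are copied.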
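As a concrete instance of the general form, the sketch below draws 5 of the 19 rows of SASHELP.CLASS, a sample data set shipped with SAS. The output name class_sample and the seed value are arbitrary choices made for illustration.

```sas
/* Simple random sample without replacement: 5 rows, reproducible via SEED= */
proc surveyselect data=sashelp.class out=class_sample
                  method=srs n=5 seed=12345;
  id name sex age;   /* keep only these source variables in the sample */
run;
```

Rerunning with the same seed should return the same five rows; changing method=srs to method=urs would allow a row to be drawn more than once.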
3. Example of simple random sampling without replacement:
/* Draw a 30% sample from the test1 data set and write it to the results1 data set */
proc surveyselect data=test1 out=results1 method=srs samprate=0.3;
run;
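4. Example of proportional stratified random sampling:
proc sort data=test2;
by strata_var; /* strata_var is a placeholder for the stratification variable */
run; /* first sort the population by the stratification variable */
proc surveyselect data=test2 out=results2 method=srs samprate=0.1;
strata strata_var;
run; /* draw the same proportion from every stratum */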
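5. Examples of disproportionate stratified sampling:
(1) Setting the sampling rates or sample sizes by hand
proc sort data=test3;
by strata_var; /* strata_var is a placeholder for the stratification variable */
run; /* first sort the population by the stratification variable */
proc surveyselect data=test3 out=results3 method=srs
samprate=(0.1,0.3,0.5,0.2); /* sampling rate for each stratum, listed in sorted stratum order */
strata strata_var;
run; /* draw a different proportion from each stratum */
proc surveyselect data=test3 out=results3 method=srs
n=(30,20,50,40); /* sample size for each stratum, listed in sorted stratum order */
strata strata_var;
run;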
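(2) Disproportionate sampling driven by a sampling table
proc sort data=test3;
by strata_var; /* strata_var is a placeholder for the stratification variable */
run; /* first sort the population by the stratification variable */
proc surveyselect data=test3 out=results3 method=SRS
samprate=samp_table; /* Sample according to a secondary data set. samp_table must contain the stratification variable plus the sampling rate or sample size of each stratum. For sampling by rate, the rate variable must be named _rate_; for sampling by size (n=samp_table), the size variable must be named _nsize_. */
strata strata_var;
run;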
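The sampling table itself is an ordinary data set. The sketch below builds one by hand, assuming a hypothetical character stratification variable named region with four levels; the level names, rates and seed are invented for illustration.

```sas
/* Hypothetical sampling table: one row per stratum, in the same sort order as the input */
data samp_table;
  input region $ _rate_;
  datalines;
East 0.1
North 0.3
South 0.5
West 0.2
;
run;

proc sort data=test3;
  by region;
run;

/* Each stratum is sampled at the rate found in its samp_table row */
proc surveyselect data=test3 out=results3 method=srs
                  samprate=samp_table seed=2024;
  strata region;
run;
```

To sample by count instead, the same table would carry a variable named _nsize_ and be passed as n=samp_table.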
6. For more on PROC SURVEYSELECT, see SAS Help:
type help surveyselect in the command bar and press Enter.
II. Program for splitting the data into a training set and a test (validation) set:
1. Split with a DATA step:
data train(drop=u) validate(drop=u);
set develop1;
u=ranuni(27513);
if u<=.67 then output train;
else output validate;
run;
A DATA step is used to split DEVELOP1 into TRAIN and VALIDATE. The
variable U is created using the RANUNI function, which generates pseudo-random numbers from a uniform distribution on the interval (0,1). The
RANUNI function argument is an initialization seed. Using a particular
number, greater than zero, will produce the same split each time the DATA
step is run. If the seed was zero, then the data would be split differently each
time the DATA step was run. The IF statement puts approximately 67% of
the data into TRAIN and approximately 33% of the data into VALIDATE
because on average 33% of the values of a uniform random variable are
greater than 0.67.
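Because the split is random, the realised share only approximates 67%. A quick check, sketched below, reads the row counts of both data sets from their headers and writes the training share to the log; the variable names n_train, n_valid and share are illustrative.

```sas
data _null_;
  /* "if 0 then set" never executes, but NOBS= is filled in at compile time */
  if 0 then set train    nobs=n_train;
  if 0 then set validate nobs=n_valid;
  share = n_train / (n_train + n_valid);
  put n_train= n_valid= share= 5.3;
  stop;
run;
```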
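III. Using the CLASS statement in PROC LOGISTIC:
The CLASS statement creates the design (dummy) variables for categorical predictors.
- For a two-level numeric variable, the model fit and the significance test are the same whether or not it is declared in CLASS; only the coding of its coefficient can differ.
- For a variable with more than two levels: if it is numeric and not declared in CLASS, SAS treats it as continuous. That is meaningful for an ordinal variable (age band, salary grade and so on, coded 1, 2, 3, 4, ...), but meaningless for a nominal variable. If it is a character variable, SAS reports an error.
- A nominal variable with more than two levels must therefore be declared in CLASS; otherwise it is either silently treated as continuous (numeric) or causes an error (character).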
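With param=ref, SAS uses reference coding for CLASS variables;
that is, the design matrix of a three-level CLASS variable (printed in the "Class Level Information" table) has the form
[1 0
0 1
0 0]
The parameter of the reference level is fixed at 0, which makes odds ratios easy to obtain: the odds ratio of a level versus the reference level = exp(parameter estimate of that level).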
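- If REF= is not specified, SAS takes the last (highest) level as the reference; for example, sex coded 0/1 uses 1 as the reference level. If PARAM= is not specified, the SAS default is param=effect,
for which the design matrix has the form
[1 0
0 1
-1 -1]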
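- REF= chooses which level serves as the reference:
ref=first, ref=last, or ref="value of a level" makes the first level, the last level, or a named level the reference group;
param=ref is geared to parameter estimation and makes the odds ratio easy to compute;
param=effect is geared to hypothesis testing and is convenient when analysing interactions.
In short, the two settings differ only in how the levels are coded. The individual parameter estimates change meaning, but the fitted model (likelihood, predicted probabilities, overall tests) is the same.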
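A minimal sketch of these options in use follows. The data set mydata, the response y and the predictors agegrp (numeric codes 1 to 4) and sex are hypothetical names chosen for illustration.

```sas
proc logistic data=mydata descending;
  /* reference coding: agegrp level 1 and the first level of sex are the baselines */
  class agegrp (ref='1') sex (ref=first) / param=ref;
  model y = agegrp sex;
run;
```

Under this coding, exponentiating the estimate for agegrp level 3 would give the odds ratio of level 3 versus level 1. Replacing param=ref with param=effect should leave the fit statistics unchanged while the individual estimates become deviations from the average effect.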
IV. Logistic regression in SAS
1. Scoring:
The SCORE procedure multiplies values from two SAS data sets, one
containing coefficients (SCORE=) and the other containing the data to be
scored (DATA=). The data set to be scored typically would not have a target
variable. The OUT= option specifies the name of the scored data set created
by PROC SCORE. The TYPE=PARMS option is required for scoring
regression models.
proc score data=read.new out=scored score=betas1
type=parms;
var dda ddabal dep depamt cashbk checks;
run;
Data can also be scored directly in PROC LOGISTIC using the OUTPUT
statement. This has several disadvantages over using PROC SCORE: it does
not scale well with large data sets, it requires a target variable (or some
proxy), and the adjustments for oversampling, discussed in a later section,
are not automatically applied.
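PROC SCORE returns the linear predictor (the logit), not a probability. Assuming the score variable is named after the response, ins, which is how an OUTEST= data set from PROC LOGISTIC usually labels its coefficients, a follow-up DATA step along these lines converts it. The variable name p is illustrative.

```sas
data scored;
  set scored;
  p = 1 / (1 + exp(-ins));   /* inverse logit of the linear predictor */
run;
```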
2. Imputing missing values:
The STDIZE procedure with the REPONLY option can be used to replace missing values. The METHOD= option allows you to choose several different location measures such as the mean, median, and midrange. The output data set created by the OUT= option contains all the variables in the input data set where the variables listed in the VAR statement are imputed. Only numeric input variables should be used in PROC STDIZE.
proc stdize data=develop1 reponly method=median out=imputed;
var &inputs;
run;
proc print data=imputed(obs=20);
var ccbal miccbal ccpurc miccpurc income miincome
hmown mihmown;
run;
PROC STANDARD with the REPLACE option can be used to replace missing values with the mean of that variable on the non-missing cases.
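A sketch of the PROC STANDARD alternative, reusing the develop1 data set and the &inputs macro variable from the PROC STDIZE example; the output name imputed_mean is an arbitrary choice.

```sas
/* REPLACE without MEAN= or STD= leaves non-missing values untouched
   and fills each missing value with that variable's mean */
proc standard data=develop1 replace out=imputed_mean;
  var &inputs;
run;
```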
3. Model validation and assessment (ROC and validation with PROC LOGISTIC):
The INEST= option on the PROC LOGISTIC statement names the data set that contains initial parameter estimates for starting the iterative ML estimation algorithm. The MAXITER= option in the MODEL statement specifies the maximum number of iterations to perform. The combination of DATA=validation data, INEST= final estimates from training data , and MAXITER=0 causes PROC LOGISTIC to score, not refit, the validation data. The OFFSET= option is also needed since the offset variable was used when creating the final parameter estimates from the training data set.
The OUTROC= option creates an output data set with sensitivity (_SENSIT_) and one minus specificity (_1MSPEC_) calculated for a full range of cutoff probabilities (_PROB_). The other statistics in the OUTROC= data set are not useful when the data is oversampled. The two variables _SENSIT_ and _1MSPEC_ in the OUTROC= data set are correct whether or not the validation data is oversampled. The variable _PROB_ is correct, provided the INEST= parameter estimates were corrected for oversampling using sampling weights. If they were not corrected or if they were corrected with an offset, then _PROB_ needs to be adjusted using the formula (shown in Section 2.2).
proc logistic data=validate des inest=betas;
model ins=&selected / maxiter=0 outroc=roc offset=off;
run;
However, this model should be assessed using the validation data set because the inclusion of many higher order terms may increase the risk of overfitting.
proc logistic data=train1 des outest=betas;
model ins=miphone CHECKS MM CD brclus1 DDABAL TELLER
SAVBAL CASHBK brclus3 ACCTAGE SAV DDA ATMAMT
PHONE INV ATM savbal*savbal ddabal*ddabal
ddabal*savbal atmamt*atmamt
savbal*dda brclus1*atmamt mm*savbal
acctage*acctage miphone*brclus1
checks*ddabal ddabal*phone ddabal*brclus3
mm*phone sav*dda mm*dda
cashbk*acctage;
run;
proc logistic data=validate des inest=betas;
model ins=miphone CHECKS MM CD brclus1 DDABAL TELLER
SAVBAL CASHBK brclus3 ACCTAGE SAV DDA ATMAMT
PHONE INV ATM savbal*savbal ddabal*ddabal
ddabal*savbal atmamt*atmamt
savbal*dda brclus1*atmamt mm*savbal
acctage*acctage miphone*brclus1
checks*ddabal ddabal*phone ddabal*brclus3
mm*phone sav*dda mm*dda
cashbk*acctage / maxiter=0 ;
run;
4. Cross-validation
(1) K-fold
%let k=5;
data xx10f;
  do replicate=1 to &k;
    do rec=1 to numrecs;
      set mylib.stu nobs=numrecs point=rec;
      /* fold of this record: contiguous blocks of floor(numrecs/&k) records; leftover records join fold &k */
      fold=min(ceil(rec/floor(numrecs/&k)), &k);
      if replicate ^= fold then do; /* training record for this replicate */
        new_y=y;
        selected=1;
      end;
      else do; /* held-out record: mask the response */
        new_y=.;
        selected=0;
      end;
      output;
    end;
  end;
  stop;
run;
(2)LOOCV
data xx;
do replicate=1 to numrecs ;
do rec=1 to numrecs;
set mylib.stu nobs=numrecs point=rec;
/* if replicate ^= rec then output;*/
if replicate ^= rec then new_y=y;
else new_y=.;
output;
end;
end;
stop;
run;
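One way to use the stacked data set is sketched below: fit the model once per replicate and keep only the prediction for the record whose response was masked. The predictors x1 and x2 and the output names cv_pred, loocv and p_hat are placeholders, and the sketch assumes y is a binary response with no missing values in the source data.

```sas
proc logistic data=xx descending noprint;
  by replicate;                   /* xx is already ordered by replicate */
  model new_y = x1 x2;            /* x1, x2: placeholder predictors */
  output out=cv_pred p=p_hat;     /* p_hat is also computed for the masked record */
run;

data loocv;                       /* one held-out prediction per replicate */
  set cv_pred;
  where new_y is missing;
run;
```

Records with a missing response are left out of the fit but still receive a predicted probability, which is what makes the masking trick work.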
(3) Bootstrapping
%let k=3;
%let rate=%sysevalf((&k-1)/&k);
proc surveyselect data=temp1 out=xv seed=7589747 method=urs
samprate=&rate outall rep=&k;
run;
data xv;
set xv;
if selected then new_y=y;
run;
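V. The log window keeps reporting that it is full and interrupts the run. What can be done?
(1) options nonotes; stops SAS from writing notes to the log.
Alternatively, use proc printto; to send the log to an external file:
*** point log to an external file.;
proc printto log="c:\test.txt";
run;
/* --- your program --- */
*** Point the log to its default destination;
proc printto;run;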
(2) Settings that avoid generating a log:
For example:
proc printto log=_null_;
run;
proc print data=sashelp.class;
run;
%put 'nothing is shown' _all_;
proc printto log=log;
run;
proc print data=sashelp.class;
run;
%put _all_;
(3) To produce no log, run:
- options nosource nonotes errors=0;
Suppressing the log can speed up execution.
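VI. Using the ROC curve to find a reasonable cut-off value:
The ROC curve combines sensitivity and specificity, so it can be used to find the cut-off at which the two are jointly best. Two approaches are common: the first computes (sensitivity + specificity - 1), the Youden index, at every candidate cut-off and takes the point where it is largest; the second picks the point on the ROC plot closest to the upper-left corner. The two approaches usually, though not always, pick the same point.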
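Both criteria can be computed from an OUTROC= data set. The sketch below assumes the data set roc created by the outroc=roc option in section IV; the names roc_cut, youden and dist are illustrative.

```sas
data roc_cut;
  set roc;
  /* sensitivity + specificity - 1 = _SENSIT_ - _1MSPEC_ */
  youden = _sensit_ - _1mspec_;
  /* distance from the upper-left corner (0,1) of the ROC plot */
  dist = sqrt((1 - _sensit_)**2 + _1mspec_**2);
run;

proc sort data=roc_cut;
  by descending youden;
run;

/* first row = cut-off probability with the largest Youden index */
proc print data=roc_cut(obs=1);
  var _prob_ _sensit_ _1mspec_ youden dist;
run;
```

Sorting by dist in ascending order instead would give the point closest to the corner, so the two criteria can be compared directly.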
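VII. Remedies when the maximum likelihood estimate may not exist:
(1) The EXACT statement performs exact tests for the specified variables. With few observations and a small model, it can be used when the results are unstable (CLASS variables need param=ref); list the variables that need exact tests. With a large sample or a complex model, the statement runs out of memory.
(2) The STRATA statement was added after SAS 9.0 specifically for logistic regression on matched designs. It lets PROC LOGISTIC analyse 1:1, 1:m, m:n and other matched data. The STRATA statement names the matching-set variable; in a case-crossover study every individual is a matched set, so the individual ID is the variable to name in STRATA.
(3) From SAS 9.2 onward, the FIRTH option can be used to address this problem.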
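A minimal sketch of the FIRTH option, which belongs on the MODEL statement. The data set train and the variables ins, dda and ddabal are borrowed from the earlier sections purely for illustration.

```sas
proc logistic data=train descending;
  /* Firth's penalized likelihood keeps the estimates finite under separation */
  model ins = dda ddabal / firth;
run;
```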
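VIII. Proportional stratified sampling:
The STRATA statement in PROC SURVEYSELECT stratifies the sampling by a variable, so that the proportion of 0s and 1s is the same in every group of the resulting CV data set.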
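The sketch below stratifies on a binary response. It reuses the temp1 data set and the response y from the cross-validation section; the output names temp1s and xv_strat, the 80% rate and the five replicates are assumptions made for illustration.

```sas
proc sort data=temp1 out=temp1s;
  by y;                      /* STRATA requires the input sorted by the stratum variable */
run;

proc surveyselect data=temp1s out=xv_strat seed=7589747
                  method=srs samprate=0.8 outall rep=5;
  strata y;                  /* sample 80% of the 0s and 80% of the 1s in each replicate */
run;
```

With OUTALL, every replicate keeps all rows and flags the sampled ones with Selected=1, so the unselected rows can serve as the hold-out set.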
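IX. In the PROC LOGISTIC output, the "Testing Global Null Hypothesis: BETA=0" table is the overall test of the model.
A single-predictor (univariate) logistic regression agrees with the chi-square test: for one categorical predictor, the score test in this table reproduces the Pearson chi-square statistic. Some papers run the univariate analysis with logistic regression and others with the chi-square test, but the conclusion is the same.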
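X. When the rank-sum test applies
When two samples come from two independent populations that are non-normal or of unknown shape, the difference between the samples should not be tested with the parametric t test; use the rank-sum test instead.
Rank-sum test
Conditions of use
① The form of the population distribution is unknown or its type is unclear;
② Skewed data;
③ Ordinal data: values that cannot be measured precisely and are expressed only as severity, quality grade, order of occurrence and so on;
④ Data that do not meet the assumptions of parametric tests, e.g. clearly unequal group variances;
⑤ Data whose values at one or both ends are indeterminate, such as ">50 mg".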
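In SAS the two-sample rank-sum test is available through PROC NPAR1WAY. The sketch below assumes a hypothetical data set trial with a two-level grouping variable group and a response score.

```sas
proc npar1way data=trial wilcoxon;
  class group;   /* the two independent samples */
  var score;     /* the response to be ranked */
run;
```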
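XI. The MODEL statement options AGGREGATE, SCALE= and RSQUARE in PROC LOGISTIC (source: 《醫學案例統計分析與SAS應用》, pp. 172-178)
model y=chage rs2 rs3 lc mr / aggregate scale=none rsquare;
/* AGGREGATE and SCALE= print the Pearson chi-square and Deviance statistics used to assess goodness of fit; RSQUARE prints the generalized R2. */
If the p-values of both the Deviance and the Pearson chi-square are low, the model fits inadequately. If both statistics divided by their degrees of freedom (the Value/DF column) are clearly greater than 1, overdispersion may be present. In other words, a good fit shows Value/DF close to 1 and large p-values.
If Value/DF for the Pearson chi-square and the Deviance is still greater than 1 after the non-significant variables are dropped, overdispersion is indicated. Either statistic can be used to adjust for it. Here the Pearson chi-square is used: changing the option to scale=pearson multiplies the covariance matrix in the results by the heterogeneity factor (the Pearson chi-square divided by its degrees of freedom).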
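Written out in full, the adjusted call might look like the sketch below. The data set name mydata is a placeholder; the response and predictors are those of the MODEL statement above.

```sas
proc logistic data=mydata descending;
  /* SCALE=PEARSON rescales the covariance matrix by Pearson chi-square / DF */
  model y = chage rs2 rs3 lc mr / aggregate scale=pearson;
run;
```

The parameter estimates should be unchanged relative to scale=none; only the standard errors and the tests based on them are inflated.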
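XII. Why the ROC curve of the finally selected model is missing from the output:
The OUTROC= option conflicts with the AGGREGATE SCALE= options: the latter alter the data matrix, so the ROC curve of the selected model can no longer be drawn from the output data.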