1. Random Sampling in SAS:
1. Random sampling is often required in actual data processing. There are two main situations in practice:
(1) Simple random sampling without replacement
(2) Stratified sampling: a. proportional stratified sampling; b. disproportional (unequal-rate) stratified sampling
2. The proc surveyselect procedure can be used in SAS to perform various kinds of sampling:
The general form is:
proc surveyselect data=<source data set> method=<srs|urs|sys> out=<data set for the extracted sample> n=<number of samples> (or samprate=<sampling rate>) seed=<n>;
strata <stratification variable(s)>;
id <source data set variables to keep in the extracted sample>;
run;
Note: method= specifies the random sampling method: srs is simple random sampling, urs is unrestricted random sampling (sampling with replacement), and sys is systematic sampling. seed= specifies the random seed, a non-negative integer: if it is 0, a different sample is drawn on each run; if it is a positive integer, entering the same value on a later run reproduces the same sample. The id statement lists the variables to copy from the source data set into the sample data set; if it is omitted, all variables are copied. A short sketch using seed= and id follows the example below.
3. Example of simple random sampling without replacement:
/* Draw a 30% sample from the test1 data set and output it to the results1 data set */
proc surveyselect data=test1 out=results1 method=srs samprate=0.3;
run;
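To make the sample reproducible, or to keep only some of the source variables, add seed= and an id statement. A minimal sketch (the output data set results1b and the kept variables id_var, x1, x2 are hypothetical):
proc surveyselect data=test1 out=results1b method=srs samprate=0.3 seed=12345;
id id_var x1 x2; /* keep only these source variables in the sample */
run;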
4. Example of proportional stratified random sampling:
proc sort data=test2;
by stratification_variable;
run; /* sort the data by the stratification variable */
proc surveyselect data=test2 out=results2 method=srs samprate=0.1;
strata stratification_variable;
run; /* draw 10% from each stratum defined by the stratification variable */
5. Examples of disproportional (unequal-rate) stratified sampling:
(1) Manually set the sampling rate or sample size for each stratum
proc sort data=test3;
by stratification_variable;
run; /* sort the data by the stratification variable */
proc surveyselect data=test3 out=results3 method=srs
samprate=(0.1, 0.3, 0.5, 0.2); /* sampling rate for each stratum, in stratum order */
strata stratification_variable;
run; /* draw the sample from each stratum at its own rate */
proc surveyselect data=test3 out=results3 method=srs
n=(30, 20, 50, 40); /* number of samples to draw from each stratum, in stratum order */
strata stratification_variable;
run;
(2) Supply the sampling rates or sizes through a secondary data set
proc sort data=test3;
by stratification_variable;
run; /* sort the data by the stratification variable */
proc surveyselect data=test3 out=results3 method=srs
samprate=samp_table; /* sampling rates supplied by a data set: samp_table must contain the stratification variable(s) plus the rate or size for each stratum; for rate-based sampling the rate variable must be named _rate_, and for size-based sampling the size variable must be named _nsize_ */
strata stratification_variable;
run;
6. For more information about the surveyselect procedure, see the SAS Help:
enter help surveyselect in the command bar and press Enter.
2. A program that splits data into a training set and a validation (test) set:
data train(drop=u) validate(drop=u);
set develop1;
u = ranuni(27513);
if u <= .67 then output train;
else output validate;
run;
A DATA step is used to split develop1 into train and validate. The variable u is created with the ranuni function, which generates pseudo-random numbers from a uniform distribution on the interval (0, 1). The argument of the ranuni function is an initialization seed. Using a particular number greater than zero produces the same split each time the DATA step is run; if the seed were zero, the data would be split differently each time. The IF statement puts approximately 67% of the data into train and approximately 33% into validate, because on average 33% of the values of a uniform random variable are greater than 0.67.
3. Use of the class statement in proc logistic:
The class statement creates dummy (design) variables for categorical predictors.
- For a binary categorical variable, the coding scheme makes no difference to the classification result;
- For multi-level variables: if they are numeric and are not declared in the class statement, SAS treats them as continuous (when the variable is ordinal, such as age group or wage level coded 1, 2, 3, 4, this can make sense; for an unordered categorical variable it does not). If the variable is not numeric and is not declared, SAS reports an error;
- To handle unordered multi-level variables, you must declare them as classification variables in the class statement;
- With param=ref, the design variables for the class levels (three-level example) take the form
[1 0
0 1
0 0]
The parameter for the reference level is set to 0, which makes odds ratios easy to obtain: the odds ratio of a given level vs. the reference level equals exp(that level's parameter estimate);
- If ref= is not specified, SAS uses the highest level as the reference level (for example, for sex it uses 1 as the reference). If param= is not specified, SAS defaults to param=effect, whose design variables take the form
[ 1  0
  0  1
 -1 -1]
- ref= chooses the reference level: ref=first, ref=last, or ref='a specific level value' uses the first level, the last level, or the specified level as the reference group;
- param=ref emphasizes parameter estimation and makes odds ratios easier to compute;
- param=effect emphasizes hypothesis testing and is convenient for analyzing interactions.
In general, the two settings are just different coding schemes chosen to make the subsequent estimates or tests more convenient; the final fitted results are the same.
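A minimal sketch of these options (the data set mydata and the variables y, sex, edu, income are hypothetical):
proc logistic data=mydata des;
class sex(ref='0') edu(ref=first) / param=ref;
model y = sex edu income;
run;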
4. Logistic regression in SAS
1. Calculate the score:
The score procedure multiplies values from two SAS data sets, one containing coefficients (score=) and the other containing the data to be scored (data=). The data set to be scored typically would not have a target variable. The out= option specifies the name of the scored data set created by proc score. The type=parms option is required for scoring regression models.
proc score data=read.new out=scored score=betas1 type=parms;
var dda ddabal dep depamt cashbk checks;
run;
Data can also be scored directly in proc logistic using the output statement. This has several disadvantages compared with using proc score: it does not scale well with large data sets, it requires a target variable (or some proxy), and the adjustments for oversampling, discussed in a later section, are not automatically applied.
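A minimal sketch of this approach, assuming the data to be scored is appended to the training data with the target set to missing (the names combined, scored2, and phat are illustrative):
data combined;
set train1 read.new(in=toscore);
if toscore then ins = .; /* the target variable must exist, even if missing */
run;
proc logistic data=combined des;
model ins = dda ddabal dep depamt cashbk checks;
output out=scored2 p=phat; /* predicted probability for every observation */
run;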
2. Fill in missing values:
The stdize procedure with the reponly option can be used to replace missing values. The method= option allows you to choose among several location measures, such as the mean, median, and midrange. The output data set created by the out= option contains all the variables in the input data set, with the variables listed in the var statement imputed. Only numeric input variables should be used in proc stdize.
proc stdize data=develop1 reponly method=median out=imputed;
var &inputs;
run;
proc print data=imputed(obs=20);
var ccbal miccbal ccpurc miccpurc income miincome
    hmown mihmown;
run;
Proc standard with the replace option can be used to replace missing values with the mean of that variable on the non-missing cases.
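A minimal sketch of this alternative (the output data set imputed2 is illustrative):
proc standard data=develop1 replace out=imputed2;
var &inputs; /* missing values are replaced by each variable's mean */
run;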
3. Model validation and evaluation (ROC and scoring the validation data with proc logistic):
The inest= option on the proc logistic statement names the data set that contains initial parameter estimates for starting the iterative ML estimation algorithm. The maxiter= option in the model statement specifies the maximum number of iterations to perform. The combination of data= the validation data, inest= the final estimates from the training data, and maxiter=0 causes proc logistic to score, not refit, the validation data. The offset= option is also needed, since the offset variable was used when creating the final parameter estimates from the training data set.
The outroc= option creates an output data set with sensitivity (_SENSIT_) and one minus specificity (_1MSPEC_) calculated for a full range of cutoff probabilities (_PROB_). The other statistics in the outroc= data set are not useful when the data is oversampled. The two variables _SENSIT_ and _1MSPEC_ in the outroc= data set are correct whether or not the validation data is oversampled. The variable _PROB_ is correct provided the inest= parameter estimates were corrected for oversampling using sampling weights. If they were not corrected, or if they were corrected with an offset, then _PROB_ needs to be adjusted using the formula (shown in section 2.2).
proc logistic data=validate des inest=betas;
model ins = &selected / maxiter=0 outroc=roc offset=off;
run;
However, this model should be assessed using the validation data set, because adding higher-order terms increases the risk of overfitting.
proc logistic data=train1 des outest=betas;
model ins = miphone checks mm cd brclus1 ddabal teller
            savbal cashbk brclus3 acctage sav dda atmamt
            phone inv atm savbal*savbal ddabal*ddabal
            ddabal*savbal atmamt*atmamt
            savbal*dda brclus1*atmamt mm*savbal
            acctage*acctage miphone*brclus1
            checks*ddabal*phone ddabal*brclus3
            mm*phone sav*dda mm*dda
            cashbk*acctage;
run;
proc logistic data=validate des inest=betas;
model ins = miphone checks mm cd brclus1 ddabal teller
            savbal cashbk brclus3 acctage sav dda atmamt
            phone inv atm savbal*savbal ddabal*ddabal
            ddabal*savbal atmamt*atmamt
            savbal*dda brclus1*atmamt mm*savbal
            acctage*acctage miphone*brclus1
            checks*ddabal*phone ddabal*brclus3
            mm*phone sav*dda mm*dda
            cashbk*acctage / maxiter=0;
run;
4. Cross-validation
(1) K-fold cross-validation
%let k = 5;
data xx10f;
  do replicate = 1 to &k;
    do rec = 1 to numrecs;
      set mylib.stu nobs=numrecs point=rec;
      %let m = floor(numrecs/&k); /* &m expands to this expression, the approximate fold size, in the IF below */
      /* if replicate ^= rec then output; */
      if replicate ^= ceil(rec/&m) then do; /* record lies outside the held-out fold */
        new_y = y;
        selected = 1;
      end;
      else do; /* record belongs to the held-out fold */
        new_y = .;
        selected = 0;
      end;
      output;
    end;
  end;
  stop;
run;
(2) LOOCV
data xx;
  do replicate = 1 to numrecs;
    do rec = 1 to numrecs;
      set mylib.stu nobs=numrecs point=rec;
      /* if replicate ^= rec then output; */
      if replicate ^= rec then new_y = y; /* keep y except for the single held-out record */
      else new_y = .;
      output;
    end;
  end;
  stop;
run;
(3) bootstrapping
%let k = 3;
%let rate = %sysevalf((&k-1)/&k);
proc surveyselect data=temp1 out=xv seed=7589747 method=urs
samprate=&rate outall reps=&k;
run;
data xv;
set xv;
if selected then new_y = y;
run;
5. When running a program, the Log window keeps prompting that it is full and execution is interrupted. What should I do?
(1) options nonotes; stops SAS from writing notes to the log.
You can also use proc printto to redirect the log to an external file:
* point the log to an external file;
proc printto log="C:\test.txt";
run;
* --- your program ---;
* point the log back to its default destination;
proc printto;
run;
(2) Suppressing the log entirely:
You can do:
proc printto log=_null_;
run;
proc print data=sashelp.class;
run;
%put 'nothing will show' _all_;
proc printto log=log;
run;
proc print data=sashelp.class;
run;
%put _all_;
(3) To generate no log at all, run:
options nosource nonotes errors=0;
Suppressing the log can also improve running speed.
6. Using the ROC curve to find a reasonable cutoff value:
The ROC curve combines sensitivity and specificity to find a cutoff value at which the two are jointly optimal. There are usually two approaches: one is to take, over all cutoffs, the point with the largest value of (sensitivity + specificity - 1), i.e. the Youden index; the other is to take the point on the ROC curve closest to the upper-left corner. The cutoffs found by the two methods are usually consistent.
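For example, the cutoff with the largest Youden index can be read from the outroc= data set created earlier (the roc data set from the proc logistic step above; the name youden is illustrative):
data youden;
set roc;
youden = _sensit_ - _1mspec_; /* = sensitivity + specificity - 1 */
run;
proc sort data=youden;
by descending youden;
run;
proc print data=youden(obs=1); /* cutoff probability with the largest Youden index */
var _prob_ _sensit_ _1mspec_ youden;
run;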
7. Existence of the maximum likelihood solution:
(1) The exact statement performs exact tests on the specified variables. It can be used when the sample size is small and the model is simple, so the usual estimates are unstable (classification variables must use param=ref); list in it the variables to be tested exactly. If the sample size is large and the model is complex, running this statement will report insufficient memory;
(2) The strata statement, added in SAS 9.0, is designed for logistic regression with matched designs. It lets proc logistic analyze 1:1, 1:M, M:N, and other matched-ratio data. The strata statement specifies the matching-group variable; in a case-crossover study each individual is its own matching group, so the individual ID is the variable to specify in the strata statement.
(3) In SAS 9.2 and later, the firth option can be used to address this problem.
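A minimal sketch of the firth option (the data set mydata and the variables y, x1, x2 are hypothetical):
proc logistic data=mydata des;
class x1 / param=ref;
model y = x1 x2 / firth; /* Firth's penalized likelihood, helpful under (quasi-)complete separation */
run;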
8. Stratified proportional sampling:
Use the strata statement in proc surveyselect to stratify on a variable; this keeps the proportion of 0s and 1s in the generated CV data set the same as in the source data.
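A sketch of this, stratifying on the target ins from the earlier examples (the output data set cv is illustrative):
proc sort data=develop1;
by ins;
run;
proc surveyselect data=develop1 out=cv method=srs samprate=0.5 seed=27513;
strata ins; /* keeps the 0/1 proportion of ins the same in the sample */
run;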
9. The "test global zero hypothesis" section in the logistic process is the overall test result of the model.
The results of the single-factor analysis of Logistic regression are consistent with those of the chi-square test. Some articles use Logistic regression for single-factor analysis, and some use the chi-square test for single-factor analysis. In fact, the results are the same.
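For example, for a single binary predictor the two analyses lead to the same conclusion (the data set mydata and the variables y and x are hypothetical):
proc freq data=mydata;
tables x*y / chisq; /* Pearson chi-square test */
run;
proc logistic data=mydata des;
class x / param=ref;
model y = x; /* univariate logistic regression */
run;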
10. Applicability of the rank sum test
If two samples come from two independent populations that are not normal or whose distributions are unclear, and it is necessary to test whether the difference between the two samples is significant, the parametric t-test should not be used; the rank sum test is required. A SAS sketch follows the list of conditions below.
Conditions for applying the rank sum test:
① The population distribution form is unknown or the distribution type is unclear;
② Skewed data;
③ Ordinal data that cannot be measured precisely and can only be expressed by severity, grade of quality, or order;
④ Data that do not meet the conditions for parametric tests, e.g. the group variances are clearly unequal;
⑤ Data with indeterminate values at one or both ends, such as "> 50 mg".
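In SAS, the two-sample rank sum (Wilcoxon) test can be run with proc npar1way; a minimal sketch (the data set mydata and the variables group and response are hypothetical):
proc npar1way data=mydata wilcoxon;
class group; /* grouping variable with two levels */
var response; /* variable to compare */
run;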
11. The aggregate, scale=, and rsquare options on the model statement in proc logistic (reference: Statistical Analysis of Medical Cases and SAS Applications, pp. 172-178)
model y = chage rs2 rs3 lc mr / aggregate scale=none rsquare;
/* The aggregate and scale= options output the Pearson chi-square and deviance statistics for goodness-of-fit assessment; rsquare outputs the generalized R2. */
If the p-values of the deviance and Pearson chi-square tests are small, the model does not fit well. If the deviance and Pearson chi-square statistics divided by their degrees of freedom are greater than 1, there may be overdispersion. (That is, the smaller these two statistics and the larger their p-values, the better?)
If the Pearson and deviance values are still greater than 1 after variables with no statistical significance are removed, overdispersion is present. The Pearson chi-square or deviance statistic can be used to adjust for it; here the Pearson chi-square is used: simply change the option to scale=pearson, and the covariance matrix in the results is multiplied by the dispersion factor (the ratio of the Pearson chi-square to its degrees of freedom).
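A sketch of the adjusted model, reusing the model statement above (the data set mydata is hypothetical):
proc logistic data=mydata des;
model y = chage rs2 rs3 lc mr / aggregate scale=pearson rsquare;
run;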
12. Why the ROC curve of the final model is missing from the output:
The outroc= option conflicts with the aggregate and scale= options: the latter change the data structure (the data are aggregated), so the ROC curve of the selected model cannot be drawn from the final output.