SAS Learning Notes

Last Update:2018-07-23 Source: Internet

Author: User

I have recently been converting SAS code. I had learned SAS before but had forgotten much of it over time, so I reviewed it again and kept these brief notes so that things are easy to find later.

Common syntax structure of SAS

Basic types of SAS variables

First: numeric

Second: character. The variable name must be followed by the $ descriptor.

Other kinds of data, such as dates and times, are stored as numeric values. A numeric variable can hold integers, fixed-point reals, floating-point reals, and so on, and generally uses 8 bytes. The default length of a character variable is 8 characters, and it can be changed with the LENGTH statement:
length character-variable-name $ length;

Two major kinds of steps in data processing: the DATA step and the PROC (procedure) step

A DATA step starts with a DATA statement and ends with a RUN statement; a PROC step starts with a PROC statement and ends with a RUN statement.

libname libref 'folder location' options;

Suppose there is a folder "user" on the C drive, and this folder contains a SAS dataset named aa.
It can then be printed like this:
libname a 'c:\user';
proc print data=a.aa;
run;

SAS libraries are either temporary or permanent. There is only one temporary library, named WORK; there can be multiple permanent libraries.
Two special librefs:
SASUSER: a permanent library
WORK: the temporary library

Each dataset has a two-level name: the first level is the libref and the second level is the dataset name.
The format is: libref.dataset-name

The DATA statement is used to create and process datasets. It has two functions:
First: it marks the start of a DATA step
Second: it names the SAS dataset to be created
Format: data dataset-name;

The INFILE statement is used to read data from an external file and must appear before the INPUT statement. Its main function:
First: it identifies the external text file that contains the raw data
Format: infile 'location and name of external file' options;

The CARDS statement is used to enter data directly in the program; it marks the beginning of a block of data lines.
cards;
data lines
;

The INPUT statement tells the system how to read each record. Its main functions:
It reads the data columns specified by the statement
It defines variable names for the corresponding data fields
It determines how each variable is read

Format:
input variable-name [variable type] [column range] ...;

There are two ways to input data: direct input (CARDS) and external input (INFILE)

SAS PROC steps
A PROC step always starts with PROC, followed by the name of the procedure.
Common procedures are as follows:
SORT sorts the specified dataset by the specified variables
PRINT lists the data in a dataset
MEANS produces simple descriptive statistics for the specified numeric variables
FREQ produces simple descriptive statistics (frequency tables) for the specified categorical variables
TTEST performs a t test on the specified variables
ANOVA performs analysis of variance on the specified variables
NPAR1WAY performs nonparametric tests on the specified variables
REG performs regression analysis on the specified variables
CORR performs correlation analysis on the specified variables
CHART draws low-resolution statistical charts
SQL runs SQL statements

Format:
proc procedure-name [data=dataset-name] [options];
statements specific to the procedure;
[var variable-list;]
[where condition;]
[by variable-list;]
run;

Common random number functions in SAS

Function	Description
UNIFORM(seed)	Generates a uniformly distributed random number on (0,1), using a multiplicative congruential generator
RANUNI(seed)	Generates a uniformly distributed random number on (0,1), using a prime modulus generator
NORMAL(seed)	Generates a standard normal random number, using an approximation based on the central limit theorem
RANNOR(seed)	Generates a standard normal random number, using a transformation method
RANEXP(seed)	Generates an exponentially distributed random number with λ=1
RANGAM(seed,alpha)	Generates a gamma distributed random number; alpha>0, seed is any value
RANTRI(seed,h)	Generates a triangular distributed random number; 0<h<1
RANCAU(seed)	Generates a standard Cauchy random number
RANBIN(seed,n,p)	Generates a binomial random number; n>0 is an integer, 0<p<1, seed is any value
RANPOI(seed,lambda)	Generates a Poisson random number; lambda>0, seed is any value
RANTBL(seed,p1,...,pn)	Generates a discrete random number; 0≤pi≤1, seed is any value
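
To make the phrase "prime modulus generator" in the table more concrete, here is a rough Python sketch of a generic multiplicative congruential (Lehmer) generator. The modulus 2^31-1 and the multiplier 16807 are the textbook "minimal standard" parameters and are used here only as an assumption for illustration; they are not claimed to be the constants SAS uses, and `lehmer_uniform` is a hypothetical helper name.

```python
M = 2**31 - 1   # prime modulus (a Mersenne prime)
A = 16807       # illustrative multiplier; an assumption, not SAS's constant


def lehmer_uniform(seed, count):
    """Return `count` pseudo-random numbers in (0, 1) from a Lehmer generator."""
    state = seed % M
    if state == 0:
        raise ValueError("seed must not be a multiple of the modulus")
    values = []
    for _ in range(count):
        state = (A * state) % M      # x_{k+1} = A * x_k mod M
        values.append(state / M)     # scale the integer state into (0, 1)
    return values


sample = lehmer_uniform(seed=1, count=3)
print(sample)
```

The same seed always reproduces the same sequence, which is also why the SAS functions above take a seed argument.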

Branch statements

if condition then statement;

if condition then do; statement 1; ... statement N; end;

if condition then statement; else statement;

Multi-branch structure:

select (select-expression);

when (value list) statement;

...

otherwise statement;

end;

select;

when (condition) statement;

...

otherwise statement;

end;

Common statements and procedures in SAS PROC steps

TITLE specifies a title

VAR specifies the analysis variables

CLASS specifies the classification variables

MODEL specifies the form of the model

MEANS produces simple statistics

PLOT draws a scatter plot

PRINT lists the contents of a dataset

SORT sorts by variable values

BY groups by the specified variables

OUTPUT stores the specified results in a dataset

FREQ specifies a frequency variable

WHERE selects a subset of the dataset

LABEL assigns temporary labels

Here are several common forms of the RETAIN statement:

retain;

retain t1 t2 t3;

retain t1 t2 t3 100;

retain t1 t2 t3 (100);

retain t1 t2 t3 (100 99 98);

The first form means that all variables created by INPUT or assignment statements keep their values from the current iteration of the DATA step to the next. The second form names the variables, variable lists, or array names whose values should be retained. The third form gives the variables t1, t2, and t3 the same initial value, 100. The fourth form encloses the initial value 100 in parentheses; SAS assigns the value in parentheses to the first variable in the list only, so t1=100 while t2 and t3 start as missing values. The fifth form gives a list of initial values, which are assigned to the variables in order: t1=100, t2=99, t3=98.

IN= option

Format: sas-data-set (in=variable)

The variable is a temporary numeric variable whose value is 0 or 1.

When more than one SAS dataset is read, the IN= option identifies which dataset the current observation came from.

variable=0 means the observation is not from this dataset.

variable=1 means the observation is from this dataset.

SAS cluster analysis
proc cluster <data=dataset-name> method=name <options>; /* required statement */
var
copy
rmsstd
id
by
freq
The statements listed after PROC CLUSTER are optional.

IF statement
data _null_;
x=1; y=2; /* illustrative values */
if x>y then
put "x>y";
else
if x<y then
put "x<y";
else
put "x=y";
run;

Loop statements
do index-variable = start to stop by step;
...;
end;

do while (loop condition);
...;
end;

do until (loop condition);
...;
end;

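Comparison operators and their mnemonic equivalents:
= EQ
< LT
> GT
<= LE
>= GE
^= NE (in DATA step expressions <> is the MAX operator; it means "not equal" only in WHERE expressions and PROC SQL)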

The SET statement reads observations from the named dataset
Format: set dataset-name;

Dropping or keeping variables
data car;
set sasuser.car;
drop id;
run;

data car;
set sasuser.car;
keep q1 b1 b2;
run;

Data splitting

data sasuser.car_low sasuser.car_high;
set sasuser.car;
select;
when (b1<=2) output sasuser.car_low;
otherwise output sasuser.car_high;
end;
run;

There are two types of data combination: vertical (concatenation) and horizontal (merging)
data sasuser.car_total;
set sasuser.car_low sasuser.car_high;
run;

proc sort data=sasuser.student_profile;
by id;
run;
proc sort data=sasuser.student_score;
by id;
run;
data sasuser.student_total;
merge sasuser.student_profile sasuser.student_score;
by id;
run;

proc print; run;

Data cleaning
Generally done with PROC SQL
Data correction
Data standardization
Data scale transformation, for example converting raw data from a ten-point scale to a hundred-point scale
Data updates with the UPDATE statement

For PROC RANK, the main thing is to master its options. The overall syntax of the procedure is as follows:

proc rank <options>;
var variables;
ranks new-variable-names;
by grouping-variables;
run;

That is the overall syntax. For example, to rank the height variable in sashelp.class:

proc rank data=sashelp.class out=result;
var height;
run;
Here VAR specifies the variable to rank. If you run this program, however, you will find a problem: the original height values are replaced by the ranks. To keep the original height values, use the RANKS statement:
proc rank data=sashelp.class out=result;
var height;
ranks r_height;
run;
Now the original height variable is left unchanged and a rank variable r_height is created; this is the role of RANKS.
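
What a ranking step computes can also be sketched outside SAS. The Python function below (a hypothetical helper, `rank_mean_ties`) assigns ascending ranks and gives tied values the mean of the ranks they occupy, which to my knowledge matches the default tie handling of PROC RANK. The sample heights are made up for illustration.

```python
def rank_mean_ties(values):
    """Ascending ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend `end` over the run of values equal to the one at `pos`
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2   # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


heights = [62.8, 69.0, 56.5, 69.0]   # made-up sample values
r_height = rank_mean_ties(heights)   # kept in a new list, like the RANKS variable
print(r_height)
```

The original list is left untouched and the ranks go into a separate list, mirroring the way RANKS keeps height and adds r_height.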

Example:

proc rank descending out=oe11;

var compsit;

ranks rankcompsit;

proc sort; by rankcompsit;

proc print data=oe11;

run;

Mean
data _null_;
x=mean(89,90,78,98,87,76,69,90,92,88);
put 'mean = ' x;
run;

Two related concepts in mean statistics are the trimmed mean and the Winsorized mean.
The trimmed mean is the average after removing the n largest and the n smallest values.
For the Winsorized mean, the n smallest values are replaced by the (n+1)th smallest value and the n largest values are replaced by the (n+1)th largest value, and then the mean is calculated.
Both means can be calculated with the UNIVARIATE procedure:
data null;
input score @@;
cards;
89 90 78 98 87 76 69 90 92 88
;
proc univariate data=null trimmed=2 winsorized=2;
var score;
run;
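
As a cross-check on the two definitions above, the following Python sketch computes the trimmed (truncated) mean and the Winsorized mean by hand for the same ten scores with n=2. The function names are hypothetical, and the sketch only follows the verbal definitions given here; it does not reproduce the confidence limits and tests that PROC UNIVARIATE also reports.

```python
def trimmed_mean(values, n):
    """Mean after dropping the n smallest and n largest values."""
    data = sorted(values)
    kept = data[n:len(data) - n]
    return sum(kept) / len(kept)


def winsorized_mean(values, n):
    """Mean after replacing the n smallest values with the (n+1)th smallest
    and the n largest values with the (n+1)th largest."""
    data = sorted(values)
    low, high = data[n], data[len(data) - n - 1]
    clipped = [min(max(x, low), high) for x in data]
    return sum(clipped) / len(clipped)


scores = [89, 90, 78, 98, 87, 76, 69, 90, 92, 88]
print(trimmed_mean(scores, 2))      # mean of 78 87 88 89 90 90
print(winsorized_mean(scores, 2))   # 69 and 76 become 78; 92 and 98 become 90
```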

Median
If the number of values is odd, the median is the value at the exact center of the sorted data; if the number of values is even, the median is the average of the two values at the center.
data _null_;
x=median(89,90,78,98,87,76,69,90,92,88);
put 'median = ' x;
run;
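
The odd/even rule can be written out directly. This is a minimal Python sketch of the definition (the helper name `median_of` is hypothetical), applied to the same ten scores; with an even count it averages the two central values, 88 and 89.

```python
def median_of(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    data = sorted(values)
    mid = len(data) // 2
    if len(data) % 2 == 1:
        return data[mid]                      # odd count: the central value
    return (data[mid - 1] + data[mid]) / 2    # even count: average the two central values


scores = [89, 90, 78, 98, 87, 76, 69, 90, 92, 88]
print(median_of(scores))
```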

Quantiles
Quantiles divide the sorted data into equal parts. In practice, quartiles are the most widely used.
In SAS, Q1 denotes the 25th percentile and Q3 the 75th percentile.
data null;
input score @@;
cards;
89 90 78 98 87 76 69 90 92 88
;
proc means data=null q3 q1;
var score;
run;
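
Quartile values depend on which percentile definition is used. The Python sketch below assumes the definition that I understand to be the SAS default (QNTLDEF=5, the empirical distribution function with averaging): write n*p = j + g; if g > 0 take the (j+1)th sorted value, and if g = 0 average the jth and (j+1)th. The helper name `percentile_def5` is hypothetical, and other software may use a different rule and give slightly different quartiles.

```python
def percentile_def5(values, pct):
    """Percentile by the 'empirical distribution with averaging' rule.

    pct is an integer percentage strictly between 0 and 100 (25 for Q1).
    """
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    j, remainder = divmod(n * pct, 100)   # n*p = j + g, kept in integer arithmetic
    if remainder > 0:
        return data[j]                    # g > 0: the (j+1)th value (0-based index j)
    return (data[j - 1] + data[j]) / 2    # g = 0: average the jth and (j+1)th values


scores = [89, 90, 78, 98, 87, 76, 69, 90, 92, 88]
q1 = percentile_def5(scores, 25)
q3 = percentile_def5(scores, 75)
print(q1, q3)
```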

Mode
The mode is the value that occurs most often in the data.
data null;
input score @@;
cards;
190 188 188 185 183 183 180 180
180 180 177 175 175 174 173
;
proc univariate data=null modes;
var score;
run;
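
Counting occurrences is all the mode needs. A small Python sketch under that definition (the helper name `modes_of` is hypothetical) returns every value that reaches the highest count, so data with several modes yields more than one result.

```python
from collections import Counter


def modes_of(values):
    """All values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


scores = [190, 188, 188, 185, 183, 183, 180, 180,
          180, 180, 177, 175, 175, 174, 173]
print(modes_of(scores))
```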

Degree of dispersion
Measures of central tendency summarize the data and give a first impression of it, but these indicators are highly abstract and ignore information in the data. In some cases they show only a misleading surface of the data and hide its real meaning.
The measures of dispersion are the range, the mean deviation, the interquartile range, the variance, the standard deviation, the standard error, and the coefficient of variation.

Range
The range is the maximum value of the data minus the minimum value.
data _null_;
x=range(89,90,78,98,87,76,69,90,92,88);
put 'range = ' x;
run;

Interquartile range
The interquartile range is the third quartile minus the first quartile.
The smaller the value, the more concentrated the middle part of the data.
data null;
input score @@;
cards;
190 188 188 185 183 183 180 180
180 180 178 177 175 175 174 173
;
proc means data=null qrange;
var score;
run;

Variance

The variance is calculated from all of the data, so it reflects the average degree to which all values deviate from the center of the data. The arithmetic square root of the variance is the standard deviation.
data _null_;

x=var(89,90,78,98,87,76,69,90,92,88);

y=std(89,90,78,98,87,76,69,90,92,88);

put 'variance = ' x ' standard deviation = ' y;

run;
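
For reference, the sample variance divides the sum of squared deviations by n-1, which is the divisor I understand the VAR and STD functions to use. A minimal Python sketch of that formula for the same ten scores (the helper name `sample_variance` is hypothetical):

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


scores = [89, 90, 78, 98, 87, 76, 69, 90, 92, 88]
variance = sample_variance(scores)
std_dev = math.sqrt(variance)   # standard deviation = square root of the variance
print(round(variance, 4), round(std_dev, 4))
```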

Standard error
The standard error is the standard deviation of the sample mean. Because sampling is random, repeating the same sampling method at different times, in different places, and under other differing conditions may yield different sample data.

The standard error is not the actual error of the observations, nor an error range; it is only an estimate of the reliability of a set of observations. The smaller the standard error, the more reliable the observations; the larger it is, the less reliable. Many applications use the standard error to assess the measurement precision of data.

data _null_;

x=stderr(89,90,78,98,87,76,69,90,92,88);

put 'standard error = ' x;

run;
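
The standard error of the mean is the sample standard deviation divided by the square root of the number of observations. The short Python sketch below applies that formula to the same scores using the standard library; it is a cross-check of the arithmetic rather than a description of how SAS computes it internally.

```python
import math
import statistics

scores = [89, 90, 78, 98, 87, 76, 69, 90, 92, 88]

# statistics.stdev is the sample standard deviation (n - 1 divisor)
std_error = statistics.stdev(scores) / math.sqrt(len(scores))
print(round(std_error, 4))
```

Because the divisor grows with the square root of n, quadrupling the sample size only halves the standard error.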

Coefficient of variation

The coefficient of variation is an important index of relative dispersion. It is the ratio of the standard deviation of a set of data to its mean; the smaller the coefficient of variation, the smaller the dispersion of the data.

As an example, compare the variation in height with the variation in weight. The standard deviations cannot be compared directly, because a standard deviation carries units: height is measured in centimeters and weight in kilograms, and comparing statistics in different units is not meaningful. The coefficient of variation is therefore used to remove the influence of the dimension, that is, the measurement unit.

data _null_;

height=cv(157,170,161,184,184,168,166,158,174,166,173,189,188,163,161,189,183,186,188,155);

weight=cv(61,71,74,59,71,55,73,66,50,57,48,68,73,52,60,56,53,67,73,64);

put 'CV of height = ' height ' CV of weight = ' weight;

if height>weight then

put 'height varies more than weight';

else

if height<weight then

put 'height varies less than weight';

else

put 'height and weight vary equally';

run;
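
The same comparison can be reproduced by hand. This Python sketch assumes the coefficient of variation is reported as a percentage, 100 × standard deviation / mean, with the sample (n-1) standard deviation, which is how I understand the CV function to behave; `cv_percent` is a hypothetical helper name.

```python
import statistics


def cv_percent(values):
    """Coefficient of variation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


height = [157, 170, 161, 184, 184, 168, 166, 158, 174, 166,
          173, 189, 188, 163, 161, 189, 183, 186, 188, 155]
weight = [61, 71, 74, 59, 71, 55, 73, 66, 50, 57,
          48, 68, 73, 52, 60, 56, 53, 67, 73, 64]

cv_height = cv_percent(height)
cv_weight = cv_percent(weight)
print(round(cv_height, 2), round(cv_weight, 2))

# The ratio is unit-free, so the two values can be compared directly.
if cv_height < cv_weight:
    print("height varies less than weight")
else:
    print("height varies at least as much as weight")
```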

Distribution shape

Central tendency and degree of dispersion are two important aspects of a general analysis of data, but they are not the only ones. It is like evaluating a person: you look not only at height and build, but also at whether the person carries themselves well when standing and sitting. In the same way, the shape of the data distribution should also be examined in order to grasp the full picture of the data.

Measures of distribution shape examine how skewed and how flat the distribution is and whether it is symmetrical. The two main indicators are skewness and kurtosis.

Skewness

Skewness is a measure of the symmetry of the data distribution. It is usually calculated from the third central moment, that is, the ratio of the third central moment to the cube of the standard deviation.

If the data is symmetric, the skewness equals 0. If the skewness is clearly different from 0, the distribution is asymmetric. Specifically, when the skewness is greater than 0, the data to the right of the mean is more spread out and the distribution is skewed to the right; when it is less than 0, the data to the left of the mean is more spread out and the distribution is skewed to the left.

data _null_;
x=skewness(190,188,188,185,183,183,180,180,180,180,178,177,175,175,174,173);
put 'skewness = ' x;
run;
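
A hand computation makes the sign convention easier to see. The Python sketch below assumes the bias-adjusted sample formula n / ((n-1)(n-2)) × Σ((x - mean) / s)³, which is the form I believe the SKEWNESS function uses; other packages may divide by n instead and give slightly different values. The helper name `sample_skewness` is hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness (requires at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cubes = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes


heights = [190, 188, 188, 185, 183, 183, 180, 180,
           180, 180, 178, 177, 175, 175, 174, 173]
print(round(sample_skewness(heights), 4))   # positive: longer tail on the right
print(sample_skewness([1, 2, 3, 4, 5]))     # symmetric data gives 0
```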

Kurtosis

Kurtosis is an indicator of how peaked or flat the top of a data distribution curve is. The reference for "peaked" or "flat" is the standard normal distribution.

data _null_;
x=kurtosis(190,188,188,185,183,183,180,180,180,180,178,177,175,175,174,173);
put 'kurtosis = ' x;
run;
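
Because the reference is the normal distribution, kurtosis is commonly reported as excess kurtosis, where a normal distribution scores 0. The Python sketch below assumes the bias-adjusted sample formula n(n+1) / ((n-1)(n-2)(n-3)) × Σ((x - mean) / s)⁴ − 3(n-1)² / ((n-2)(n-3)), which is the form I believe the KURTOSIS function uses; `sample_kurtosis` is a hypothetical helper name.

```python
import math


def sample_kurtosis(values):
    """Bias-adjusted excess kurtosis (requires at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


heights = [190, 188, 188, 185, 183, 183, 180, 180,
           180, 180, 178, 177, 175, 175, 174, 173]
print(round(sample_kurtosis(heights), 4))   # negative: flatter than a normal curve
```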

SAS uses the FREQ, MEANS, and UNIVARIATE procedures for descriptive statistical analysis.

proc freq <options>;

by variable / variable-list;

exact statistic-options </ computation-options>;

Cond..................

This article is an English version of an article originally published in Chinese on aliyun.com.
