Exploratory data analysis, abbreviated EDA
I. Basic descriptive statistics
1. The summary function
summary() returns the minimum, first quartile, median, mean, third quartile, and maximum of a numeric vector.
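As a minimal sketch of what summary() returns, the following uses a made-up vector named scores (hypothetical data, not from the original text):

```r
# Hypothetical sample data, used only for illustration
scores <- c(12, 45, 7, 88, 23, 61, 39)

# summary() returns a named object with six statistics
s <- summary(scores)
print(s)

# Individual statistics can be picked out by name
s[["Min."]]
s[["Median"]]
s[["Max."]]
```

The names of the result are Min., 1st Qu., Median, Mean, 3rd Qu., and Max.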
2. Quartiles
The quartiles can be obtained with the quantile function, and diff gives the differences between successive quantiles.
> library(RSADBE)
> data(TheWALL)
> quantile(TheWALL$Score)
> diff(quantile(TheWALL$Score))
3. Extreme values
range returns the minimum and maximum values.
4. Range
diff(range()) returns the range, that is, the maximum minus the minimum.
5. Interquartile range
The IQR function returns the interquartile range.
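The functions in this section can be tried together on a small vector. This is only a sketch; the vector x is hypothetical and stands in for a real column such as the score data.

```r
# Hypothetical data, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

q <- quantile(x)   # 0%, 25%, 50%, 75%, 100% quantiles
print(q)
print(diff(q))     # gaps between successive quantiles

r <- range(x)      # c(minimum, maximum)
print(r)
print(diff(r))     # maximum minus minimum

print(IQR(x))      # third quartile minus first quartile
```

With the default quantile method, IQR(x) equals the difference between the 75% and 25% entries of quantile(x).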
II. Stem-and-leaf plots and histograms
1. Stem-and-leaf plot
You can use the stem function from base R and the stem.leaf.backback function from the aplpack package.
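A short sketch of both functions follows. The vectors x and y are hypothetical, and the back-to-back plot is only drawn if the aplpack package happens to be installed.

```r
# Hypothetical data, used only for illustration
x <- c(10, 11, 22, 23, 35, 36, 37, 48, 52, 59)
y <- c(14, 18, 21, 29, 33, 41, 44, 47, 55, 58)

# Stem-and-leaf plot from base R, printed to the console
stem(x)

# Back-to-back stem-and-leaf plot of two samples (requires aplpack)
if (requireNamespace("aplpack", quietly = TRUE)) {
  aplpack::stem.leaf.backback(x, y)
}
```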
2. Histogram
Histograms can be drawn with the hist function or the histogram function (lattice package); we use the galton data as an example.
> data(galton)
> par(mfrow = c(2, 2))
> hist(galton$parent, breaks = "FD", xlab = "Height of Parent", main = "Histogram for Parent Height with Freedman-Diaconis Breaks", xlim = c(60, 75))
> hist(galton$parent, xlab = "Height of Parent", main = "Histogram for Parent Height with Sturges Breaks", xlim = c(60, 75))
> hist(galton$child, breaks = "FD", xlab = "Height of Child", main = "Histogram for Child Height with Freedman-Diaconis Breaks", xlim = c(60, 75))
> hist(galton$child, xlab = "Height of Child", main = "Histogram for Child Height with Sturges Breaks", xlim = c(60, 75))
In addition, hist accepts the following options, most of which also apply to other graphical commands:
col: fill color of the bars
main: plot title
xlab: x-axis title
ylab: y-axis title
xlim: x-axis range
ylim: y-axis range
breaks: sets how the histogram is split into bins
freq: logical option; TRUE plots frequencies (counts), FALSE plots probability densities
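The options above can be combined in one call. This sketch uses the built-in faithful dataset instead of the galton data so that it runs without extra packages; the titles and the color are arbitrary choices.

```r
# Built-in dataset: eruption durations of the Old Faithful geyser
x <- faithful$eruptions

# freq = FALSE puts probability densities on the y axis instead of counts
h <- hist(x,
          breaks = "FD",                 # Freedman-Diaconis bin widths
          freq   = FALSE,
          col    = "grey80",
          main   = "Eruption duration",
          xlab   = "Duration (minutes)",
          ylab   = "Density",
          xlim   = c(1, 6))

# hist() also returns the bin information invisibly
print(h$breaks)
print(h$counts)
```

When freq = FALSE, the bar areas (density times bin width) sum to 1.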
III. Density plots
Continuous random variables are described by their density function. The density() function computes a density estimate of the data; the result is a series of x and y coordinates from which the density curve can be drawn. The format of the function is as follows:
density(data, bw = "", kernel = "", na.rm = FALSE)
Here data must be a numeric vector, bw is the smoothing bandwidth, kernel is the type of smoothing kernel, and na.rm controls whether NA values are removed. By default NA values are not removed, and if the data contain NA the call fails with an error.
The result of density() is a list whose components can be selected with the $ operator.
You can draw a density plot with plot(density()) and add further curves to the plot with the lines() function.
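A minimal sketch of these steps, again on the built-in faithful dataset (an assumption for illustration; any numeric vector works). The second bandwidth selector, "SJ", is just one possible alternative for comparison.

```r
x <- faithful$eruptions

# Density estimate with the default bandwidth rule and a Gaussian kernel
d <- density(x, bw = "nrd0", kernel = "gaussian", na.rm = FALSE)

# The result is a list; components are selected with $
print(names(d))
head(d$x)   # x coordinates of the estimated curve
head(d$y)   # estimated density at those x values

# Draw the curve, then overlay a second estimate with another bandwidth
plot(d, main = "Density of eruption duration")
d2 <- density(x, bw = "SJ")
lines(d2, lty = 2)
```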
IV. Summarizing data
1. Summary statistics of vectors
max
min
length
sum
mean
median
sd
var
mad: returns the median absolute deviation
summary: returns the minimum, quartiles, median, mean, and maximum
quantile: returns quantiles; the default is the quartiles, which can be changed with the probs argument
fivenum: returns the minimum, lower hinge, median, upper hinge, and maximum
cumsum: cumulative sum
cummax: cumulative maximum
cummin: cumulative minimum
cumprod: cumulative product
If the vector contains NA, most of these functions return NA. Functions that accept the option, such as sum, mean, and max, ignore NA values when na.rm = TRUE is set. The length function has no na.rm option, so remove the NA values first with na.omit(), as in length(na.omit(data)).
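The NA behavior described above can be checked on a small vector. The data here are hypothetical.

```r
# Hypothetical vector with one missing value
v <- c(3, NA, 5, 8)

mean(v)                 # NA, because of the missing value
mean(v, na.rm = TRUE)   # mean of 3, 5, 8
max(v, na.rm = TRUE)    # 8

length(v)               # 4: length counts the NA as well
length(na.omit(v))      # 3: NA dropped before counting

# Cumulative functions on a complete vector
w <- c(1, 2, 3, 4)
cumsum(w)               # 1 3 6 10
cumprod(w)              # 1 2 6 24
```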
2. Summary statistics of data frames
max
min
sum
fivenum
length: returns the number of columns in the data frame
summary: returns descriptive statistics for each column
rowMeans
rowSums
colMeans
colSums
apply: applies a function such as those above by row or column, in the form apply(X, MARGIN, FUN, ...), where MARGIN is 1 for rows or 2 for columns and FUN is the function to compute; pass na.rm = TRUE to ignore NA values
prop.table(data, margin): returns each value as a proportion; by default of the grand total, with margin = 1 of its row total, and with margin = 2 of its column total
addmargins(data, margin, FUN): appends the row or column margins computed by FUN (sum by default); similar to prop.table, except that it returns the actual calculated values rather than proportions
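A small sketch of these commands on a hypothetical data frame. As an assumption for safety, the data frame is converted with as.matrix() before calling prop.table and addmargins, which are documented for tables and arrays.

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

length(df)          # 2: the number of columns
colMeans(df)        # a = 2, b = 5
rowSums(df)         # 5 7 9
apply(df, 2, max)   # column maxima: a = 3, b = 6

# Matrix copy with row names for the margin functions
m <- as.matrix(df)
rownames(m) <- c("r1", "r2", "r3")

p <- prop.table(m, margin = 1)              # each row sums to 1
print(p)

tot <- addmargins(m, margin = 1, FUN = sum) # appends a row of column sums
print(tot)
```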
3. Summary statistics of matrices
max
min
sum
fivenum
length: returns the number of cells in the matrix
summary: returns descriptive statistics for each column
mean(data[, 2]): calculates the mean of the second column
rowMeans
rowSums
colMeans
colSums
apply
prop.table(data, margin): returns each value as a proportion; by default of the grand total, with margin = 1 of its row total, and with margin = 2 of its column total
A matrix is similar to a data frame, but a matrix is a single object whose columns cannot be selected with $, so functions such as sum are computed over the whole matrix. Rows and columns are selected with []. Otherwise the functions are used in much the same way as with data frames.
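The difference from data frames can be seen on a small matrix. The matrix m and its dimension names are hypothetical.

```r
# Hypothetical 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))

sum(m)          # 21: computed over the whole matrix
length(m)       # 6: the number of cells
mean(m[, 2])    # 3.5: mean of the second column (3 and 4)
colSums(m)      # x = 3, y = 7, z = 11
rowMeans(m)     # r1 = 3, r2 = 4

# $ does not work on a matrix; use [] instead
dollar_fails <- inherits(try(m$x, silent = TRUE), "try-error")
print(dollar_fails)
m[, "x"]        # select column x by name

p <- prop.table(m, margin = 2)   # each column sums to 1
print(p)
```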
4. Summary statistics of lists
max(data$var)
min(data$var)
sum(data$var)
fivenum(data$var)
length: returns the number of components in the list
summary: returns the length, class, and mode of each component
mean(data[[2]]): calculates the mean of the second component
lapply: returns the result as a list
sapply: returns the result simplified to a vector or matrix where possible
Overall, the summary functions are used on lists much as on the other data structures. The difference is that each call selects a component with $, and the apply() function cannot be used on a list; use its variants lapply and sapply instead, which differ only in the data type of the result.
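A brief sketch of lapply and sapply on a hypothetical list with two numeric components:

```r
# Hypothetical list, used only for illustration
lst <- list(a = c(1, 2, 3), b = c(10, 20, 30, 40))

max(lst$a)      # 3: a component is selected with $
length(lst)     # 2: the number of components

res_list <- lapply(lst, mean)   # a list: a = 2, b = 25
res_vec  <- sapply(lst, mean)   # a named numeric vector
res_mat  <- sapply(lst, range)  # a 2 x 2 matrix, one column per component

print(res_list)
print(res_vec)
print(res_mat)
```

sapply returns a matrix here only because range returns two values for every component; with results of unequal length it falls back to a list.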
5. Summary statistics of tables
max
min
sum
fivenum
length: returns the number of cells in the table
summary: returns the number of cases and a chi-squared test of independence of the factors
rowMeans
rowSums
colMeans
colSums
apply
prop.table(data, margin): returns each value as a proportion; by default of the grand total, with margin = 1 of its row total, and with margin = 2 of its column total
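The same functions can be tried on a two-way table built with table(). The vectors sex and grp are hypothetical.

```r
# Hypothetical categorical data, used only for illustration
sex <- c("F", "M", "F", "M", "F", "M")
grp <- c("a", "a", "b", "b", "b", "a")

tab <- table(sex, grp)   # 2 x 2 table of counts
print(tab)

sum(tab)        # 6: the total number of observations
length(tab)     # 4: the number of cells
max(tab)        # 2: the largest cell count
rowSums(tab)    # counts per sex
colSums(tab)    # counts per group

p <- prop.table(tab, margin = 1)   # row proportions
print(p)
```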