5.1 R built-in distributions
A distribution is the core and most important way to describe sample data. R incorporates a number of commonly used statistical distributions and provides four types of functions for each: the probability density function (density), the cumulative distribution function (probability), the quantile function (quantile), and pseudo-random number generation (random). In R these four are denoted by the prefixes d, p, q, and r, followed by the English name or abbreviation of the distribution.
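As a small illustration of the d/p/q/r naming scheme, the sketch below uses the normal distribution (abbreviation norm). The variable names are only illustrative, and the seed is an arbitrary choice to make the random draws repeatable.

```r
# The four function types for the standard normal distribution.
density_at_zero <- dnorm(0)      # d: density f(0)
prob_below_zero <- pnorm(0)      # p: cumulative probability P(X <= 0)
upper_quantile  <- qnorm(0.975)  # q: quantile with 97.5% of the mass below it

set.seed(1)                      # arbitrary seed, only for repeatability
draws <- rnorm(5)                # r: five pseudo-random draws

print(c(density_at_zero, prob_below_zero, upper_quantile))
print(draws)
```

The same pattern applies to other distributions, for example dbinom(), ppois(), qt(), and runif().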
5.2 Analysis of central tendency
5.2.1 Measures of central tendency
The indicators that describe the central tendency of a statistical distribution are mainly the mean, the median, and the mode, also known as average indicators. The main functions of these indicators are to:
reflect the central tendency and general level of the distribution of a variable across the units of the population;
make it easy to compare the level of similar phenomena between different units;
make it convenient to compare the development trend or pattern of similar phenomena in different periods;
support the analysis of dependency relationships between phenomena.
5.2.2 R Language Implementation
The function summary() computes the five-number summary and the mean of a set of data.
> summary(cars$speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0    12.0    15.0    15.4    19.0    25.0
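summary() reports the mean and median but not the mode, and base R has no function that returns the statistical mode of a sample. A minimal sketch, assuming a hypothetical helper named sample_mode() built on table():

```r
# Hypothetical helper (not part of base R): return the most frequent value(s).
sample_mode <- function(x) {
  counts <- table(x)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # keep all values tied for the top count
}

x <- cars$speed
center <- list(mean = mean(x), median = median(x), mode = sample_mode(x))
print(center)
```

If several values share the highest frequency, this helper returns all of them.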
5.3 Analysis of dispersion
5.3.1 Measures of dispersion
The degree of dispersion of the data is mainly measured by statistics such as the range, the interquartile range, the mean deviation, the variance, and the standard deviation. In practical analysis, measures of dispersion mainly serve to:
measure how representative the average indicators are;
reflect how balanced social and economic activities are;
study how far the distribution of the population's values deviates from normality;
provide a basic indicator for statistical analyses such as sampling inference.
5.3.2 R Language Implementation
The range can be computed with the function range(), which returns the minimum and the maximum; subtract one from the other:
> m = range(cars$speed)
> m[2] - m[1]
[1] 21
The interquartile range can also be computed by hand; a convenient way is to use the function fivenum() directly:
> q = fivenum(cars$speed)
> q[4] - q[2]
[1] 7
The variance and standard deviation functions in R are var() and sd(). R also has a dedicated dispersion function, mad(), which computes the median absolute deviation, scaled by default for asymptotically normal consistency.
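The measures listed in 5.3.1 can be collected in one place. The sketch below is only an illustration: the vector name spread is arbitrary, IQR() is the built-in shortcut for the interquartile range, and the mean deviation is written out by hand because base R has no dedicated function for it.

```r
x <- cars$speed

spread <- c(
  range    = diff(range(x)),          # maximum minus minimum
  iqr      = IQR(x),                  # third quartile minus first quartile
  mean_dev = mean(abs(x - mean(x))),  # mean absolute deviation about the mean
  variance = var(x),                  # sample variance (n - 1 denominator)
  sd       = sd(x),                   # sample standard deviation
  mad      = mad(x)                   # median absolute deviation, scaled by 1.4826 by default
)
print(round(spread, 3))
```

Note that IQR() uses quantile() while fivenum() uses hinges, so the two can differ slightly on other data sets; for cars$speed they agree.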
5.4 Analysis of data distribution
5.4.1 Measures of distribution shape
(1) Skewness
(2) Kurtosis
5.4.2 R Language Implementation
The timeDate package (or the fBasics package, loaded directly) provides the functions skewness() and kurtosis(), which compute the skewness and kurtosis coefficients:
> skewness(cars$speed)
[1] -0.1105533
attr(,"method")
[1] "moment"
> kurtosis(cars$speed)
[1] -0.6730924
attr(,"method")
[1] "excess"
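To see what these coefficients measure without installing a package, here is a minimal base R sketch of moment-based skewness and excess kurtosis. The function names moment_skewness() and excess_kurtosis() are hypothetical, and both use population moments (dividing by n), which is an assumption of this sketch; the results may therefore differ slightly from the package output above.

```r
# Hypothetical helpers: moment-based shape coefficients using population moments.
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5     # third central moment over sd cubed
}

excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # fourth central moment over variance squared, minus 3
}

print(moment_skewness(cars$speed))
print(excess_kurtosis(cars$speed))
```

A negative skewness indicates a longer left tail, and a negative excess kurtosis indicates a flatter peak than the normal distribution.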
5.5 Graphical analysis and R implementation
5.5.1 Histogram and density plot
> hist(cars$speed, breaks=50, prob=TRUE)  # breaks controls the histogram's bins; prob=TRUE draws a density histogram
> lines(density(cars$speed), col='blue')  # overlay the kernel density estimate from density()
5.5.2 Q-Q plot
Q-Q plots are used to check visually whether a set of data comes from a given distribution, or whether two sets of data come from the same family of distributions. In teaching and in software, the Q-Q scatter plot is most often used to check whether data come from a normal distribution. In the normal quantile-quantile plot, the horizontal axis shows the theoretical quantiles and the vertical axis the sample values. If the sample approximately follows a normal distribution, the points should lie close to the line y = σx + μ, whose slope is the standard deviation σ of the normal distribution and whose intercept is the mean μ.
> qqnorm(cars$speed)
> qqline(cars$speed)
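The relation between the Q-Q points and the line y = σx + μ can be checked numerically. The sketch below is an illustration only: it takes the plot coordinates from qqnorm() without drawing, fits a least-squares line through them (a choice made here for the comparison), and sets the result beside the sample mean and standard deviation. Note that qqline() itself draws a line through the first and third quartiles by default, so it is not identical to either line.

```r
x <- cars$speed

qq  <- qqnorm(x, plot.it = FALSE)   # qq$x: theoretical quantiles, qq$y: sample values
fit <- lm(qq$y ~ qq$x)              # least-squares line through the Q-Q points
line_coef <- unname(coef(fit))      # c(intercept, slope)

comparison <- rbind(
  fitted_line = line_coef,
  mean_and_sd = c(mean(x), sd(x))
)
colnames(comparison) <- c("intercept", "slope")
print(comparison)
```

For data that are close to normal, the two rows should be similar.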
5.5.3 Stem-and-leaf plot
The function stem() draws stem-and-leaf plots in R:
stem(x, scale=1, width=80, atom=1e-08)
where x is the data vector, scale controls the length of the plot, width controls the width of the output, and atom is the tolerance.
> set.seed(111)
> s = sample(cars$speed, 25)
> stem(s)
The decimal point is 1 digit(s) to the right of the |
0 | 44
0 | 779
1 | 011233344
1 | 5557889
2 | 0344
5.5.4 Box plot
> boxplot(cars$speed)
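The numbers behind the box plot can be inspected without drawing it. A minimal sketch using boxplot.stats(), the function that computes these statistics; the variable name box is arbitrary:

```r
box <- boxplot.stats(cars$speed)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(box$stats)

# out: points beyond the whiskers, which the plot would draw as outliers
print(box$out)
```

For cars$speed the whiskers reach the minimum and maximum, so no points are flagged as outliers.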
5.5.5 Empirical distribution plot
The function ecdf() in R returns the empirical distribution function of the sample, which can be drawn with plot():
ecdf(x)
plot(x, ..., ylab="Fn(x)", verticals=FALSE, col.01line="gray70", pch=19)
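As a brief illustration, ecdf() returns a function, so the empirical distribution can be evaluated at any point before it is plotted. The name Fn below is arbitrary, and the plotting call is left commented out so the sketch also runs without a graphics device.

```r
Fn <- ecdf(cars$speed)     # Fn is itself a function

share_up_to_15 <- Fn(15)   # proportion of observations <= 15
print(share_up_to_15)
print(Fn(c(3, 25)))        # below the minimum and at the maximum

# plot(Fn, verticals = FALSE, pch = 19)   # draw the step function
```

Fn(v) is simply the fraction of sample values less than or equal to v.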
Study notes on "Data Analysis R Language Practice", Chapter 5: Descriptive analysis of data (Part I)